Full Code of salesforce/LAVIS for AI

1383 files

52.6 MB

8.5M tokens

4547 symbols

Repository: salesforce/LAVIS
Branch: main
Commit: 506965b9c4a1
Files: 1383
Total size: 52.6 MB

Directory structure:
gitextract__ei5npya/

├── .github/
│   └── workflows/
│       └── docs.yaml
├── .gitignore
├── .pre-commit-config.yaml
├── CODEOWNERS
├── CODE_OF_CONDUCT.md
├── LICENSE.txt
├── MANIFEST.in
├── README.md
├── SECURITY.md
├── app/
│   ├── __init__.py
│   ├── calculate_coco_features.py
│   ├── caption.py
│   ├── classification.py
│   ├── dataset_browser.py
│   ├── image_text_match.py
│   ├── main.py
│   ├── multimodal_search.py
│   ├── multipage.py
│   ├── text_localization.py
│   ├── utils.py
│   └── vqa.py
├── dataset_card/
│   ├── avsd_dialogue.md
│   ├── coco_caption.md
│   ├── coco_retrieval.md
│   ├── conceptual_captions.md
│   ├── didemo_retrieval.md
│   ├── flickr_retrieval.md
│   ├── gqa.md
│   ├── msrvtt_qa.md
│   ├── msrvtt_retrieval.md
│   ├── msvd_qa.md
│   ├── nlvr2.md
│   ├── nocaps.md
│   ├── sbu_caption.md
│   ├── snli_visual_entailment.md
│   └── vqav2.md
├── docs/
│   ├── Makefile
│   ├── benchmark.rst
│   ├── build_docs.sh
│   ├── conf.py
│   ├── getting_started.rst
│   ├── index.rst
│   ├── intro.rst
│   ├── make.bat
│   ├── requirements.txt
│   ├── tutorial.configs.rst
│   ├── tutorial.datasets.rst
│   ├── tutorial.evaluation.rst
│   ├── tutorial.models.rst
│   ├── tutorial.processors.rst
│   ├── tutorial.rst
│   ├── tutorial.tasks.rst
│   └── tutorial.training-example.rst
├── evaluate.py
├── examples/
│   ├── albef_feature_extraction.ipynb
│   ├── albef_vqa.ipynb
│   ├── albef_zero_shot_classification.ipynb
│   ├── blip2_feature_extraction.ipynb
│   ├── blip2_image_text_matching.ipynb
│   ├── blip2_instructed_generation.ipynb
│   ├── blip_feature_extraction.ipynb
│   ├── blip_image_captioning.ipynb
│   ├── blip_image_text_matching.ipynb
│   ├── blip_text_localization.ipynb
│   ├── blip_vqa.ipynb
│   ├── blip_zero_shot_classification.ipynb
│   ├── clip_feature_extraction.ipynb
│   └── clip_zero_shot_classification.ipynb
├── lavis/
│   ├── __init__.py
│   ├── common/
│   │   ├── annotator/
│   │   │   ├── canny/
│   │   │   │   └── __init__.py
│   │   │   ├── ckpts/
│   │   │   │   └── download.sh
│   │   │   ├── hed/
│   │   │   │   └── __init__.py
│   │   │   ├── midas/
│   │   │   │   ├── __init__.py
│   │   │   │   ├── api.py
│   │   │   │   ├── midas/
│   │   │   │   │   ├── __init__.py
│   │   │   │   │   ├── base_model.py
│   │   │   │   │   ├── blocks.py
│   │   │   │   │   ├── dpt_depth.py
│   │   │   │   │   ├── midas_net.py
│   │   │   │   │   ├── midas_net_custom.py
│   │   │   │   │   ├── transforms.py
│   │   │   │   │   └── vit.py
│   │   │   │   └── utils.py
│   │   │   ├── mlsd/
│   │   │   │   ├── __init__.py
│   │   │   │   ├── models/
│   │   │   │   │   ├── mbv2_mlsd_large.py
│   │   │   │   │   └── mbv2_mlsd_tiny.py
│   │   │   │   └── utils.py
│   │   │   ├── openpose/
│   │   │   │   ├── __init__.py
│   │   │   │   ├── body.py
│   │   │   │   ├── hand.py
│   │   │   │   ├── model.py
│   │   │   │   └── util.py
│   │   │   ├── uniformer/
│   │   │   │   ├── __init__.py
│   │   │   │   ├── configs/
│   │   │   │   │   └── _base_/
│   │   │   │   │       ├── datasets/
│   │   │   │   │       │   ├── ade20k.py
│   │   │   │   │       │   ├── chase_db1.py
│   │   │   │   │       │   ├── cityscapes.py
│   │   │   │   │       │   ├── cityscapes_769x769.py
│   │   │   │   │       │   ├── drive.py
│   │   │   │   │       │   ├── hrf.py
│   │   │   │   │       │   ├── pascal_context.py
│   │   │   │   │       │   ├── pascal_context_59.py
│   │   │   │   │       │   ├── pascal_voc12.py
│   │   │   │   │       │   ├── pascal_voc12_aug.py
│   │   │   │   │       │   └── stare.py
│   │   │   │   │       ├── default_runtime.py
│   │   │   │   │       ├── models/
│   │   │   │   │       │   ├── ann_r50-d8.py
│   │   │   │   │       │   ├── apcnet_r50-d8.py
│   │   │   │   │       │   ├── ccnet_r50-d8.py
│   │   │   │   │       │   ├── cgnet.py
│   │   │   │   │       │   ├── danet_r50-d8.py
│   │   │   │   │       │   ├── deeplabv3_r50-d8.py
│   │   │   │   │       │   ├── deeplabv3_unet_s5-d16.py
│   │   │   │   │       │   ├── deeplabv3plus_r50-d8.py
│   │   │   │   │       │   ├── dmnet_r50-d8.py
│   │   │   │   │       │   ├── dnl_r50-d8.py
│   │   │   │   │       │   ├── emanet_r50-d8.py
│   │   │   │   │       │   ├── encnet_r50-d8.py
│   │   │   │   │       │   ├── fast_scnn.py
│   │   │   │   │       │   ├── fcn_hr18.py
│   │   │   │   │       │   ├── fcn_r50-d8.py
│   │   │   │   │       │   ├── fcn_unet_s5-d16.py
│   │   │   │   │       │   ├── fpn_r50.py
│   │   │   │   │       │   ├── fpn_uniformer.py
│   │   │   │   │       │   ├── gcnet_r50-d8.py
│   │   │   │   │       │   ├── lraspp_m-v3-d8.py
│   │   │   │   │       │   ├── nonlocal_r50-d8.py
│   │   │   │   │       │   ├── ocrnet_hr18.py
│   │   │   │   │       │   ├── ocrnet_r50-d8.py
│   │   │   │   │       │   ├── pointrend_r50.py
│   │   │   │   │       │   ├── psanet_r50-d8.py
│   │   │   │   │       │   ├── pspnet_r50-d8.py
│   │   │   │   │       │   ├── pspnet_unet_s5-d16.py
│   │   │   │   │       │   ├── upernet_r50.py
│   │   │   │   │       │   └── upernet_uniformer.py
│   │   │   │   │       └── schedules/
│   │   │   │   │           ├── schedule_160k.py
│   │   │   │   │           ├── schedule_20k.py
│   │   │   │   │           ├── schedule_40k.py
│   │   │   │   │           └── schedule_80k.py
│   │   │   │   ├── exp/
│   │   │   │   │   └── upernet_global_small/
│   │   │   │   │       ├── config.py
│   │   │   │   │       ├── run.sh
│   │   │   │   │       ├── test.sh
│   │   │   │   │       ├── test_config_g.py
│   │   │   │   │       ├── test_config_h32.py
│   │   │   │   │       └── test_config_w32.py
│   │   │   │   ├── mmcv/
│   │   │   │   │   ├── __init__.py
│   │   │   │   │   ├── arraymisc/
│   │   │   │   │   │   ├── __init__.py
│   │   │   │   │   │   └── quantization.py
│   │   │   │   │   ├── cnn/
│   │   │   │   │   │   ├── __init__.py
│   │   │   │   │   │   ├── alexnet.py
│   │   │   │   │   │   ├── bricks/
│   │   │   │   │   │   │   ├── __init__.py
│   │   │   │   │   │   │   ├── activation.py
│   │   │   │   │   │   │   ├── context_block.py
│   │   │   │   │   │   │   ├── conv.py
│   │   │   │   │   │   │   ├── conv2d_adaptive_padding.py
│   │   │   │   │   │   │   ├── conv_module.py
│   │   │   │   │   │   │   ├── conv_ws.py
│   │   │   │   │   │   │   ├── depthwise_separable_conv_module.py
│   │   │   │   │   │   │   ├── drop.py
│   │   │   │   │   │   │   ├── generalized_attention.py
│   │   │   │   │   │   │   ├── hsigmoid.py
│   │   │   │   │   │   │   ├── hswish.py
│   │   │   │   │   │   │   ├── non_local.py
│   │   │   │   │   │   │   ├── norm.py
│   │   │   │   │   │   │   ├── padding.py
│   │   │   │   │   │   │   ├── plugin.py
│   │   │   │   │   │   │   ├── registry.py
│   │   │   │   │   │   │   ├── scale.py
│   │   │   │   │   │   │   ├── swish.py
│   │   │   │   │   │   │   ├── transformer.py
│   │   │   │   │   │   │   ├── upsample.py
│   │   │   │   │   │   │   └── wrappers.py
│   │   │   │   │   │   ├── builder.py
│   │   │   │   │   │   ├── resnet.py
│   │   │   │   │   │   ├── utils/
│   │   │   │   │   │   │   ├── __init__.py
│   │   │   │   │   │   │   ├── flops_counter.py
│   │   │   │   │   │   │   ├── fuse_conv_bn.py
│   │   │   │   │   │   │   ├── sync_bn.py
│   │   │   │   │   │   │   └── weight_init.py
│   │   │   │   │   │   └── vgg.py
│   │   │   │   │   ├── engine/
│   │   │   │   │   │   ├── __init__.py
│   │   │   │   │   │   └── test.py
│   │   │   │   │   ├── fileio/
│   │   │   │   │   │   ├── __init__.py
│   │   │   │   │   │   ├── file_client.py
│   │   │   │   │   │   ├── handlers/
│   │   │   │   │   │   │   ├── __init__.py
│   │   │   │   │   │   │   ├── base.py
│   │   │   │   │   │   │   ├── json_handler.py
│   │   │   │   │   │   │   ├── pickle_handler.py
│   │   │   │   │   │   │   └── yaml_handler.py
│   │   │   │   │   │   ├── io.py
│   │   │   │   │   │   └── parse.py
│   │   │   │   │   ├── image/
│   │   │   │   │   │   ├── __init__.py
│   │   │   │   │   │   ├── colorspace.py
│   │   │   │   │   │   ├── geometric.py
│   │   │   │   │   │   ├── io.py
│   │   │   │   │   │   ├── misc.py
│   │   │   │   │   │   └── photometric.py
│   │   │   │   │   ├── model_zoo/
│   │   │   │   │   │   ├── deprecated.json
│   │   │   │   │   │   ├── mmcls.json
│   │   │   │   │   │   └── open_mmlab.json
│   │   │   │   │   ├── ops/
│   │   │   │   │   │   ├── __init__.py
│   │   │   │   │   │   ├── assign_score_withk.py
│   │   │   │   │   │   ├── ball_query.py
│   │   │   │   │   │   ├── bbox.py
│   │   │   │   │   │   ├── border_align.py
│   │   │   │   │   │   ├── box_iou_rotated.py
│   │   │   │   │   │   ├── carafe.py
│   │   │   │   │   │   ├── cc_attention.py
│   │   │   │   │   │   ├── contour_expand.py
│   │   │   │   │   │   ├── corner_pool.py
│   │   │   │   │   │   ├── correlation.py
│   │   │   │   │   │   ├── deform_conv.py
│   │   │   │   │   │   ├── deform_roi_pool.py
│   │   │   │   │   │   ├── deprecated_wrappers.py
│   │   │   │   │   │   ├── focal_loss.py
│   │   │   │   │   │   ├── furthest_point_sample.py
│   │   │   │   │   │   ├── fused_bias_leakyrelu.py
│   │   │   │   │   │   ├── gather_points.py
│   │   │   │   │   │   ├── group_points.py
│   │   │   │   │   │   ├── info.py
│   │   │   │   │   │   ├── iou3d.py
│   │   │   │   │   │   ├── knn.py
│   │   │   │   │   │   ├── masked_conv.py
│   │   │   │   │   │   ├── merge_cells.py
│   │   │   │   │   │   ├── modulated_deform_conv.py
│   │   │   │   │   │   ├── multi_scale_deform_attn.py
│   │   │   │   │   │   ├── nms.py
│   │   │   │   │   │   ├── pixel_group.py
│   │   │   │   │   │   ├── point_sample.py
│   │   │   │   │   │   ├── points_in_boxes.py
│   │   │   │   │   │   ├── points_sampler.py
│   │   │   │   │   │   ├── psa_mask.py
│   │   │   │   │   │   ├── roi_align.py
│   │   │   │   │   │   ├── roi_align_rotated.py
│   │   │   │   │   │   ├── roi_pool.py
│   │   │   │   │   │   ├── roiaware_pool3d.py
│   │   │   │   │   │   ├── roipoint_pool3d.py
│   │   │   │   │   │   ├── saconv.py
│   │   │   │   │   │   ├── scatter_points.py
│   │   │   │   │   │   ├── sync_bn.py
│   │   │   │   │   │   ├── three_interpolate.py
│   │   │   │   │   │   ├── three_nn.py
│   │   │   │   │   │   ├── tin_shift.py
│   │   │   │   │   │   ├── upfirdn2d.py
│   │   │   │   │   │   └── voxelize.py
│   │   │   │   │   ├── parallel/
│   │   │   │   │   │   ├── __init__.py
│   │   │   │   │   │   ├── _functions.py
│   │   │   │   │   │   ├── collate.py
│   │   │   │   │   │   ├── data_container.py
│   │   │   │   │   │   ├── data_parallel.py
│   │   │   │   │   │   ├── distributed.py
│   │   │   │   │   │   ├── distributed_deprecated.py
│   │   │   │   │   │   ├── registry.py
│   │   │   │   │   │   ├── scatter_gather.py
│   │   │   │   │   │   └── utils.py
│   │   │   │   │   ├── runner/
│   │   │   │   │   │   ├── __init__.py
│   │   │   │   │   │   ├── base_module.py
│   │   │   │   │   │   ├── base_runner.py
│   │   │   │   │   │   ├── builder.py
│   │   │   │   │   │   ├── checkpoint.py
│   │   │   │   │   │   ├── default_constructor.py
│   │   │   │   │   │   ├── dist_utils.py
│   │   │   │   │   │   ├── epoch_based_runner.py
│   │   │   │   │   │   ├── fp16_utils.py
│   │   │   │   │   │   ├── hooks/
│   │   │   │   │   │   │   ├── __init__.py
│   │   │   │   │   │   │   ├── checkpoint.py
│   │   │   │   │   │   │   ├── closure.py
│   │   │   │   │   │   │   ├── ema.py
│   │   │   │   │   │   │   ├── evaluation.py
│   │   │   │   │   │   │   ├── hook.py
│   │   │   │   │   │   │   ├── iter_timer.py
│   │   │   │   │   │   │   ├── logger/
│   │   │   │   │   │   │   │   ├── __init__.py
│   │   │   │   │   │   │   │   ├── base.py
│   │   │   │   │   │   │   │   ├── dvclive.py
│   │   │   │   │   │   │   │   ├── mlflow.py
│   │   │   │   │   │   │   │   ├── neptune.py
│   │   │   │   │   │   │   │   ├── pavi.py
│   │   │   │   │   │   │   │   ├── tensorboard.py
│   │   │   │   │   │   │   │   ├── text.py
│   │   │   │   │   │   │   │   └── wandb.py
│   │   │   │   │   │   │   ├── lr_updater.py
│   │   │   │   │   │   │   ├── memory.py
│   │   │   │   │   │   │   ├── momentum_updater.py
│   │   │   │   │   │   │   ├── optimizer.py
│   │   │   │   │   │   │   ├── profiler.py
│   │   │   │   │   │   │   ├── sampler_seed.py
│   │   │   │   │   │   │   └── sync_buffer.py
│   │   │   │   │   │   ├── iter_based_runner.py
│   │   │   │   │   │   ├── log_buffer.py
│   │   │   │   │   │   ├── optimizer/
│   │   │   │   │   │   │   ├── __init__.py
│   │   │   │   │   │   │   ├── builder.py
│   │   │   │   │   │   │   └── default_constructor.py
│   │   │   │   │   │   ├── priority.py
│   │   │   │   │   │   └── utils.py
│   │   │   │   │   ├── utils/
│   │   │   │   │   │   ├── __init__.py
│   │   │   │   │   │   ├── config.py
│   │   │   │   │   │   ├── env.py
│   │   │   │   │   │   ├── ext_loader.py
│   │   │   │   │   │   ├── logging.py
│   │   │   │   │   │   ├── misc.py
│   │   │   │   │   │   ├── parrots_jit.py
│   │   │   │   │   │   ├── parrots_wrapper.py
│   │   │   │   │   │   ├── path.py
│   │   │   │   │   │   ├── progressbar.py
│   │   │   │   │   │   ├── registry.py
│   │   │   │   │   │   ├── testing.py
│   │   │   │   │   │   ├── timer.py
│   │   │   │   │   │   ├── trace.py
│   │   │   │   │   │   └── version_utils.py
│   │   │   │   │   ├── version.py
│   │   │   │   │   ├── video/
│   │   │   │   │   │   ├── __init__.py
│   │   │   │   │   │   ├── io.py
│   │   │   │   │   │   ├── optflow.py
│   │   │   │   │   │   └── processing.py
│   │   │   │   │   └── visualization/
│   │   │   │   │       ├── __init__.py
│   │   │   │   │       ├── color.py
│   │   │   │   │       ├── image.py
│   │   │   │   │       └── optflow.py
│   │   │   │   ├── mmcv_custom/
│   │   │   │   │   ├── __init__.py
│   │   │   │   │   └── checkpoint.py
│   │   │   │   └── mmseg/
│   │   │   │       ├── apis/
│   │   │   │       │   ├── __init__.py
│   │   │   │       │   ├── inference.py
│   │   │   │       │   ├── test.py
│   │   │   │       │   └── train.py
│   │   │   │       ├── core/
│   │   │   │       │   ├── __init__.py
│   │   │   │       │   ├── evaluation/
│   │   │   │       │   │   ├── __init__.py
│   │   │   │       │   │   ├── class_names.py
│   │   │   │       │   │   ├── eval_hooks.py
│   │   │   │       │   │   └── metrics.py
│   │   │   │       │   ├── seg/
│   │   │   │       │   │   ├── __init__.py
│   │   │   │       │   │   ├── builder.py
│   │   │   │       │   │   └── sampler/
│   │   │   │       │   │       ├── __init__.py
│   │   │   │       │   │       ├── base_pixel_sampler.py
│   │   │   │       │   │       └── ohem_pixel_sampler.py
│   │   │   │       │   └── utils/
│   │   │   │       │       ├── __init__.py
│   │   │   │       │       └── misc.py
│   │   │   │       ├── datasets/
│   │   │   │       │   ├── __init__.py
│   │   │   │       │   ├── ade.py
│   │   │   │       │   ├── builder.py
│   │   │   │       │   ├── chase_db1.py
│   │   │   │       │   ├── cityscapes.py
│   │   │   │       │   ├── custom.py
│   │   │   │       │   ├── dataset_wrappers.py
│   │   │   │       │   ├── drive.py
│   │   │   │       │   ├── hrf.py
│   │   │   │       │   ├── pascal_context.py
│   │   │   │       │   ├── pipelines/
│   │   │   │       │   │   ├── __init__.py
│   │   │   │       │   │   ├── compose.py
│   │   │   │       │   │   ├── formating.py
│   │   │   │       │   │   ├── loading.py
│   │   │   │       │   │   ├── test_time_aug.py
│   │   │   │       │   │   └── transforms.py
│   │   │   │       │   ├── stare.py
│   │   │   │       │   └── voc.py
│   │   │   │       ├── models/
│   │   │   │       │   ├── __init__.py
│   │   │   │       │   ├── backbones/
│   │   │   │       │   │   ├── __init__.py
│   │   │   │       │   │   ├── cgnet.py
│   │   │   │       │   │   ├── fast_scnn.py
│   │   │   │       │   │   ├── hrnet.py
│   │   │   │       │   │   ├── mobilenet_v2.py
│   │   │   │       │   │   ├── mobilenet_v3.py
│   │   │   │       │   │   ├── resnest.py
│   │   │   │       │   │   ├── resnet.py
│   │   │   │       │   │   ├── resnext.py
│   │   │   │       │   │   ├── unet.py
│   │   │   │       │   │   ├── uniformer.py
│   │   │   │       │   │   └── vit.py
│   │   │   │       │   ├── builder.py
│   │   │   │       │   ├── decode_heads/
│   │   │   │       │   │   ├── __init__.py
│   │   │   │       │   │   ├── ann_head.py
│   │   │   │       │   │   ├── apc_head.py
│   │   │   │       │   │   ├── aspp_head.py
│   │   │   │       │   │   ├── cascade_decode_head.py
│   │   │   │       │   │   ├── cc_head.py
│   │   │   │       │   │   ├── da_head.py
│   │   │   │       │   │   ├── decode_head.py
│   │   │   │       │   │   ├── dm_head.py
│   │   │   │       │   │   ├── dnl_head.py
│   │   │   │       │   │   ├── ema_head.py
│   │   │   │       │   │   ├── enc_head.py
│   │   │   │       │   │   ├── fcn_head.py
│   │   │   │       │   │   ├── fpn_head.py
│   │   │   │       │   │   ├── gc_head.py
│   │   │   │       │   │   ├── lraspp_head.py
│   │   │   │       │   │   ├── nl_head.py
│   │   │   │       │   │   ├── ocr_head.py
│   │   │   │       │   │   ├── point_head.py
│   │   │   │       │   │   ├── psa_head.py
│   │   │   │       │   │   ├── psp_head.py
│   │   │   │       │   │   ├── sep_aspp_head.py
│   │   │   │       │   │   ├── sep_fcn_head.py
│   │   │   │       │   │   └── uper_head.py
│   │   │   │       │   ├── losses/
│   │   │   │       │   │   ├── __init__.py
│   │   │   │       │   │   ├── accuracy.py
│   │   │   │       │   │   ├── cross_entropy_loss.py
│   │   │   │       │   │   ├── dice_loss.py
│   │   │   │       │   │   ├── lovasz_loss.py
│   │   │   │       │   │   └── utils.py
│   │   │   │       │   ├── necks/
│   │   │   │       │   │   ├── __init__.py
│   │   │   │       │   │   ├── fpn.py
│   │   │   │       │   │   └── multilevel_neck.py
│   │   │   │       │   ├── segmentors/
│   │   │   │       │   │   ├── __init__.py
│   │   │   │       │   │   ├── base.py
│   │   │   │       │   │   ├── cascade_encoder_decoder.py
│   │   │   │       │   │   └── encoder_decoder.py
│   │   │   │       │   └── utils/
│   │   │   │       │       ├── __init__.py
│   │   │   │       │       ├── drop.py
│   │   │   │       │       ├── inverted_residual.py
│   │   │   │       │       ├── make_divisible.py
│   │   │   │       │       ├── res_layer.py
│   │   │   │       │       ├── se_layer.py
│   │   │   │       │       ├── self_attention_block.py
│   │   │   │       │       ├── up_conv_block.py
│   │   │   │       │       └── weight_init.py
│   │   │   │       ├── ops/
│   │   │   │       │   ├── __init__.py
│   │   │   │       │   ├── encoding.py
│   │   │   │       │   └── wrappers.py
│   │   │   │       └── utils/
│   │   │   │           ├── __init__.py
│   │   │   │           ├── collect_env.py
│   │   │   │           └── logger.py
│   │   │   └── util.py
│   │   ├── config.py
│   │   ├── dist_utils.py
│   │   ├── gradcam.py
│   │   ├── logger.py
│   │   ├── optims.py
│   │   ├── registry.py
│   │   ├── utils.py
│   │   └── vqa_tools/
│   │       ├── __init__.py
│   │       ├── vqa.py
│   │       └── vqa_eval.py
│   ├── configs/
│   │   ├── datasets/
│   │   │   ├── aokvqa/
│   │   │   │   ├── defaults.yaml
│   │   │   │   └── defaults_instruct.yaml
│   │   │   ├── audiocaps/
│   │   │   │   ├── defaults_mm_cap.yaml
│   │   │   │   ├── defaults_mm_cap_instruct.yaml
│   │   │   │   └── defaults_mm_qa.yaml
│   │   │   ├── audioset/
│   │   │   │   ├── defaults_mm_cap.yaml
│   │   │   │   └── defaults_mm_cap_instruct.yaml
│   │   │   ├── avsd/
│   │   │   │   ├── defaults_dial.yaml
│   │   │   │   └── defaults_mm_dial_instruct.yaml
│   │   │   ├── blip_diffusion_datasets/
│   │   │   │   └── defaults.yaml
│   │   │   ├── capfilt14m/
│   │   │   │   ├── defaults_cap.yaml
│   │   │   │   └── defaults_cap_instruct.yaml
│   │   │   ├── charade/
│   │   │   │   ├── defaults_cap.yaml
│   │   │   │   └── defaults_cap_instruct.yaml
│   │   │   ├── clotho/
│   │   │   │   ├── defaults_mm_cap.yaml
│   │   │   │   ├── defaults_mm_cap_instruct.yaml
│   │   │   │   └── defaults_mm_qa.yaml
│   │   │   ├── coco/
│   │   │   │   ├── defaults_cap.yaml
│   │   │   │   ├── defaults_cap_instruct.yaml
│   │   │   │   ├── defaults_ret.yaml
│   │   │   │   ├── defaults_vqa.yaml
│   │   │   │   ├── defaults_vqa_instruct.yaml
│   │   │   │   └── eval_vqa.yaml
│   │   │   ├── coin/
│   │   │   │   ├── defaults_cap.yaml
│   │   │   │   └── defaults_cap_instruct.yaml
│   │   │   ├── conceptual_caption/
│   │   │   │   ├── defaults_12m.yaml
│   │   │   │   ├── defaults_12m_instruct.yaml
│   │   │   │   ├── defaults_3m.yaml
│   │   │   │   └── defaults_3m_instruct.yaml
│   │   │   ├── didemo/
│   │   │   │   └── defaults_ret.yaml
│   │   │   ├── discriminatory_reasoning/
│   │   │   │   ├── defaults_mm_audio_video.yaml
│   │   │   │   ├── defaults_mm_image_pc.yaml
│   │   │   │   └── discriminatory_dataset/
│   │   │   │       ├── audiocaps_discrn.json
│   │   │   │       └── objaverse_discrn.json
│   │   │   ├── esc50/
│   │   │   │   └── defaults_mm_cls.yaml
│   │   │   ├── flickr30k/
│   │   │   │   ├── defaults.yaml
│   │   │   │   ├── defaults_cap.yaml
│   │   │   │   └── defaults_cap_instruct.yaml
│   │   │   ├── gqa/
│   │   │   │   ├── balanced_testdev.yaml
│   │   │   │   ├── balanced_testdev_instruct.yaml
│   │   │   │   ├── balanced_val.yaml
│   │   │   │   ├── balanced_val_instruct.yaml
│   │   │   │   ├── defaults.yaml
│   │   │   │   └── defaults_instruct.yaml
│   │   │   ├── iconqa/
│   │   │   │   ├── defaults.yaml
│   │   │   │   └── defaults_instruct.yaml
│   │   │   ├── imagenet/
│   │   │   │   └── defaults.yaml
│   │   │   ├── laion/
│   │   │   │   ├── defaults_2B_multi.yaml
│   │   │   │   ├── defaults_400M.yaml
│   │   │   │   └── defaults_400M_instruct.yaml
│   │   │   ├── llava150k/
│   │   │   │   └── defaults_dial.yaml
│   │   │   ├── modelnet40/
│   │   │   │   └── defaults_cls.yaml
│   │   │   ├── msrvtt/
│   │   │   │   ├── defaults_cap.yaml
│   │   │   │   ├── defaults_cap_instruct.yaml
│   │   │   │   ├── defaults_qa.yaml
│   │   │   │   ├── defaults_qa_instruct.yaml
│   │   │   │   └── defaults_ret.yaml
│   │   │   ├── msvd/
│   │   │   │   ├── defaults_cap.yaml
│   │   │   │   ├── defaults_cap_instruct.yaml
│   │   │   │   ├── defaults_qa.yaml
│   │   │   │   └── defaults_qa_instruct.yaml
│   │   │   ├── music_avqa/
│   │   │   │   ├── defaults_mm_qa.yaml
│   │   │   │   └── defaults_mm_qa_instruct.yaml
│   │   │   ├── nlvr/
│   │   │   │   └── defaults.yaml
│   │   │   ├── nocaps/
│   │   │   │   └── defaults.yaml
│   │   │   ├── objaverse/
│   │   │   │   ├── defaults_mm_cap.yaml
│   │   │   │   ├── defaults_mm_cap_instruct.yaml
│   │   │   │   └── defaults_mm_qa.yaml
│   │   │   ├── ocrvqa/
│   │   │   │   ├── defaults.yaml
│   │   │   │   └── defaults_instruct.yaml
│   │   │   ├── okvqa/
│   │   │   │   ├── defaults.yaml
│   │   │   │   └── defaults_instruct.yaml
│   │   │   ├── sbu_caption/
│   │   │   │   ├── defaults.yaml
│   │   │   │   └── defaults_instruct.yaml
│   │   │   ├── scienceqa/
│   │   │   │   ├── defaults.yaml
│   │   │   │   └── defaults_instruct.yaml
│   │   │   ├── shapenet/
│   │   │   │   ├── defaults_mm_cap.yaml
│   │   │   │   └── defaults_mm_cap_instruct.yaml
│   │   │   ├── snli_ve/
│   │   │   │   ├── defaults.yaml
│   │   │   │   └── defaults_instruct.yaml
│   │   │   ├── textcaps/
│   │   │   │   ├── defaults.yaml
│   │   │   │   └── defaults_instruct.yaml
│   │   │   ├── valor/
│   │   │   │   ├── defaults_mm_cap.yaml
│   │   │   │   └── defaults_mm_cap_instruct.yaml
│   │   │   ├── vatex/
│   │   │   │   ├── defaults_cap.yaml
│   │   │   │   └── defaults_cap_instruct.yaml
│   │   │   ├── vg/
│   │   │   │   ├── defaults_caption.yaml
│   │   │   │   ├── defaults_caption_instruct.yaml
│   │   │   │   ├── defaults_vqa.yaml
│   │   │   │   └── defaults_vqa_instruct.yaml
│   │   │   ├── violin/
│   │   │   │   ├── defaults_cap.yaml
│   │   │   │   ├── defaults_cap_instruct.yaml
│   │   │   │   ├── defaults_entail.yaml
│   │   │   │   └── defaults_entail_instruct.yaml
│   │   │   ├── visdial/
│   │   │   │   ├── defaults_dial.yaml
│   │   │   │   └── defaults_dial_instruct.yaml
│   │   │   ├── vizwiz/
│   │   │   │   └── defaults.yaml
│   │   │   ├── vlep/
│   │   │   │   ├── defaults_cap.yaml
│   │   │   │   └── defaults_cap_instruct.yaml
│   │   │   ├── vsr/
│   │   │   │   ├── defaults.yaml
│   │   │   │   ├── defaults_classification.yaml
│   │   │   │   ├── defaults_classification_instruct.yaml
│   │   │   │   └── defaults_instruct.yaml
│   │   │   ├── wavcaps/
│   │   │   │   ├── defaults_mm_cap.yaml
│   │   │   │   └── defaults_mm_cap_instruct.yaml
│   │   │   ├── webvid/
│   │   │   │   ├── defaults_cap.yaml
│   │   │   │   └── defaults_cap_instruct.yaml
│   │   │   ├── youcook/
│   │   │   │   ├── defaults_cap.yaml
│   │   │   │   └── defaults_cap_instruct.yaml
│   │   │   └── yt8m/
│   │   │       └── defaults_mm_dial.yaml
│   │   ├── default.yaml
│   │   └── models/
│   │       ├── albef_classification_ve.yaml
│   │       ├── albef_feature_extractor.yaml
│   │       ├── albef_nlvr.yaml
│   │       ├── albef_pretrain_base.yaml
│   │       ├── albef_retrieval_coco.yaml
│   │       ├── albef_retrieval_flickr.yaml
│   │       ├── albef_vqav2.yaml
│   │       ├── alpro_qa_msrvtt.yaml
│   │       ├── alpro_qa_msvd.yaml
│   │       ├── alpro_retrieval_didemo.yaml
│   │       ├── alpro_retrieval_msrvtt.yaml
│   │       ├── bert_config.json
│   │       ├── bert_config_alpro.json
│   │       ├── blip-diffusion/
│   │       │   ├── blip_diffusion_base.yaml
│   │       │   ├── blip_diffusion_controlnet_canny.yaml
│   │       │   ├── blip_diffusion_controlnet_depth.yaml
│   │       │   └── blip_diffusion_controlnet_hed.yaml
│   │       ├── blip2/
│   │       │   ├── blip2_caption_flant5xl.yaml
│   │       │   ├── blip2_caption_opt2.7b.yaml
│   │       │   ├── blip2_caption_opt6.7b.yaml
│   │       │   ├── blip2_coco.yaml
│   │       │   ├── blip2_instruct_flant5xl.yaml
│   │       │   ├── blip2_instruct_flant5xxl.yaml
│   │       │   ├── blip2_instruct_vicuna13b.yaml
│   │       │   ├── blip2_instruct_vicuna7b.yaml
│   │       │   ├── blip2_pretrain.yaml
│   │       │   ├── blip2_pretrain_flant5xl.yaml
│   │       │   ├── blip2_pretrain_flant5xl_vitL.yaml
│   │       │   ├── blip2_pretrain_flant5xxl.yaml
│   │       │   ├── blip2_pretrain_llama7b.yaml
│   │       │   ├── blip2_pretrain_opt2.7b.yaml
│   │       │   ├── blip2_pretrain_opt6.7b.yaml
│   │       │   ├── blip2_pretrain_vitL.yaml
│   │       │   ├── blip2_xinstruct_vicuna13b.yaml
│   │       │   └── blip2_xinstruct_vicuna7b.yaml
│   │       ├── blip_caption_base_coco.yaml
│   │       ├── blip_caption_large_coco.yaml
│   │       ├── blip_classification_base.yaml
│   │       ├── blip_feature_extractor_base.yaml
│   │       ├── blip_itm_base.yaml
│   │       ├── blip_itm_large.yaml
│   │       ├── blip_nlvr.yaml
│   │       ├── blip_pretrain_base.yaml
│   │       ├── blip_pretrain_large.yaml
│   │       ├── blip_retrieval_coco.yaml
│   │       ├── blip_retrieval_flickr.yaml
│   │       ├── blip_vqa_aokvqa.yaml
│   │       ├── blip_vqa_okvqa.yaml
│   │       ├── blip_vqav2.yaml
│   │       ├── clip/
│   │       │   ├── RN101-quickgelu.json
│   │       │   ├── RN101.json
│   │       │   ├── RN50-quickgelu.json
│   │       │   ├── RN50.json
│   │       │   ├── RN50x16.json
│   │       │   ├── RN50x4.json
│   │       │   ├── ViT-B-16-plus-240.json
│   │       │   ├── ViT-B-16-plus.json
│   │       │   ├── ViT-B-16.json
│   │       │   ├── ViT-B-32-plus-256.json
│   │       │   ├── ViT-B-32-quickgelu.json
│   │       │   ├── ViT-B-32.json
│   │       │   ├── ViT-H-14.json
│   │       │   ├── ViT-H-16.json
│   │       │   ├── ViT-L-14-280.json
│   │       │   ├── ViT-L-14-336.json
│   │       │   ├── ViT-L-14.json
│   │       │   ├── ViT-L-16-320.json
│   │       │   ├── ViT-L-16.json
│   │       │   ├── ViT-g-14.json
│   │       │   ├── timm-efficientnetv2_rw_s.json
│   │       │   ├── timm-resnet50d.json
│   │       │   ├── timm-resnetaa50d.json
│   │       │   ├── timm-resnetblur50.json
│   │       │   ├── timm-swin_base_patch4_window7_224.json
│   │       │   ├── timm-vit_base_patch16_224.json
│   │       │   ├── timm-vit_base_patch32_224.json
│   │       │   └── timm-vit_small_patch16_224.json
│   │       ├── clip_resnet50.yaml
│   │       ├── clip_vit_base16.yaml
│   │       ├── clip_vit_base32.yaml
│   │       ├── clip_vit_large14.yaml
│   │       ├── clip_vit_large14_336.yaml
│   │       ├── gpt_dialogue_base.yaml
│   │       ├── img2prompt-vqa/
│   │       │   └── img2prompt_vqa_base.yaml
│   │       ├── med_config.json
│   │       ├── med_config_albef.json
│   │       ├── med_large_config.json
│   │       └── pnp-vqa/
│   │           ├── pnp_vqa_3b.yaml
│   │           ├── pnp_vqa_base.yaml
│   │           ├── pnp_vqa_large.yaml
│   │           ├── unifiedqav2_3b_config.json
│   │           ├── unifiedqav2_base_config.json
│   │           └── unifiedqav2_large_config.json
│   ├── datasets/
│   │   ├── builders/
│   │   │   ├── __init__.py
│   │   │   ├── audio_caption_builder.py
│   │   │   ├── audio_qa_builder.py
│   │   │   ├── base_dataset_builder.py
│   │   │   ├── caption_builder.py
│   │   │   ├── classification_builder.py
│   │   │   ├── dialogue_builder.py
│   │   │   ├── discrn_builders.py
│   │   │   ├── image_text_pair_builder.py
│   │   │   ├── imagefolder_builder.py
│   │   │   ├── object3d_caption_builder.py
│   │   │   ├── object3d_classification_builder.py
│   │   │   ├── object3d_qa_builder.py
│   │   │   ├── retrieval_builder.py
│   │   │   ├── text_to_image_generation_builder.py
│   │   │   ├── video_qa_builder.py
│   │   │   └── vqa_builder.py
│   │   ├── data_utils.py
│   │   ├── datasets/
│   │   │   ├── aok_vqa_datasets.py
│   │   │   ├── audio_captioning_datasets.py
│   │   │   ├── audio_classification_datasets.py
│   │   │   ├── audio_qa_datasets.py
│   │   │   ├── avsd_dialogue_datasets.py
│   │   │   ├── base_dataset.py
│   │   │   ├── capfilt_dataset.py
│   │   │   ├── caption_datasets.py
│   │   │   ├── coco_caption_datasets.py
│   │   │   ├── coco_vqa_datasets.py
│   │   │   ├── dataloader_utils.py
│   │   │   ├── dialogue_datasets.py
│   │   │   ├── discriminatory_reasoning_datasets.py
│   │   │   ├── gqa_datasets.py
│   │   │   ├── iconqa_datasets.py
│   │   │   ├── image_text_pair_datasets.py
│   │   │   ├── imagefolder_dataset.py
│   │   │   ├── laion_dataset.py
│   │   │   ├── llava150k_dataset.py
│   │   │   ├── multimodal_classification_datasets.py
│   │   │   ├── music_avqa.py
│   │   │   ├── nlvr_datasets.py
│   │   │   ├── object3d_captioning_datasets.py
│   │   │   ├── object3d_classification_datasets.py
│   │   │   ├── object3d_qa_datasets.py
│   │   │   ├── ocr_datasets.py
│   │   │   ├── retrieval_datasets.py
│   │   │   ├── snli_ve_datasets.py
│   │   │   ├── subject_driven_t2i_dataset.py
│   │   │   ├── textcaps_datasets.py
│   │   │   ├── valor_caption.py
│   │   │   ├── vatex_captioning_datasets.py
│   │   │   ├── vg_vqa_datasets.py
│   │   │   ├── video_caption_datasets.py
│   │   │   ├── video_vqa_datasets.py
│   │   │   ├── violin_dataset.py
│   │   │   ├── visdial_dialogue_datasets.py
│   │   │   ├── vizwiz_vqa_datasets.py
│   │   │   ├── vlep_dataset.py
│   │   │   ├── vqa_datasets.py
│   │   │   ├── vsr_datasets.py
│   │   │   └── yt8m_video_dialogue_datasets.py
│   │   └── download_scripts/
│   │       ├── DownloadConceptualCaptions/
│   │       │   ├── LICENSE
│   │       │   ├── README.md
│   │       │   ├── create_annotation_12m.ipynb
│   │       │   ├── create_annotation_3m.ipynb
│   │       │   ├── download_data_cc12m.py
│   │       │   └── download_data_cc3m.py
│   │       ├── download_charade.py
│   │       ├── download_coco.py
│   │       ├── download_coin.py
│   │       ├── download_didemo.py
│   │       ├── download_flickr.py
│   │       ├── download_gqa.py
│   │       ├── download_iconqa.py
│   │       ├── download_msrvtt.py
│   │       ├── download_msvd.py
│   │       ├── download_nocaps.py
│   │       ├── download_sbu.py
│   │       ├── download_vg.py
│   │       └── download_violin.py
│   ├── models/
│   │   ├── __init__.py
│   │   ├── albef_models/
│   │   │   ├── __init__.py
│   │   │   ├── albef_classification.py
│   │   │   ├── albef_feature_extractor.py
│   │   │   ├── albef_nlvr.py
│   │   │   ├── albef_outputs.py
│   │   │   ├── albef_pretrain.py
│   │   │   ├── albef_retrieval.py
│   │   │   └── albef_vqa.py
│   │   ├── alpro_models/
│   │   │   ├── __init__.py
│   │   │   ├── alpro_outputs.py
│   │   │   ├── alpro_qa.py
│   │   │   └── alpro_retrieval.py
│   │   ├── base_model.py
│   │   ├── beats/
│   │   │   ├── BEATs.py
│   │   │   ├── LICENSE_BEATs.txt
│   │   │   ├── README.md
│   │   │   ├── Tokenizers.py
│   │   │   ├── backbone.py
│   │   │   ├── modules.py
│   │   │   └── quantizer.py
│   │   ├── beats_encoder.py
│   │   ├── blip2_models/
│   │   │   ├── Qformer.py
│   │   │   ├── __init__.py
│   │   │   ├── blip2.py
│   │   │   ├── blip2_image_text_matching.py
│   │   │   ├── blip2_opt.py
│   │   │   ├── blip2_qformer.py
│   │   │   ├── blip2_t5.py
│   │   │   ├── blip2_t5_instruct.py
│   │   │   ├── blip2_vicuna_instruct.py
│   │   │   ├── blip2_vicuna_xinstruct.py
│   │   │   ├── modeling_llama.py
│   │   │   ├── modeling_opt.py
│   │   │   └── modeling_t5.py
│   │   ├── blip_diffusion_models/
│   │   │   ├── __init__.py
│   │   │   ├── blip_diffusion.py
│   │   │   ├── modeling_ctx_clip.py
│   │   │   ├── ptp_utils.py
│   │   │   └── utils.py
│   │   ├── blip_models/
│   │   │   ├── __init__.py
│   │   │   ├── blip.py
│   │   │   ├── blip_caption.py
│   │   │   ├── blip_classification.py
│   │   │   ├── blip_feature_extractor.py
│   │   │   ├── blip_image_text_matching.py
│   │   │   ├── blip_nlvr.py
│   │   │   ├── blip_outputs.py
│   │   │   ├── blip_pretrain.py
│   │   │   ├── blip_retrieval.py
│   │   │   ├── blip_vqa.py
│   │   │   └── nlvr_encoder.py
│   │   ├── clip_models/
│   │   │   ├── __init__.py
│   │   │   ├── clip_outputs.py
│   │   │   ├── loss.py
│   │   │   ├── model.py
│   │   │   ├── pretrained.py
│   │   │   ├── timm_model.py
│   │   │   ├── tokenizer.py
│   │   │   ├── transform.py
│   │   │   └── utils.py
│   │   ├── clip_vit.py
│   │   ├── eva_vit.py
│   │   ├── gpt_models/
│   │   │   └── gpt_dialogue.py
│   │   ├── img2prompt_models/
│   │   │   ├── __init__.py
│   │   │   └── img2prompt_vqa.py
│   │   ├── med.py
│   │   ├── pnp_vqa_models/
│   │   │   ├── __init__.py
│   │   │   ├── pnp_unifiedqav2_fid.py
│   │   │   └── pnp_vqa.py
│   │   ├── timesformer/
│   │   │   ├── __init__.py
│   │   │   ├── conv2d_same.py
│   │   │   ├── features.py
│   │   │   ├── helpers.py
│   │   │   ├── linear.py
│   │   │   ├── vit.py
│   │   │   └── vit_utils.py
│   │   ├── ulip_models/
│   │   │   ├── ULIP_models.py
│   │   │   ├── losses.py
│   │   │   ├── pointbert/
│   │   │   │   ├── PointTransformer_8192point.yaml
│   │   │   │   ├── checkpoint.py
│   │   │   │   ├── dvae.py
│   │   │   │   ├── logger.py
│   │   │   │   ├── misc.py
│   │   │   │   └── point_encoder.py
│   │   │   ├── ulip_scaled_up_config.yaml
│   │   │   └── utils/
│   │   │       ├── __init__.py
│   │   │       ├── build.py
│   │   │       ├── config.py
│   │   │       ├── io.py
│   │   │       ├── logger.py
│   │   │       ├── registry.py
│   │   │       ├── tokenizer.py
│   │   │       └── utils.py
│   │   └── vit.py
│   ├── processors/
│   │   ├── __init__.py
│   │   ├── alpro_processors.py
│   │   ├── audio_processors.py
│   │   ├── base_processor.py
│   │   ├── blip_diffusion_processors.py
│   │   ├── blip_processors.py
│   │   ├── clip_processors.py
│   │   ├── functional_video.py
│   │   ├── gpt_processors.py
│   │   ├── instruction_text_processors.py
│   │   ├── randaugment.py
│   │   ├── transforms_video.py
│   │   └── ulip_processors.py
│   ├── projects/
│   │   ├── albef/
│   │   │   ├── eval/
│   │   │   │   ├── nlvr_eval.yaml
│   │   │   │   ├── ret_coco_eval.yaml
│   │   │   │   ├── ret_flickr30k_eval.yaml
│   │   │   │   ├── snli_ve_eval.yaml
│   │   │   │   ├── vqa_test.yaml
│   │   │   │   └── vqa_val.yaml
│   │   │   └── train/
│   │   │       ├── aokvqa_ft.yaml
│   │   │       ├── nlvr_ft.yaml
│   │   │       ├── okvqa_ft.yaml
│   │   │       ├── pretrain.yaml
│   │   │       ├── ret_coco_ft.yaml
│   │   │       ├── ret_flickr30k_ft.yaml
│   │   │       ├── snli_ve_ft.yaml
│   │   │       └── vqa_ft.yaml
│   │   ├── alpro/
│   │   │   ├── eval/
│   │   │   │   ├── didemo_ret_eval.yaml
│   │   │   │   ├── msrvtt_qa_eval.yaml
│   │   │   │   ├── msrvtt_ret_eval.yaml
│   │   │   │   └── msvd_qa_eval.yaml
│   │   │   └── train/
│   │   │       ├── didemo_ret_ft.yaml
│   │   │       ├── msrvtt_qa_ft.yaml
│   │   │       ├── msrvtt_retrieval_ft.yaml
│   │   │       └── msvd_qa_ft.yaml
│   │   ├── blip/
│   │   │   ├── coco_cap_ft_iter.yaml
│   │   │   ├── eval/
│   │   │   │   ├── aokvqa_eval.yaml
│   │   │   │   ├── caption_coco_eval.yaml
│   │   │   │   ├── caption_coco_eval_large.yaml
│   │   │   │   ├── nlvr_eval.yaml
│   │   │   │   ├── nocaps_eval.yaml
│   │   │   │   ├── okvqa_eval.yaml
│   │   │   │   ├── ret_coco_eval.yaml
│   │   │   │   ├── ret_flickr_eval.yaml
│   │   │   │   └── vqav2_eval.yaml
│   │   │   └── train/
│   │   │       ├── aokvqa_ft.yaml
│   │   │       ├── caption_coco_ft.yaml
│   │   │       ├── caption_coco_large_ft.yaml
│   │   │       ├── nlvr_ft.yaml
│   │   │       ├── okvqa_ft.yaml
│   │   │       ├── pretrain_14m.yaml
│   │   │       ├── retrieval_coco_ft.yaml
│   │   │       ├── retrieval_flickr_ft.yaml
│   │   │       └── vqav2_ft.yaml
│   │   ├── blip2/
│   │   │   ├── eval/
│   │   │   │   ├── caption_coco_flant5xl_eval.yaml
│   │   │   │   ├── caption_coco_opt2.7b_eval.yaml
│   │   │   │   ├── caption_coco_opt6.7b_eval.yaml
│   │   │   │   ├── caption_nocaps_out_domain_flant5xl_eval.yaml
│   │   │   │   ├── caption_nocaps_out_domain_flant5xxl_eval.yaml
│   │   │   │   ├── gqa_zeroshot_flant5xl_eval.yaml
│   │   │   │   ├── okvqa_zeroshot_flant5xl_eval.yaml
│   │   │   │   ├── ret_coco_eval.yaml
│   │   │   │   ├── ret_flickr_eval.yaml
│   │   │   │   ├── vqav2_zeroshot_flant5xl_eval.yaml
│   │   │   │   └── vqav2_zeroshot_opt_eval.yaml
│   │   │   └── train/
│   │   │       ├── caption_coco_ft.yaml
│   │   │       ├── pretrain_stage1.yaml
│   │   │       ├── pretrain_stage2.yaml
│   │   │       └── retrieval_coco_ft.yaml
│   │   ├── blip_diffusion/
│   │   │   ├── finetune-db-dog.yaml
│   │   │   ├── finetune-db-pink-dress.yaml
│   │   │   ├── finetune-db-shein-jacket.yaml
│   │   │   └── finetune-db-template.yaml
│   │   ├── clip/
│   │   │   ├── exp_coco_ret_eval.yaml
│   │   │   ├── exp_flickr_ret_eval.yaml
│   │   │   └── exp_imnet_zs_eval.yaml
│   │   ├── gpt/
│   │   │   ├── eval/
│   │   │   │   └── dialogue_avsd_eval.yaml
│   │   │   └── train/
│   │   │       └── dialogue_avsd_ft.yaml
│   │   ├── instructblip/
│   │   │   ├── caption_coco_flant5xl_eval_test.yaml
│   │   │   ├── caption_coco_flant5xl_eval_val.yaml
│   │   │   ├── caption_coco_flant5xxl_eval_test.yaml
│   │   │   ├── caption_coco_flant5xxl_eval_val.yaml
│   │   │   ├── caption_coco_vicuna13b_eval_test.yaml
│   │   │   ├── caption_coco_vicuna13b_eval_val.yaml
│   │   │   ├── caption_coco_vicuna7b_eval_test.yaml
│   │   │   ├── caption_coco_vicuna7b_eval_val.yaml
│   │   │   ├── caption_msrvtt_flant5xl_eval_test.yaml
│   │   │   ├── caption_msrvtt_flant5xl_eval_val.yaml
│   │   │   ├── caption_msrvtt_flant5xxl_eval_test.yaml
│   │   │   ├── caption_msrvtt_flant5xxl_eval_val.yaml
│   │   │   ├── caption_msrvtt_vicuna13b_eval_test.yaml
│   │   │   ├── caption_msrvtt_vicuna13b_eval_val.yaml
│   │   │   ├── caption_msrvtt_vicuna7b_eval_test.yaml
│   │   │   ├── caption_msrvtt_vicuna7b_eval_val.yaml
│   │   │   ├── caption_msvd_flant5xl_eval.yaml
│   │   │   ├── caption_msvd_flant5xxl_eval.yaml
│   │   │   ├── caption_msvd_vicuna13b_eval.yaml
│   │   │   ├── caption_msvd_vicuna7b_eval.yaml
│   │   │   ├── caption_nocaps_out_domain_flant5xl_eval.yaml
│   │   │   ├── caption_nocaps_out_domain_flant5xxl_eval.yaml
│   │   │   ├── caption_nocaps_out_domain_vicuna13b_eval.yaml
│   │   │   ├── caption_nocaps_out_domain_vicuna7b_eval.yaml
│   │   │   ├── caption_vatex_flant5xl_eval.yaml
│   │   │   ├── caption_vatex_flant5xxl_eval.yaml
│   │   │   ├── caption_vatex_vicuna13b_eval.yaml
│   │   │   ├── caption_vatex_vicuna7b_eval.yaml
│   │   │   ├── classification_modelnet40_vicuna13b.yaml
│   │   │   ├── classification_modelnet40_vicuna7b.yaml
│   │   │   ├── classification_snlive_flant5xl.yaml
│   │   │   ├── classification_snlive_flant5xxl.yaml
│   │   │   ├── classification_snlive_vicuna13b.yaml
│   │   │   ├── classification_snlive_vicuna13b_test.yaml
│   │   │   ├── classification_snlive_vicuna7b_test.yaml
│   │   │   ├── classification_snlive_vicuna7b_val.yaml
│   │   │   ├── completion_modelnet40_vicuna13b.yaml
│   │   │   ├── completion_modelnet40_vicuna7b.yaml
│   │   │   ├── qa_msrvtt_flant5xl_eval_test.yaml
│   │   │   ├── qa_msrvtt_flant5xxl_eval_test.yaml
│   │   │   ├── qa_msrvtt_vicuna13b_eval_test.yaml
│   │   │   ├── qa_msrvtt_vicuna7b_eval_test.yaml
│   │   │   ├── qa_msvd_flant5xl_eval.yaml
│   │   │   ├── qa_msvd_flant5xxl_eval.yaml
│   │   │   ├── qa_msvd_vicuna13b_eval.yaml
│   │   │   ├── qa_msvd_vicuna7b_eval.yaml
│   │   │   ├── qa_okvqa_flant5xl_eval.yaml
│   │   │   ├── qa_okvqa_flant5xxl_eval.yaml
│   │   │   ├── qa_okvqa_vicuna13b_eval.yaml
│   │   │   └── qa_okvqa_vicuna7b_eval.yaml
│   │   ├── pnp-vqa/
│   │   │   └── eval/
│   │   │       ├── gqa_eval.yaml
│   │   │       ├── gqa_eval_3b.yaml
│   │   │       ├── gqa_eval_large.yaml
│   │   │       ├── okvqa_eval.yaml
│   │   │       ├── okvqa_eval_3b.yaml
│   │   │       ├── okvqa_eval_large.yaml
│   │   │       ├── vqav2_eval.yaml
│   │   │       ├── vqav2_eval_3b.yaml
│   │   │       ├── vqav2_eval_large.yaml
│   │   │       ├── vqav2_test_eval.yaml
│   │   │       ├── vqav2_test_eval_3b.yaml
│   │   │       └── vqav2_test_eval_large.yaml
│   │   └── xinstruct_blip/
│   │       ├── eval/
│   │       │   ├── discrn/
│   │       │   │   ├── audio_video_caption.yaml
│   │       │   │   ├── audio_video_caption_13b.yaml
│   │       │   │   ├── audio_video_describe.yaml
│   │       │   │   ├── audio_video_describe_13b.yaml
│   │       │   │   ├── audio_video_describe_nocue.yaml
│   │       │   │   ├── audio_video_describe_proj copy.yaml
│   │       │   │   ├── audio_video_describe_proj.yaml
│   │       │   │   ├── audio_video_describe_rand_init.yaml
│   │       │   │   ├── image_3d_caption.yaml
│   │       │   │   ├── image_3d_caption_13b.yaml
│   │       │   │   ├── image_3d_describe.yaml
│   │       │   │   ├── image_3d_describe_13b.yaml
│   │       │   │   ├── image_3d_describe_no_init.yaml
│   │       │   │   ├── image_3d_describe_nocue.yaml
│   │       │   │   └── image_3d_describe_proj.yaml
│   │       │   ├── vicuna13b/
│   │       │   │   ├── audio/
│   │       │   │   │   ├── audiocaps_captioning_qa.yaml
│   │       │   │   │   ├── audiocaps_captioning_test.yaml
│   │       │   │   │   ├── audiocaps_captioning_val.yaml
│   │       │   │   │   ├── clothoQA_captioning.yaml
│   │       │   │   │   ├── clothov1_captioning.yaml
│   │       │   │   │   ├── clothov2_captioning.yaml
│   │       │   │   │   ├── esc50_classification.yaml
│   │       │   │   │   └── esc50_classification_completion.yaml
│   │       │   │   ├── crossmodal/
│   │       │   │   │   ├── musicavqa/
│   │       │   │   │   │   ├── musicavqa_audio_eval.yaml
│   │       │   │   │   │   ├── musicavqa_joint_eval.yaml
│   │       │   │   │   │   └── musicavqa_video_eval.yaml
│   │       │   │   │   └── vatex/
│   │       │   │   │       ├── vatex_audio_captioning.yaml
│   │       │   │   │       ├── vatex_captioning.yaml
│   │       │   │   │       ├── vatex_joint_captioning.yaml
│   │       │   │   │       └── vatex_joint_captioning_interleave.yaml
│   │       │   │   ├── image/
│   │       │   │   │   ├── coco_captioning_test.yaml
│   │       │   │   │   ├── coco_captioning_val.yaml
│   │       │   │   │   ├── flickr30k_captioning.yaml
│   │       │   │   │   ├── gqa_qa.yaml
│   │       │   │   │   ├── nocaps_captioning.yaml
│   │       │   │   │   ├── nocaps_out_domain_captioning.yaml
│   │       │   │   │   ├── okvqa_qa.yaml
│   │       │   │   │   ├── snlive_classification_test.yaml
│   │       │   │   │   ├── snlive_classification_val.yaml
│   │       │   │   │   └── vizwiz_qa.yaml
│   │       │   │   ├── image_with_coco/
│   │       │   │   │   ├── coco_captioning_test.yaml
│   │       │   │   │   ├── coco_captioning_val.yaml
│   │       │   │   │   ├── flickr30k_captioning.yaml
│   │       │   │   │   ├── gqa_qa.yaml
│   │       │   │   │   ├── nocaps_captioning.yaml
│   │       │   │   │   ├── nocaps_out_domain_captioning.yaml
│   │       │   │   │   ├── okvqa_qa.yaml
│   │       │   │   │   ├── snlive_classification_test.yaml
│   │       │   │   │   ├── snlive_classification_val.yaml
│   │       │   │   │   └── vizwiz_qa.yaml
│   │       │   │   ├── pc/
│   │       │   │   │   ├── modelnet40_classification.yaml
│   │       │   │   │   ├── modelnet40_completion.yaml
│   │       │   │   │   ├── objaverse_captioning.yaml
│   │       │   │   │   └── objaverse_qa.yaml
│   │       │   │   ├── video/
│   │       │   │   │   ├── msrvtt_captioning.yaml
│   │       │   │   │   ├── msrvtt_captioning_test.yaml
│   │       │   │   │   ├── msrvtt_captioning_val.yaml
│   │       │   │   │   ├── msrvtt_qa_test.yaml
│   │       │   │   │   ├── msrvtt_qa_val.yaml
│   │       │   │   │   ├── msvd_captioning.yaml
│   │       │   │   │   ├── msvd_qa.yaml
│   │       │   │   │   ├── vatex_audio_captioning.yaml
│   │       │   │   │   ├── vatex_captioning.yaml
│   │       │   │   │   ├── vatex_joint_captioning.yaml
│   │       │   │   │   └── vatex_joint_captioning_interleave.yaml
│   │       │   │   └── video_image/
│   │       │   │       ├── msvd_captioning.yaml
│   │       │   │       ├── msvd_qa.yaml
│   │       │   │       └── vatex_captioning.yaml
│   │       │   ├── vicuna7b/
│   │       │   │   ├── audio/
│   │       │   │   │   ├── audiocaps_captioning_qa.yaml
│   │       │   │   │   ├── audiocaps_captioning_test.yaml
│   │       │   │   │   ├── audiocaps_captioning_val.yaml
│   │       │   │   │   ├── clothoQA_captioning.yaml
│   │       │   │   │   ├── clothov1_captioning.yaml
│   │       │   │   │   ├── clothov2_captioning.yaml
│   │       │   │   │   ├── esc50_classification.yaml
│   │       │   │   │   └── esc50_classification_completion.yaml
│   │       │   │   ├── audio_no_init/
│   │       │   │   │   ├── audiocaps_captioning_qa.yaml
│   │       │   │   │   ├── audiocaps_captioning_test.yaml
│   │       │   │   │   ├── audiocaps_captioning_val.yaml
│   │       │   │   │   ├── clothoQA_captioning.yaml
│   │       │   │   │   ├── clothov1_captioning.yaml
│   │       │   │   │   ├── clothov2_captioning.yaml
│   │       │   │   │   ├── esc50_classification.yaml
│   │       │   │   │   └── esc50_classification_completion.yaml
│   │       │   │   ├── audio_projection_only/
│   │       │   │   │   ├── audiocaps_captioning_qa.yaml
│   │       │   │   │   ├── audiocaps_captioning_test.yaml
│   │       │   │   │   ├── audiocaps_captioning_val.yaml
│   │       │   │   │   ├── clothoQA_captioning.yaml
│   │       │   │   │   ├── clothov1_captioning.yaml
│   │       │   │   │   ├── clothov2_captioning.yaml
│   │       │   │   │   ├── esc50_classification.yaml
│   │       │   │   │   └── esc50_classification_completion.yaml
│   │       │   │   ├── audio_projection_only_nocue/
│   │       │   │   │   ├── audiocaps_captioning_qa.yaml
│   │       │   │   │   ├── audiocaps_captioning_test.yaml
│   │       │   │   │   ├── audiocaps_captioning_val.yaml
│   │       │   │   │   ├── clothoQA_captioning.yaml
│   │       │   │   │   ├── clothov1_captioning.yaml
│   │       │   │   │   ├── clothov2_captioning.yaml
│   │       │   │   │   ├── esc50_classification.yaml
│   │       │   │   │   └── esc50_classification_completion.yaml
│   │       │   │   ├── crossmodal/
│   │       │   │   │   ├── musicavqa/
│   │       │   │   │   │   ├── musicavqa_audio_eval.yaml
│   │       │   │   │   │   ├── musicavqa_joint_eval.yaml
│   │       │   │   │   │   └── musicavqa_video_eval.yaml
│   │       │   │   │   └── vatex/
│   │       │   │   │       ├── vatex_audio_captioning.yaml
│   │       │   │   │       ├── vatex_captioning.yaml
│   │       │   │   │       ├── vatex_joint_captioning.yaml
│   │       │   │   │       └── vatex_joint_captioning_interleave.yaml
│   │       │   │   ├── image/
│   │       │   │   │   ├── coco_captioning_test.yaml
│   │       │   │   │   ├── coco_captioning_val.yaml
│   │       │   │   │   ├── flickr30k_captioning.yaml
│   │       │   │   │   ├── gqa_qa.yaml
│   │       │   │   │   ├── gqa_qa_val.yaml
│   │       │   │   │   ├── nocaps_captioning.yaml
│   │       │   │   │   ├── nocaps_out_domain_captioning.yaml
│   │       │   │   │   ├── okvqa_qa.yaml
│   │       │   │   │   ├── snlive_classification_test.yaml
│   │       │   │   │   ├── snlive_classification_val.yaml
│   │       │   │   │   └── vizwiz_qa.yaml
│   │       │   │   ├── image_full_init/
│   │       │   │   │   ├── coco_captioning_test.yaml
│   │       │   │   │   ├── coco_captioning_val.yaml
│   │       │   │   │   ├── flickr30k_captioning.yaml
│   │       │   │   │   ├── gqa_qa.yaml
│   │       │   │   │   ├── gqa_qa_val.yaml
│   │       │   │   │   ├── nocaps_captioning.yaml
│   │       │   │   │   ├── nocaps_out_domain_captioning.yaml
│   │       │   │   │   ├── okvqa_qa.yaml
│   │       │   │   │   ├── snlive_classification_test.yaml
│   │       │   │   │   ├── snlive_classification_val.yaml
│   │       │   │   │   └── vizwiz_qa.yaml
│   │       │   │   ├── image_no_init/
│   │       │   │   │   ├── coco_captioning_test.yaml
│   │       │   │   │   ├── coco_captioning_val.yaml
│   │       │   │   │   ├── flickr30k_captioning.yaml
│   │       │   │   │   ├── gqa_qa.yaml
│   │       │   │   │   ├── gqa_qa_val.yaml
│   │       │   │   │   ├── nocaps_captioning.yaml
│   │       │   │   │   ├── nocaps_out_domain_captioning.yaml
│   │       │   │   │   ├── okvqa_qa.yaml
│   │       │   │   │   ├── snlive_classification_test.yaml
│   │       │   │   │   ├── snlive_classification_val.yaml
│   │       │   │   │   └── vizwiz_qa.yaml
│   │       │   │   ├── image_pre_coco/
│   │       │   │   │   ├── coco_captioning_test.yaml
│   │       │   │   │   ├── coco_captioning_val.yaml
│   │       │   │   │   ├── flickr30k_captioning.yaml
│   │       │   │   │   ├── gqa_qa.yaml
│   │       │   │   │   ├── nocaps_captioning.yaml
│   │       │   │   │   ├── nocaps_out_domain_captioning.yaml
│   │       │   │   │   ├── okvqa_qa.yaml
│   │       │   │   │   ├── snlive_classification_test.yaml
│   │       │   │   │   ├── snlive_classification_val.yaml
│   │       │   │   │   └── vizwiz_qa.yaml
│   │       │   │   ├── image_projection_only/
│   │       │   │   │   ├── coco_captioning_test.yaml
│   │       │   │   │   ├── coco_captioning_val.yaml
│   │       │   │   │   ├── flickr30k_captioning.yaml
│   │       │   │   │   ├── gqa_qa.yaml
│   │       │   │   │   ├── gqa_qa_val.yaml
│   │       │   │   │   ├── nocaps_captioning.yaml
│   │       │   │   │   ├── nocaps_out_domain_captioning.yaml
│   │       │   │   │   ├── okvqa_qa.yaml
│   │       │   │   │   ├── snlive_classification_test.yaml
│   │       │   │   │   ├── snlive_classification_val.yaml
│   │       │   │   │   └── vizwiz_qa.yaml
│   │       │   │   ├── pc/
│   │       │   │   │   ├── modelnet40_classification.yaml
│   │       │   │   │   ├── modelnet40_completion.yaml
│   │       │   │   │   ├── objaverse_captioning.yaml
│   │       │   │   │   └── objaverse_qa.yaml
│   │       │   │   ├── pc_no_init/
│   │       │   │   │   ├── modelnet40_classification.yaml
│   │       │   │   │   ├── modelnet40_completion.yaml
│   │       │   │   │   ├── objaverse_captioning.yaml
│   │       │   │   │   └── objaverse_qa.yaml
│   │       │   │   ├── pc_projection_only/
│   │       │   │   │   ├── modelnet40_classification.yaml
│   │       │   │   │   ├── modelnet40_completion.yaml
│   │       │   │   │   ├── objaverse_captioning.yaml
│   │       │   │   │   └── objaverse_qa.yaml
│   │       │   │   ├── pc_ulip1/
│   │       │   │   │   ├── modelnet40_classification.yaml
│   │       │   │   │   ├── modelnet40_completion.yaml
│   │       │   │   │   ├── objaverse_captioning.yaml
│   │       │   │   │   └── objaverse_qa.yaml
│   │       │   │   ├── pc_ulip2_scaled_up/
│   │       │   │   │   ├── modelnet40_classification.yaml
│   │       │   │   │   ├── modelnet40_completion.yaml
│   │       │   │   │   ├── objaverse_captioning.yaml
│   │       │   │   │   └── objaverse_qa.yaml
│   │       │   │   ├── pc_ulip_objaverse/
│   │       │   │   │   ├── modelnet40_classification.yaml
│   │       │   │   │   ├── modelnet40_completion.yaml
│   │       │   │   │   ├── objaverse_captioning.yaml
│   │       │   │   │   └── objaverse_qa.yaml
│   │       │   │   ├── pc_ulip_objaverse_shapenet/
│   │       │   │   │   ├── modelnet40_classification.yaml
│   │       │   │   │   ├── modelnet40_completion.yaml
│   │       │   │   │   ├── objaverse_captioning.yaml
│   │       │   │   │   └── objaverse_qa.yaml
│   │       │   │   ├── pc_ulip_shapenet/
│   │       │   │   │   ├── modelnet40_classification.yaml
│   │       │   │   │   ├── modelnet40_completion.yaml
│   │       │   │   │   ├── objaverse_captioning.yaml
│   │       │   │   │   └── objaverse_qa.yaml
│   │       │   │   ├── video/
│   │       │   │   │   ├── msrvtt_captioning_test.yaml
│   │       │   │   │   ├── msrvtt_captioning_val.yaml
│   │       │   │   │   ├── msrvtt_qa_test.yaml
│   │       │   │   │   ├── msrvtt_qa_val.yaml
│   │       │   │   │   ├── msvd_captioning.yaml
│   │       │   │   │   ├── msvd_qa.yaml
│   │       │   │   │   └── vatex_captioning.yaml
│   │       │   │   ├── video_image/
│   │       │   │   │   ├── msvd_captioning.yaml
│   │       │   │   │   ├── msvd_qa.yaml
│   │       │   │   │   └── vatex_captioning.yaml
│   │       │   │   ├── video_image_pre_coco/
│   │       │   │   │   ├── msvd_captioning.yaml
│   │       │   │   │   ├── msvd_qa.yaml
│   │       │   │   │   └── vatex_captioning.yaml
│   │       │   │   └── video_no_upsample/
│   │       │   │       ├── msrvtt_captioning_test.yaml
│   │       │   │       ├── msrvtt_captioning_val.yaml
│   │       │   │       ├── msrvtt_qa_test.yaml
│   │       │   │       ├── msrvtt_qa_val.yaml
│   │       │   │       ├── msvd_captioning.yaml
│   │       │   │       ├── msvd_captioning_up.yaml
│   │       │   │       ├── msvd_qa.yaml
│   │       │   │       ├── msvd_qa_up.yaml
│   │       │   │       ├── vatex_captioning.yaml
│   │       │   │       └── vatex_captioning_up.yaml
│   │       │   └── vicuna7b_nocue/
│   │       │       ├── audio/
│   │       │       │   ├── audiocaps_captioning_qa.yaml
│   │       │       │   ├── audiocaps_captioning_test.yaml
│   │       │       │   ├── audiocaps_captioning_val.yaml
│   │       │       │   ├── clothoQA_captioning.yaml
│   │       │       │   ├── clothov1_captioning.yaml
│   │       │       │   ├── clothov2_captioning.yaml
│   │       │       │   ├── esc50_classification.yaml
│   │       │       │   └── esc50_classification_completion.yaml
│   │       │       ├── crossmodal/
│   │       │       │   ├── musicavqa/
│   │       │       │   │   ├── musicavqa_audio_eval.yaml
│   │       │       │   │   ├── musicavqa_joint_eval.yaml
│   │       │       │   │   └── musicavqa_video_eval.yaml
│   │       │       │   └── vatex/
│   │       │       │       ├── vatex_audio_captioning.yaml
│   │       │       │       ├── vatex_captioning.yaml
│   │       │       │       └── vatex_joint_captioning.yaml
│   │       │       ├── image/
│   │       │       │   ├── coco_captioning_test.yaml
│   │       │       │   ├── coco_captioning_val.yaml
│   │       │       │   ├── flickr30k_captioning.yaml
│   │       │       │   ├── gqa_qa.yaml
│   │       │       │   ├── nocaps_captioning.yaml
│   │       │       │   ├── nocaps_out_domain_captioning.yaml
│   │       │       │   ├── okvqa_qa.yaml
│   │       │       │   ├── snlive_classification_test.yaml
│   │       │       │   ├── snlive_classification_val.yaml
│   │       │       │   └── vizwiz_qa.yaml
│   │       │       ├── pc/
│   │       │       │   ├── modelnet40_classification.yaml
│   │       │       │   ├── modelnet40_completion.yaml
│   │       │       │   ├── objaverse_captioning.yaml
│   │       │       │   └── objaverse_qa.yaml
│   │       │       ├── video/
│   │       │       │   ├── msrvtt_captioning_test.yaml
│   │       │       │   ├── msrvtt_captioning_val.yaml
│   │       │       │   ├── msrvtt_qa_test.yaml
│   │       │       │   ├── msrvtt_qa_val.yaml
│   │       │       │   ├── msvd_captioning.yaml
│   │       │       │   ├── msvd_qa.yaml
│   │       │       │   └── vatex_captioning.yaml
│   │       │       └── video_image/
│   │       │           ├── msvd_captioning.yaml
│   │       │           ├── msvd_qa.yaml
│   │       │           └── vatex_captioning.yaml
│   │       ├── prompt_variation/
│   │       │   └── nocaps/
│   │       │       ├── instructblip/
│   │       │       │   ├── original.yaml
│   │       │       │   ├── template_1.yaml
│   │       │       │   ├── template_2.yaml
│   │       │       │   ├── template_3.yaml
│   │       │       │   ├── template_4.yaml
│   │       │       │   └── template_5.yaml
│   │       │       └── xinstructblip/
│   │       │           ├── template_1.yaml
│   │       │           ├── template_2.yaml
│   │       │           ├── template_3.yaml
│   │       │           ├── template_4.yaml
│   │       │           └── template_5.yaml
│   │       └── train/
│   │           ├── vicuna13b/
│   │           │   ├── audio_training.yaml
│   │           │   ├── audio_training_continue.yaml
│   │           │   ├── image_train.yaml
│   │           │   ├── image_train_continue.yaml
│   │           │   ├── pc_training.yaml
│   │           │   └── video_training.yaml
│   │           ├── vicuna7b/
│   │           │   ├── audio_training.yaml
│   │           │   ├── audio_training_improved.yaml
│   │           │   ├── audio_training_no_init.yaml
│   │           │   ├── audio_training_projection_only.yaml
│   │           │   ├── audio_training_projection_only_nocue.yaml
│   │           │   ├── image_train.yaml
│   │           │   ├── image_train_improved.yaml
│   │           │   ├── image_train_no_init.yaml
│   │           │   ├── image_train_projection_only.yaml
│   │           │   ├── lora_training.yaml
│   │           │   ├── pc_training.yaml
│   │           │   ├── pc_training_improved.yaml
│   │           │   ├── pc_training_no_init.yaml
│   │           │   ├── pc_training_projection_only.yaml
│   │           │   ├── pc_training_projection_only_nocue.yaml
│   │           │   ├── pc_training_scaled_up.yaml
│   │           │   ├── pc_training_ulip1.yaml
│   │           │   ├── pc_training_ulip2_objaverse_shapenet_k_1.yaml
│   │           │   ├── pc_training_ulip_objaverse.yaml
│   │           │   ├── pc_training_ulip_shapenet.yaml
│   │           │   ├── video_training.yaml
│   │           │   └── video_training_no_msrvtt_upsample.yaml
│   │           └── vicuna7b_nocue/
│   │               ├── audio_training.yaml
│   │               ├── image_train.yaml
│   │               ├── pc_training.yaml
│   │               └── video_training.yaml
│   ├── runners/
│   │   ├── __init__.py
│   │   ├── runner_base.py
│   │   └── runner_iter.py
│   └── tasks/
│       ├── __init__.py
│       ├── base_task.py
│       ├── captioning.py
│       ├── dialogue.py
│       ├── image_text_pretrain.py
│       ├── multimodal_classification.py
│       ├── retrieval.py
│       ├── text_to_image_generation.py
│       ├── vqa.py
│       └── vqa_reading_comprehension.py
├── projects/
│   ├── blip-diffusion/
│   │   ├── README.md
│   │   └── notebooks/
│   │       ├── editing_real_finetuned.ipynb
│   │       ├── editing_real_zeroshot.ipynb
│   │       ├── editing_synthetic_zeroshot.ipynb
│   │       ├── editing_tryon_zeroshot.ipynb
│   │       ├── generation_finetuned_dog.ipynb
│   │       ├── generation_zeroshot.ipynb
│   │       └── stylization.ipynb
│   ├── blip2/
│   │   └── README.md
│   ├── img2llm-vqa/
│   │   ├── README.md
│   │   ├── img2llm_vqa.ipynb
│   │   └── img2llm_vqa.py
│   ├── img2prompt-vqa/
│   │   └── README.md
│   ├── instructblip/
│   │   ├── README.md
│   │   └── run_demo.py
│   ├── pnp-vqa/
│   │   ├── README.md
│   │   └── pnp_vqa.ipynb
│   └── xinstructblip/
│       ├── README.md
│       ├── data_aug/
│       │   ├── 3d_qa_data_generation.py
│       │   └── audio_qa_data_generation.py
│       ├── demo/
│       │   ├── configs/
│       │   │   ├── vicuna13b.yaml
│       │   │   ├── vicuna7b.yaml
│       │   │   ├── vicuna7b_blip_init.yaml
│       │   │   ├── vicuna7b_no_init.yaml
│       │   │   ├── vicuna7b_nocue.yaml
│       │   │   ├── vicuna7b_projection.yaml
│       │   │   ├── vicuna7b_rand.yaml
│       │   │   └── vicuna7b_v2.yaml
│       │   ├── demo.ipynb
│       │   ├── examples/
│       │   │   └── point_cloud/
│       │   │       └── banana.glb
│       │   └── run_demo.py
│       ├── discrn/
│       │   ├── caption_baseline/
│       │   │   ├── predict_audio.py
│       │   │   ├── predict_image.py
│       │   │   ├── predict_pc.py
│       │   │   ├── predict_video.py
│       │   │   └── render_images.py
│       │   └── data_generation/
│       │       ├── audiocaps_video_audio.py
│       │       └── objaverse_img_3d.py
│       └── modelnet_baseline/
│           └── render_images.py
├── pyproject.toml
├── requirements.txt
├── run_scripts/
│   ├── albef/
│   │   ├── eval/
│   │   │   ├── eval_albef_nlvr.sh
│   │   │   ├── eval_albef_ve.sh
│   │   │   ├── eval_coco_retrieval.sh
│   │   │   ├── eval_flickr30k_retrieval.sh
│   │   │   ├── test_albef_vqa.sh
│   │   │   └── val_albef_vqa.sh
│   │   └── train/
│   │       ├── pretrain.sh
│   │       ├── train_aokvqa_albef.sh
│   │       ├── train_coco_retrieval_albef.sh
│   │       ├── train_flickr30k_retrieval_albef.sh
│   │       ├── train_nlvr_albef.sh
│   │       ├── train_okvqa_albef.sh
│   │       ├── train_ve_albef.sh
│   │       └── train_vqa_albef.sh
│   ├── alpro/
│   │   ├── eval/
│   │   │   ├── eval_didemo_ret.sh
│   │   │   ├── eval_msrvtt_qa.sh
│   │   │   ├── eval_msrvtt_ret.sh
│   │   │   └── eval_msvd_qa.sh
│   │   └── train/
│   │       ├── train_didemo_ret.sh
│   │       ├── train_msrvtt_qa.sh
│   │       ├── train_msrvtt_ret.sh
│   │       └── train_msvd_qa.sh
│   ├── blip/
│   │   ├── eval/
│   │   │   ├── eval_aokvqa.sh
│   │   │   ├── eval_coco_cap.sh
│   │   │   ├── eval_coco_cap_large.sh
│   │   │   ├── eval_nlvr.sh
│   │   │   ├── eval_nocaps.sh
│   │   │   ├── eval_okvqa.sh
│   │   │   ├── eval_ret_coco.sh
│   │   │   ├── eval_ret_flickr.sh
│   │   │   └── validate_vqa.sh
│   │   └── train/
│   │       ├── pretrain.sh
│   │       ├── train_aokvqa.sh
│   │       ├── train_caption_coco.sh
│   │       ├── train_caption_coco_large.sh
│   │       ├── train_caption_coco_large_iters.sh
│   │       ├── train_nlvr.sh
│   │       ├── train_okvqa.sh
│   │       ├── train_retrieval_coco.sh
│   │       ├── train_retrieval_flickr.sh
│   │       └── train_vqa.sh
│   ├── blip-diffusion/
│   │   ├── train_db.sh
│   │   ├── train_db_dog.sh
│   │   ├── train_db_jacket_s.sh
│   │   ├── train_db_pink_dress.sh
│   │   └── train_db_shein_jacket.sh
│   ├── blip2/
│   │   ├── eval/
│   │   │   ├── eval_cap_coco_flant5xl.sh
│   │   │   ├── eval_cap_coco_opt2.7b.sh
│   │   │   ├── eval_cap_coco_opt6.7b.sh
│   │   │   ├── eval_gqa_zeroshot_flant5xl.sh
│   │   │   ├── eval_okvqa_zeroshot_flant5xl.sh
│   │   │   ├── eval_ret_coco.sh
│   │   │   ├── eval_ret_flickr.sh
│   │   │   ├── validate_vqa_zeroshot_flant5xl.sh
│   │   │   └── validate_vqa_zeroshot_opt.sh
│   │   └── train/
│   │       ├── pretrain_stage1.sh
│   │       ├── pretrain_stage2.sh
│   │       ├── train_caption_coco.sh
│   │       └── train_retrieval_coco.sh
│   ├── clip/
│   │   └── eval/
│   │       ├── eval_clip_ret_coco.sh
│   │       ├── eval_clip_ret_flickr.sh
│   │       └── eval_clip_zs_imnet.sh
│   ├── gpt/
│   │   ├── eval/
│   │   │   └── eval_video_dialogue_avsd.sh
│   │   └── train/
│   │       └── train_video_dialogue_avsd.sh
│   ├── pnp-vqa/
│   │   └── eval/
│   │       ├── eval_gqa.sh
│   │       ├── eval_gqa_3b.sh
│   │       ├── eval_gqa_large.sh
│   │       ├── eval_okvqa.sh
│   │       ├── eval_okvqa_3b.sh
│   │       ├── eval_okvqa_large.sh
│   │       ├── eval_vqav2.sh
│   │       ├── eval_vqav2_3b.sh
│   │       ├── eval_vqav2_large.sh
│   │       ├── eval_vqav2_test.sh
│   │       ├── eval_vqav2_test_3b.sh
│   │       └── eval_vqav2_test_large.sh
│   ├── run_browser.sh
│   └── run_demo.sh
├── setup.py
├── tests/
│   └── models/
│       ├── test_albef.py
│       ├── test_blip.py
│       ├── test_blip2.py
│       └── test_pnp_vqa.py
└── train.py

================================================
FILE CONTENTS
================================================

================================================
FILE: .github/workflows/docs.yaml
================================================
name: docs

on:
  push:
    branches: [ main ]
  pull_request:
    branches: [ main ]
  release:
    types: [ published ]

jobs:
  build:

    runs-on: ubuntu-18.04

    steps:
    - uses: actions/checkout@v2
      with:
        fetch-depth: 0
    - name: Set up Python
      uses: actions/setup-python@v2
      with:
        python-version: '3.8'
    - name: Install dependencies
      run: |
        python -m pip install --upgrade pip setuptools wheel
        sudo apt-get update
        sudo apt-get install openjdk-11-jdk
        sudo apt-get install pandoc
    - name: Build Sphinx docs
      run: |
        docs/build_docs.sh
    - name: Deploy to gh-pages
      uses: peaceiris/actions-gh-pages@v3
      if: ${{ github.ref == 'refs/heads/main' || github.event_name == 'release' }}
      with:
        github_token: ${{ secrets.GITHUB_TOKEN }}
        publish_dir: docs/_build/html

================================================
FILE: .gitignore
================================================
# Byte-compiled / optimized / DLL files
__pycache__/
*.py[cod]
*$py.class

# C extensions
*.so

# Distribution / packaging
.Python
build/
develop-eggs/
dist/
downloads/
eggs/
.eggs/
lib/
lib64/
parts/
sdist/
var/
wheels/
share/python-wheels/
*.egg-info/
.installed.cfg
*.egg
MANIFEST

# PyInstaller
#  Usually these files are written by a python script from a template
#  before PyInstaller builds the exe, so as to inject date/other infos into it.
*.manifest
*.spec

# Installer logs
pip-log.txt
pip-delete-this-directory.txt

# Unit test / coverage reports
htmlcov/
.tox/
.nox/
.coverage
.coverage.*
.cache
nosetests.xml
coverage.xml
*.cover
*.py,cover
.hypothesis/
.pytest_cache/
cover/

# Translations
*.mo
*.pot

# Django stuff:
*.log
local_settings.py
db.sqlite3
db.sqlite3-journal

# Flask stuff:
instance/
.webassets-cache

# Scrapy stuff:
.scrapy

# Sphinx documentation
docs/_build/

# PyBuilder
.pybuilder/
target/

# Jupyter Notebook
.ipynb_checkpoints

# IPython
profile_default/
ipython_config.py

# pyenv
#   For a library or package, you might want to ignore these files since the code is
#   intended to run in multiple environments; otherwise, check them in:
# .python-version

# pipenv
#   According to pypa/pipenv#598, it is recommended to include Pipfile.lock in version control.
#   However, in case of collaboration, if having platform-specific dependencies or dependencies
#   having no cross-platform support, pipenv may install dependencies that don't work, or not
#   install all needed dependencies.
#Pipfile.lock

# poetry
#   Similar to Pipfile.lock, it is generally recommended to include poetry.lock in version control.
#   This is especially recommended for binary packages to ensure reproducibility, and is more
#   commonly ignored for libraries.
#   https://python-poetry.org/docs/basic-usage/#commit-your-poetrylock-file-to-version-control
#poetry.lock

# PEP 582; used by e.g. github.com/David-OConnor/pyflow
__pypackages__/

# Celery stuff
celerybeat-schedule
celerybeat.pid

# SageMath parsed files
*.sage.py

# Environments
.env
.venv
env/
venv/
ENV/
env.bak/
venv.bak/

# Spyder project settings
.spyderproject
.spyproject

# Rope project settings
.ropeproject

# mkdocs documentation
/site

# mypy
.mypy_cache/
.dmypy.json
dmypy.json

# Pyre type checker
.pyre/

# pytype static type analyzer
.pytype/

# Cython debug symbols
cython_debug/

# project-specific
output/
debug*/
*.bak
*.dir
*.dat
*.tsv
*.gz
*.csv
*.p
*.pdf

cache/


================================================
FILE: .pre-commit-config.yaml
================================================
repos:
-   repo: https://github.com/pre-commit/pre-commit-hooks
    rev: v4.1.0
    hooks:
    -   id: trailing-whitespace
    -   id: check-ast
    -   id: no-commit-to-branch
        args: ['--branch=main']
    -   id: check-added-large-files
        args: ['--maxkb=5000']
    -   id: end-of-file-fixer

-   repo: https://github.com/psf/black
    rev: stable
    hooks:
    - id: black
      language_version: python3.8

-   repo: https://github.com/PyCQA/flake8
    rev: 3.9.2
    hooks:
    -   id: flake8
        args: [
            # only error for syntax errors and undefined names
            "--select=E9,F63,F7,F82",
        ]


================================================
FILE: CODEOWNERS
================================================
# Comment line immediately above ownership line is reserved for related gus information. Please be careful while editing.
#ECCN:Open Source

================================================
FILE: CODE_OF_CONDUCT.md
================================================
# Salesforce Open Source Community Code of Conduct

## About the Code of Conduct

Equality is a core value at Salesforce. We believe a diverse and inclusive
community fosters innovation and creativity, and are committed to building a
culture where everyone feels included.

Salesforce open-source projects are committed to providing a friendly, safe, and
welcoming environment for all, regardless of gender identity and expression,
sexual orientation, disability, physical appearance, body size, ethnicity, nationality, 
race, age, religion, level of experience, education, socioeconomic status, or 
other similar personal characteristics.

The goal of this code of conduct is to specify a baseline standard of behavior so
that people with different social values and communication styles can work
together effectively, productively, and respectfully in our open source community.
It also establishes a mechanism for reporting issues and resolving conflicts.

All questions and reports of abusive, harassing, or otherwise unacceptable behavior
in a Salesforce open-source project may be reported by contacting the Salesforce
Open Source Conduct Committee at ossconduct@salesforce.com.

## Our Pledge

In the interest of fostering an open and welcoming environment, we as
contributors and maintainers pledge to making participation in our project and
our community a harassment-free experience for everyone, regardless of gender 
identity and expression, sexual orientation, disability, physical appearance, 
body size, ethnicity, nationality, race, age, religion, level of experience, education, 
socioeconomic status, or other similar personal characteristics.

## Our Standards

Examples of behavior that contributes to creating a positive environment
include:

* Using welcoming and inclusive language
* Being respectful of differing viewpoints and experiences
* Gracefully accepting constructive criticism
* Focusing on what is best for the community
* Showing empathy toward other community members

Examples of unacceptable behavior by participants include:

* The use of sexualized language or imagery and unwelcome sexual attention or
advances
* Personal attacks, insulting/derogatory comments, or trolling
* Public or private harassment
* Publishing, or threatening to publish, others' private information—such as
a physical or electronic address—without explicit permission
* Other conduct which could reasonably be considered inappropriate in a
professional setting
* Advocating for or encouraging any of the above behaviors

## Our Responsibilities

Project maintainers are responsible for clarifying the standards of acceptable
behavior and are expected to take appropriate and fair corrective action in
response to any instances of unacceptable behavior.

Project maintainers have the right and responsibility to remove, edit, or
reject comments, commits, code, wiki edits, issues, and other contributions
that are not aligned with this Code of Conduct, or to ban temporarily or
permanently any contributor for other behaviors that they deem inappropriate,
threatening, offensive, or harmful.

## Scope

This Code of Conduct applies both within project spaces and in public spaces
when an individual is representing the project or its community. Examples of
representing a project or community include using an official project email
address, posting via an official social media account, or acting as an appointed
representative at an online or offline event. Representation of a project may be
further defined and clarified by project maintainers.

## Enforcement

Instances of abusive, harassing, or otherwise unacceptable behavior may be
reported by contacting the Salesforce Open Source Conduct Committee 
at ossconduct@salesforce.com. All complaints will be reviewed and investigated 
and will result in a response that is deemed necessary and appropriate to the 
circumstances. The committee is obligated to maintain confidentiality with 
regard to the reporter of an incident. Further details of specific enforcement 
policies may be posted separately.

Project maintainers who do not follow or enforce the Code of Conduct in good
faith may face temporary or permanent repercussions as determined by other
members of the project's leadership and the Salesforce Open Source Conduct 
Committee.

## Attribution

This Code of Conduct is adapted from the [Contributor Covenant][contributor-covenant-home],
version 1.4, available at https://www.contributor-covenant.org/version/1/4/code-of-conduct.html. 
It includes adaptions and additions from [Go Community Code of Conduct][golang-coc], 
[CNCF Code of Conduct][cncf-coc], and [Microsoft Open Source Code of Conduct][microsoft-coc].

This Code of Conduct is licensed under the [Creative Commons Attribution 3.0 License][cc-by-3-us].

[contributor-covenant-home]: https://www.contributor-covenant.org (https://www.contributor-covenant.org/)
[golang-coc]: https://golang.org/conduct
[cncf-coc]: https://github.com/cncf/foundation/blob/master/code-of-conduct.md
[microsoft-coc]: https://opensource.microsoft.com/codeofconduct/
[cc-by-3-us]: https://creativecommons.org/licenses/by/3.0/us/

================================================
FILE: LICENSE.txt
================================================
BSD 3-Clause License

Copyright (c) 2022 Salesforce, Inc.
All rights reserved.

Redistribution and use in source and binary forms, with or without modification, are permitted provided that the following conditions are met:

1. Redistributions of source code must retain the above copyright notice, this list of conditions and the following disclaimer.

2. Redistributions in binary form must reproduce the above copyright notice, this list of conditions and the following disclaimer in the documentation and/or other materials provided with the distribution.

3. Neither the name of Salesforce.com nor the names of its contributors may be used to endorse or promote products derived from this software without specific prior written permission.

THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS "AS IS" AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT HOLDER OR CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.


================================================
FILE: MANIFEST.in
================================================
recursive-include lavis/configs *.yaml *.json
recursive-include lavis/projects *.yaml *.json

recursive-exclude lavis/datasets/download_scripts *
recursive-exclude lavis/output *

include requirements.txt
include lavis/models/clip_models/bpe_simple_vocab_16e6.txt.gz


================================================
FILE: README.md
================================================
<p align="center">
    <br>
    <img src="docs/_static/logo_final.png" width="400"/>
    <br>
</p>

<div align="center">
  <a href="https://github.com/salesforce/LAVIS/releases"><img alt="Latest Release" src="https://img.shields.io/github/release/salesforce/LAVIS.svg" /></a>
  <a href="https://opensource.salesforce.com/LAVIS/index.html">
  <img alt="docs" src="https://github.com/salesforce/LAVIS/actions/workflows/docs.yaml/badge.svg"/></a>
  <a href="https://opensource.org/licenses/BSD-3-Clause">
  <img alt="license" src="https://img.shields.io/badge/License-BSD_3--Clause-blue.svg"/>
  </a> 
  <a href="https://pepy.tech/project/salesforce-lavis">
  <img alt="Downloads" src="https://pepy.tech/badge/salesforce-lavis">
  </a>
</div>

<div align="center">
<a href="https://opensource.salesforce.com/LAVIS//latest/benchmark.html">Benchmark</a>,
<a href="https://arxiv.org/abs/2209.09019">Technical Report</a>,
<a href="https://opensource.salesforce.com/LAVIS//latest/index.html">Documentation</a>,
<a href="https://github.com/salesforce/LAVIS/tree/main/examples">Jupyter Notebook Examples</a>,
<a href="https://blog.salesforceairesearch.com/lavis-language-vision-library/">Blog</a>
</div>

# LAVIS - A Library for Language-Vision Intelligence

## What's New: 🎉 
  * [Model Release] November 2023, released implementation of **X-InstructBLIP** <br>
  [Paper](https://arxiv.org/pdf/2311.18799.pdf), [Project Page](https://github.com/salesforce/LAVIS/tree/main/projects/xinstructblip), [Website](https://artemisp.github.io/X-InstructBLIP-page/), [![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/salesforce/LAVIS/blob/main/projects/xinstructblip/demo/run_demo.ipynb)
  > A simple, yet effective, cross-modality framework built atop frozen LLMs that allows the integration of various modalities (image, video, audio, 3D) without extensive modality-specific customization.
  * [Model Release] July 2023, released implementation of **BLIP-Diffusion** <br>
  [Paper](https://arxiv.org/abs/2305.14720), [Project Page](https://github.com/salesforce/LAVIS/tree/main/projects/blip-diffusion), [Website](https://dxli94.github.io/BLIP-Diffusion-website/)
  > A text-to-image generation model that fine-tunes up to 20x faster than DreamBooth. It also enables zero-shot subject-driven generation and editing.
  * [Model Release] May 2023, released implementation of **InstructBLIP** <br>
  [Paper](https://arxiv.org/abs/2305.06500), [Project Page](https://github.com/salesforce/LAVIS/tree/main/projects/instructblip)    
  > A new vision-language instruction-tuning framework using BLIP-2 models, achieving state-of-the-art zero-shot generalization performance on a wide range of vision-language tasks.
  * [Model Release] Jan 2023, released implementation of **BLIP-2** <br>
  [Paper](https://arxiv.org/abs/2301.12597), [Project Page](https://github.com/salesforce/LAVIS/tree/main/projects/blip2), [![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/salesforce/LAVIS/blob/main/examples/blip2_instructed_generation.ipynb)
  > A generic and efficient pre-training strategy that easily harvests development of pretrained vision models and large language models (LLMs) for vision-language pretraining. BLIP-2 beats Flamingo on zero-shot VQAv2 (**65.0** vs **56.3**), establishing new state-of-the-art on zero-shot captioning (on NoCaps **121.6** CIDEr score vs previous best **113.2**). In addition, equipped with powerful LLMs (e.g. OPT, FlanT5), BLIP-2 also unlocks the new **zero-shot instructed vision-to-language generation** capabilities for various interesting applications!
  * Jan 2023, LAVIS is now available on [PyPI](https://pypi.org/project/salesforce-lavis/) for installation!
  * [Model Release] Dec 2022, released implementation of **Img2LLM-VQA** (**CVPR 2023**, _"From Images to Textual Prompts: Zero-shot VQA with Frozen Large Language Models"_, by Jiaxian Guo et al) <br>
  [Paper](https://arxiv.org/pdf/2212.10846.pdf), [Project Page](https://github.com/salesforce/LAVIS/tree/main/projects/img2llm-vqa), [![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/salesforce/LAVIS/blob/main/projects/img2llm-vqa/img2llm_vqa.ipynb)
  > A plug-and-play module that enables off-the-shelf use of Large Language Models (LLMs) for visual question answering (VQA). Img2LLM-VQA surpasses Flamingo on zero-shot VQA on VQAv2 (61.9 vs 56.3), while in contrast requiring no end-to-end training! 
  * [Model Release] Oct 2022, released implementation of **PNP-VQA** (**EMNLP Findings 2022**, _"Plug-and-Play VQA: Zero-shot VQA by Conjoining Large Pretrained Models with Zero Training"_, by Anthony T.M.H. et al), <br> 
  [Paper](https://arxiv.org/abs/2210.08773), [Project Page](https://github.com/salesforce/LAVIS/tree/main/projects/pnp-vqa), [![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/salesforce/LAVIS/blob/main/projects/pnp-vqa/pnp_vqa.ipynb)
  >  A modular zero-shot VQA framework that requires no training of pretrained language models (PLMs), achieving SoTA zero-shot VQA performance. 

## Technical Report and Citing LAVIS
You can find more details in our [technical report](https://arxiv.org/abs/2209.09019).

**If you're using LAVIS in your research or applications, please cite it using this BibTeX**:
```bibtex
@inproceedings{li-etal-2023-lavis,
    title = "{LAVIS}: A One-stop Library for Language-Vision Intelligence",
    author = "Li, Dongxu  and
      Li, Junnan  and
      Le, Hung  and
      Wang, Guangsen  and
      Savarese, Silvio  and
      Hoi, Steven C.H.",
    booktitle = "Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 3: System Demonstrations)",
    month = jul,
    year = "2023",
    address = "Toronto, Canada",
    publisher = "Association for Computational Linguistics",
    url = "https://aclanthology.org/2023.acl-demo.3",
    pages = "31--41",
    abstract = "We introduce LAVIS, an open-source deep learning library for LAnguage-VISion research and applications. LAVIS aims to serve as a one-stop comprehensive library that brings recent advancements in the language-vision field accessible for researchers and practitioners, as well as fertilizing future research and development. It features a unified interface to easily access state-of-the-art image-language, video-language models and common datasets. LAVIS supports training, evaluation and benchmarking on a rich variety of tasks, including multimodal classification, retrieval, captioning, visual question answering, dialogue and pre-training. In the meantime, the library is also highly extensible and configurable, facilitating future development and customization. In this technical report, we describe design principles, key components and functionalities of the library, and also present benchmarking results across common language-vision tasks.",
}
```


## Table of Contents
  - [Introduction](#introduction)
  - [Installation](#installation)
  - [Getting Started](#getting-started)
    - [Model Zoo](#model-zoo)
    - [Image Captioning](#image-captioning)
    - [Visual question answering (VQA)](#visual-question-answering-vqa)
    - [Unified Feature Extraction Interface](#unified-feature-extraction-interface)
    - [Load Datasets](#load-datasets)
  - [Jupyter Notebook Examples](#jupyter-notebook-examples)
  - [Resources and Tools](#resources-and-tools)
  - [Documentations](#documentations)
  - [Ethical and Responsible Use](#ethical-and-responsible-use)
  - [Technical Report and Citing LAVIS](#technical-report-and-citing-lavis)
  - [License](#license)

## Introduction
LAVIS is a Python deep learning library for LAnguage-and-VISion intelligence research and applications. This library aims to provide engineers and researchers with a one-stop solution to rapidly develop models for their specific multimodal scenarios, and benchmark them across standard and customized datasets.
It features a unified interface design to access
- **10+** tasks
(retrieval, captioning, visual question answering, multimodal classification etc.);
- **20+** datasets (COCO, Flickr, Nocaps, Conceptual
Captions, SBU, etc.);
- **30+** pretrained weights of state-of-the-art foundation language-vision models and their task-specific adaptations, including [ALBEF](https://arxiv.org/pdf/2107.07651.pdf),
[BLIP](https://arxiv.org/pdf/2201.12086.pdf), [ALPRO](https://arxiv.org/pdf/2112.09583.pdf), [CLIP](https://arxiv.org/pdf/2103.00020.pdf).
<p align="center">
    <br>
    <img src="assets/demo-6.png"/>
    <br>
</p>

Key features of LAVIS include:

- **Unified and Modular Interface**: makes it easy to leverage and repurpose existing modules (datasets, models, preprocessors), and to add new ones.

- **Easy Off-the-shelf Inference and Feature Extraction**: readily available pre-trained models let you take advantage of state-of-the-art multimodal understanding and generation capabilities on your own data.

- **Reproducible Model Zoo and Training Recipes**: easily replicate and extend state-of-the-art models on existing and new tasks.

- **Dataset Zoo and Automatic Downloading Tools**: it can be a hassle to prepare the many language-vision datasets. LAVIS provides automatic downloading scripts to help prepare a large variety of datasets and their annotations.


The following table shows the supported tasks, datasets and models in our library. This is a continuing effort and we are working on further growing the list.

|                  Tasks                   |     Supported Models     |             Supported Datasets             |
| :--------------------------------------: | :----------------------: | :----------------------------------------: |
|         Image-text Pre-training          |       ALBEF, BLIP        | COCO, VisualGenome, SBU, ConceptualCaptions |
|           Image-text Retrieval           |    ALBEF, BLIP, CLIP     |              COCO, Flickr30k               |
|           Text-image Retrieval           |    ALBEF, BLIP, CLIP     |              COCO, Flickr30k               |
|        Visual Question Answering         |       ALBEF, BLIP        |           VQAv2, OKVQA, A-OKVQA            |
|             Image Captioning             |           BLIP           |                COCO, NoCaps                |
|           Image Classification           |           CLIP           |                  ImageNet                  |
| Natural Language Visual Reasoning (NLVR) |       ALBEF, BLIP        |                   NLVR2                    |
|          Visual Entailment (VE)          |          ALBEF           |                  SNLI-VE                   |
|             Visual Dialogue              |           BLIP           |                  VisDial                   |
|           Video-text Retrieval           |       BLIP, ALPRO        |               MSRVTT, DiDeMo               |
|           Text-video Retrieval           |       BLIP, ALPRO        |               MSRVTT, DiDeMo               |
|    Video Question Answering (VideoQA)    |       BLIP, ALPRO        |                MSRVTT, MSVD                |
|              Video Dialogue              |         VGD-GPT          |                    AVSD                    |
|      Multimodal Feature Extraction       | ALBEF, CLIP, BLIP, ALPRO |                 customized                 |
|         Text-to-image Generation         |      [COMING SOON]       |                                            |

## Installation

1. (Optional) Create a conda environment

```bash
conda create -n lavis python=3.8
conda activate lavis
```

2. Install from [PyPI](https://pypi.org/project/salesforce-lavis/)
```bash
pip install salesforce-lavis
```
    
3. Or, for development, you may build from source

```bash
git clone https://github.com/salesforce/LAVIS.git
cd LAVIS
pip install -e .
```
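
After either install route, a quick check can confirm that the package is visible to the current Python environment. The snippet below is a minimal sketch using only the standard library. The helper name `installed_version` is hypothetical and not part of LAVIS, and the distribution name `salesforce-lavis` is taken from the `pip install` command above.

```python
from importlib.metadata import PackageNotFoundError, version
from typing import Optional


def installed_version(dist_name: str = "salesforce-lavis") -> Optional[str]:
    """Return the installed version of a distribution, or None if it is absent.

    Hypothetical helper for illustration; it only queries package metadata
    and does not import LAVIS or load any model.
    """
    try:
        return version(dist_name)
    except PackageNotFoundError:
        return None


if __name__ == "__main__":
    found = installed_version()
    if found is None:
        print("salesforce-lavis is not installed in this environment")
    else:
        print(f"salesforce-lavis {found} is installed")
```

This checks metadata only, so it runs quickly and does not need a GPU. It does not verify that the dependencies import cleanly.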

## Getting Started
### Model Zoo
The model zoo summarizes the models supported in LAVIS. To view it:
```python
from lavis.models import model_zoo
print(model_zoo)
# ==================================================
# Architectures                  Types
# ==================================================
# albef_classification           ve
# albef_feature_extractor        base
# albef_nlvr                     nlvr
# albef_pretrain                 base
# albef_retrieval                coco, flickr
# albef_vqa                      vqav2
# alpro_qa                       msrvtt, msvd
# alpro_retrieval                msrvtt, didemo
# blip_caption                   base_coco, large_coco
# blip_classification            base
# blip_feature_extractor         base
# blip_nlvr                      nlvr
# blip_pretrain                  base
# blip_retrieval                 coco, flickr
# blip_vqa                       vqav2, okvqa, aokvqa
# clip_feature_extractor         ViT-B-32, ViT-B-16, ViT-L-14, ViT-L-14-336, RN50
# clip                           ViT-B-32, ViT-B-16, ViT-L-14, ViT-L-14-336, RN50
# gpt_dialogue                   base
```

Let’s see how to use models in LAVIS to perform inference on example data. We first load a sample image from a local file.

```python
import torch
from PIL import Image
# setup device to use
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
# load sample image
raw_image = Image.open("docs/_static/merlion.png").convert("RGB")
```

This example image shows [Merlion park](https://en.wikipedia.org/wiki/Merlion) ([source](https://theculturetrip.com/asia/singapore/articles/what-exactly-is-singapores-merlion-anyway/)), a landmark in Singapore.


### Image Captioning
In this example, we use the BLIP model to generate a caption for the image. To make inference even easier, we also associate each
pre-trained model with its preprocessors (transforms), accessed via ``load_model_and_preprocess()``.

```python
import torch
from lavis.models import load_model_and_preprocess
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
# loads BLIP caption base model, with finetuned checkpoints on MSCOCO captioning dataset.
# this also loads the associated image processors
model, vis_processors, _ = load_model_and_preprocess(name="blip_caption", model_type="base_coco", is_eval=True, device=device)
# preprocess the image
# vis_processors stores image transforms for "train" and "eval" (validation / testing / inference)
image = vis_processors["eval"](raw_image).unsqueeze(0).to(device)
# generate caption
model.generate({"image": image})
# ['a large fountain spewing water into the air']
```

### Visual question answering (VQA)
The BLIP model can answer free-form questions about images in natural language.
To access the VQA model, simply replace the ``name`` and ``model_type`` arguments
passed to ``load_model_and_preprocess()``.

```python
from lavis.models import load_model_and_preprocess
model, vis_processors, txt_processors = load_model_and_preprocess(name="blip_vqa", model_type="vqav2", is_eval=True, device=device)
# ask a random question.
question = "Which city is this photo taken?"
image = vis_processors["eval"](raw_image).unsqueeze(0).to(device)
question = txt_processors["eval"](question)
model.predict_answers(samples={"image": image, "text_input": question}, inference_method="generate")
# ['singapore']
```

### Unified Feature Extraction Interface

LAVIS provides a unified interface to extract features from each architecture. 
To extract features, we load the feature extractor variants of each model.
The multimodal feature can be used for multimodal classification.
The low-dimensional unimodal features can be used to compute cross-modal similarity.


```python
from lavis.models import load_model_and_preprocess
model, vis_processors, txt_processors = load_model_and_preprocess(name="blip_feature_extractor", model_type="base", is_eval=True, device=device)
caption = "a large fountain spewing water into the air"
image = vis_processors["eval"](raw_image).unsqueeze(0).to(device)
text_input = txt_processors["eval"](caption)
sample = {"image": image, "text_input": [text_input]}

features_multimodal = model.extract_features(sample)
print(features_multimodal.multimodal_embeds.shape)
# torch.Size([1, 12, 768]), use features_multimodal.multimodal_embeds[:,0,:] for multimodal classification tasks

features_image = model.extract_features(sample, mode="image")
features_text = model.extract_features(sample, mode="text")
print(features_image.image_embeds.shape)
# torch.Size([1, 197, 768])
print(features_text.text_embeds.shape)
# torch.Size([1, 12, 768])

# low-dimensional projected features
print(features_image.image_embeds_proj.shape)
# torch.Size([1, 197, 256])
print(features_text.text_embeds_proj.shape)
# torch.Size([1, 12, 256])
similarity = features_image.image_embeds_proj[:,0,:] @ features_text.text_embeds_proj[:,0,:].t()
print(similarity)
# tensor([[0.2622]])
```
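
The similarity score above is a dot product between the first-token (index 0) projected features of the image and the text. The sketch below reproduces that arithmetic on toy vectors in plain Python, so it runs without LAVIS or a GPU. The function names and toy values are hypothetical. The sketch normalizes the vectors explicitly, so the result is a cosine similarity whether or not the inputs are already unit length (the README does not state whether the projected features are normalized).

```python
import math


def l2_normalize(vec):
    """Scale a vector to unit Euclidean length."""
    norm = math.sqrt(sum(x * x for x in vec))
    if norm == 0.0:
        raise ValueError("cannot normalize a zero vector")
    return [x / norm for x in vec]


def cls_similarity(image_tokens, text_tokens):
    """Cosine similarity between the first token of each sequence.

    Each argument is a list of token vectors, standing in for one batch
    element of image_embeds_proj / text_embeds_proj.
    """
    img_cls = l2_normalize(image_tokens[0])
    txt_cls = l2_normalize(text_tokens[0])
    return sum(a * b for a, b in zip(img_cls, txt_cls))


# Toy 2-dimensional "projected features" (illustrative values only).
image_tokens = [[3.0, 4.0], [1.0, 0.0]]
text_tokens = [[4.0, 3.0], [0.0, 1.0]]
print(round(cls_similarity(image_tokens, text_tokens), 4))  # 0.96
```

Only the first token of each sequence contributes to the score; the remaining tokens are ignored, matching the `[:,0,:]` slicing in the example above.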

### Load Datasets
LAVIS inherently supports a wide variety of common language-vision datasets by providing [automatic download tools](https://opensource.salesforce.com/LAVIS//latest/benchmark) to help download and organize these datasets. After downloading, to load the datasets, use the following code:

```python
from lavis.datasets.builders import dataset_zoo
dataset_names = dataset_zoo.get_names()
print(dataset_names)
# ['aok_vqa', 'coco_caption', 'coco_retrieval', 'coco_vqa', 'conceptual_caption_12m',
#  'conceptual_caption_3m', 'didemo_retrieval', 'flickr30k', 'imagenet', 'laion2B_multi',
#  'msrvtt_caption', 'msrvtt_qa', 'msrvtt_retrieval', 'msvd_caption', 'msvd_qa', 'nlvr',
#  'nocaps', 'ok_vqa', 'sbu_caption', 'snli_ve', 'vatex_caption', 'vg_caption', 'vg_vqa']
```
After downloading the images, we can use ``load_dataset()`` to obtain the dataset.
```python
from lavis.datasets.builders import load_dataset
coco_dataset = load_dataset("coco_caption")
print(coco_dataset.keys())
# dict_keys(['train', 'val', 'test'])
print(len(coco_dataset["train"]))
# 566747
print(coco_dataset["train"][0])
# {'image': <PIL.Image.Image image mode=RGB size=640x480>,
#  'text_input': 'A woman wearing a net on her head cutting a cake. ',
#  'image_id': 0}
```

If you already host a local copy of the dataset, you can pass in the ``vis_path`` argument to change the default location to load images.

```python
coco_dataset = load_dataset("coco_caption", vis_path=YOUR_LOCAL_PATH)
```

## Jupyter Notebook Examples
See [examples](https://github.com/salesforce/LAVIS/tree/main/examples) for more inference examples, e.g. captioning, feature extraction, VQA, GradCam, and zero-shot classification.

## Resources and Tools
- **Benchmarks**: see [Benchmark](https://opensource.salesforce.com/LAVIS//latest/benchmark) for instructions to evaluate and train supported models.
- **Dataset Download and Browsing**: see [Dataset Download](https://opensource.salesforce.com/LAVIS//latest/benchmark) for instructions and automatic tools to download common language-vision datasets.
- **GUI Demo**: to run the demo locally, run ```bash run_scripts/run_demo.sh``` and then follow the instructions in the prompts to view it in a browser. A web demo is coming soon.


## Documentation
For more details and advanced usage, please refer to the
[documentation](https://opensource.salesforce.com/LAVIS//latest/index.html#).

## Ethical and Responsible Use
We note that models in LAVIS provide no guarantees on their multimodal abilities; incorrect or biased predictions may be observed. In particular, the datasets and pretrained models utilized in LAVIS may contain socioeconomic biases which could result in misclassification and other unwanted behaviors such as offensive or inappropriate speech. We strongly recommend that users review the pre-trained models and overall system in LAVIS before practical adoption. We plan to improve the library by investigating and mitigating these potential biases and
inappropriate behaviors in the future.


## Contact us
If you have any questions, comments or suggestions, please do not hesitate to contact us at lavis@salesforce.com.

## License
[BSD 3-Clause License](LICENSE.txt)


================================================
FILE: SECURITY.md
================================================
## Security

Please report any security issue to [security@salesforce.com](mailto:security@salesforce.com)
as soon as it is discovered. This library limits its runtime dependencies in
order to reduce the total cost of ownership as much as can be, but all consumers
should remain vigilant and have their security stakeholders review all third-party
products (3PP) like this one and their dependencies.

================================================
FILE: app/__init__.py
================================================
"""
 # Copyright (c) 2022, salesforce.com, inc.
 # All rights reserved.
 # SPDX-License-Identifier: BSD-3-Clause
 # For full license text, see the LICENSE file in the repo root or https://opensource.org/licenses/BSD-3-Clause
"""

from PIL import Image
import requests

import streamlit as st
import torch


@st.cache()
def load_demo_image():
    img_url = (
        "https://storage.googleapis.com/sfr-vision-language-research/BLIP/demo.jpg"
    )
    raw_image = Image.open(requests.get(img_url, stream=True).raw).convert("RGB")
    return raw_image


device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

cache_root = "/export/home/.cache/lavis/"


================================================
FILE: app/calculate_coco_features.py
================================================
"""
 # Copyright (c) 2022, salesforce.com, inc.
 # All rights reserved.
 # SPDX-License-Identifier: BSD-3-Clause
 # For full license text, see the LICENSE file in the repo root or https://opensource.org/licenses/BSD-3-Clause
"""

from PIL import Image
import requests
import torch

import os

from lavis.common.registry import registry
from lavis.processors import *
from lavis.models import *
from lavis.common.utils import build_default_model

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")


def load_demo_image():
    img_url = (
        "https://storage.googleapis.com/sfr-vision-language-research/BLIP/demo.jpg"
    )
    raw_image = Image.open(requests.get(img_url, stream=True).raw).convert("RGB")

    return raw_image


def read_img(filepath):
    raw_image = Image.open(filepath).convert("RGB")

    return raw_image


# model
model_url = "https://storage.googleapis.com/sfr-vision-language-research/BLIP/models/model_base.pth"
feature_extractor = BlipFeatureExtractor(pretrained=model_url)

feature_extractor.eval()
feature_extractor = feature_extractor.to(device)

# preprocessors
vis_processor = BlipImageEvalProcessor(image_size=224)
text_processor = BlipCaptionProcessor()

# files to process
# file_root = "/export/home/.cache/lavis/coco/images/val2014"
file_root = "/export/home/.cache/lavis/coco/images/train2014"
filepaths = os.listdir(file_root)

print(len(filepaths))

caption = "dummy"

path2feat = dict()
bsz = 256

images_in_batch = []
filepaths_in_batch = []

for i, filename in enumerate(filepaths):
    filepath = os.path.join(file_root, filename)

    image = read_img(filepath)
    image = vis_processor(image).unsqueeze(0)

    images_in_batch.append(image)
    filepaths_in_batch.append(filepath)

    # Flush when the batch is full or the last file has been read, so that no
    # file is skipped and the final partial batch is still processed.
    if len(images_in_batch) == bsz or i == len(filepaths) - 1:
        images = torch.cat(images_in_batch, dim=0).to(device)
        with torch.no_grad():
            image_features = feature_extractor(
                images, caption, mode="image", normalized=True
            )[:, 0]

        for filepath, image_feat in zip(filepaths_in_batch, image_features):
            path2feat[os.path.basename(filepath)] = image_feat.detach().cpu()

        images_in_batch = []
        filepaths_in_batch = []

        print(len(path2feat), image_features.shape)

torch.save(path2feat, "path2feat_coco_train2014.pth")


================================================
FILE: app/caption.py
================================================
"""
 # Copyright (c) 2022, salesforce.com, inc.
 # All rights reserved.
 # SPDX-License-Identifier: BSD-3-Clause
 # For full license text, see the LICENSE file in the repo root or https://opensource.org/licenses/BSD-3-Clause
"""

import streamlit as st
from app import device, load_demo_image
from app.utils import load_model_cache
from lavis.processors import load_processor
from PIL import Image


def app():
    # ===== layout =====
    model_type = st.sidebar.selectbox("Model:", ["BLIP_base", "BLIP_large"])

    sampling_method = st.sidebar.selectbox(
        "Sampling method:", ["Beam search", "Nucleus sampling"]
    )

    st.markdown(
        "<h1 style='text-align: center;'>Image Description Generation</h1>",
        unsafe_allow_html=True,
    )

    instructions = """Try the provided image or upload your own:"""
    file = st.file_uploader(instructions)

    use_beam = sampling_method == "Beam search"

    col1, col2 = st.columns(2)

    if file:
        raw_img = Image.open(file).convert("RGB")
    else:
        raw_img = load_demo_image()

    col1.header("Image")

    w, h = raw_img.size
    scaling_factor = 720 / w
    resized_image = raw_img.resize((int(w * scaling_factor), int(h * scaling_factor)))

    col1.image(resized_image, use_column_width=True)
    col2.header("Description")

    cap_button = st.button("Generate")

    # ==== event ====
    vis_processor = load_processor("blip_image_eval").build(image_size=384)

    if cap_button:
        if model_type.startswith("BLIP"):
            blip_type = model_type.split("_")[1].lower()
            model = load_model_cache(
                "blip_caption",
                model_type=f"{blip_type}_coco",
                is_eval=True,
                device=device,
            )

        img = vis_processor(raw_img).unsqueeze(0).to(device)
        captions = generate_caption(
            model=model, image=img, use_nucleus_sampling=not use_beam
        )

        col2.write("\n\n".join(captions), use_column_width=True)


def generate_caption(
    model, image, use_nucleus_sampling=False, num_beams=3, max_length=40, min_length=5
):
    samples = {"image": image}

    captions = []
    if use_nucleus_sampling:
        for _ in range(5):
            caption = model.generate(
                samples,
                use_nucleus_sampling=True,
                max_length=max_length,
                min_length=min_length,
                top_p=0.9,
            )
            captions.append(caption[0])
    else:
        caption = model.generate(
            samples,
            use_nucleus_sampling=False,
            num_beams=num_beams,
            max_length=max_length,
            min_length=min_length,
        )
        captions.append(caption[0])

    return captions


================================================
FILE: app/classification.py
================================================
"""
 # Copyright (c) 2022, salesforce.com, inc.
 # All rights reserved.
 # SPDX-License-Identifier: BSD-3-Clause
 # For full license text, see the LICENSE file in the repo root or https://opensource.org/licenses/BSD-3-Clause
"""

import plotly.graph_objects as go
import requests
import streamlit as st
import torch
from lavis.models import load_model
from lavis.processors import load_processor
from lavis.processors.blip_processors import BlipCaptionProcessor
from PIL import Image

from app import device  # load_demo_image is redefined below with a different default image
from app.utils import load_blip_itm_model
from lavis.processors.clip_processors import ClipImageEvalProcessor


@st.cache()
def load_demo_image(img_url=None):
    if not img_url:
        img_url = "https://img.atlasobscura.com/yDJ86L8Ou6aIjBsxnlAy5f164w1rjTgcHZcx2yUs4mo/rt:fit/w:1200/q:81/sm:1/scp:1/ar:1/aHR0cHM6Ly9hdGxh/cy1kZXYuczMuYW1h/em9uYXdzLmNvbS91/cGxvYWRzL3BsYWNl/X2ltYWdlcy85MDll/MDRjOS00NTJjLTQx/NzQtYTY4MS02NmQw/MzI2YWIzNjk1ZGVk/MGZhMTJiMTM5MmZi/NGFfUmVhcl92aWV3/X29mX3RoZV9NZXJs/aW9uX3N0YXR1ZV9h/dF9NZXJsaW9uX1Bh/cmssX1NpbmdhcG9y/ZSxfd2l0aF9NYXJp/bmFfQmF5X1NhbmRz/X2luX3RoZV9kaXN0/YW5jZV8tXzIwMTQw/MzA3LmpwZw.jpg"
    raw_image = Image.open(requests.get(img_url, stream=True).raw).convert("RGB")
    return raw_image


@st.cache(
    hash_funcs={
        torch.nn.parameter.Parameter: lambda parameter: parameter.data.detach()
        .cpu()
        .numpy()
    },
    allow_output_mutation=True,
)
def load_model_cache(model_type, device):
    if model_type == "blip":
        model = load_model(
            "blip_feature_extractor", model_type="base", is_eval=True, device=device
        )
    elif model_type == "albef":
        model = load_model(
            "albef_feature_extractor", model_type="base", is_eval=True, device=device
        )
    elif model_type == "CLIP_ViT-B-32":
        model = load_model(
            "clip_feature_extractor", "ViT-B-32", is_eval=True, device=device
        )
    elif model_type == "CLIP_ViT-B-16":
        model = load_model(
            "clip_feature_extractor", "ViT-B-16", is_eval=True, device=device
        )
    elif model_type == "CLIP_ViT-L-14":
        model = load_model(
            "clip_feature_extractor", "ViT-L-14", is_eval=True, device=device
        )

    return model


def app():
    model_type = st.sidebar.selectbox(
        "Model:",
        ["ALBEF", "BLIP_Base", "CLIP_ViT-B-32", "CLIP_ViT-B-16", "CLIP_ViT-L-14"],
    )
    score_type = st.sidebar.selectbox("Score type:", ["Cosine", "Multimodal"])

    # ===== layout =====
    st.markdown(
        "<h1 style='text-align: center;'>Zero-shot Classification</h1>",
        unsafe_allow_html=True,
    )

    instructions = """Try the provided image or upload your own:"""
    file = st.file_uploader(instructions)

    st.header("Image")
    if file:
        raw_img = Image.open(file).convert("RGB")
    else:
        raw_img = load_demo_image()

    st.image(raw_img)  # , use_column_width=True)

    col1, col2 = st.columns(2)

    col1.header("Categories")

    cls_0 = col1.text_input("category 1", value="merlion")
    cls_1 = col1.text_input("category 2", value="sky")
    cls_2 = col1.text_input("category 3", value="giraffe")
    cls_3 = col1.text_input("category 4", value="fountain")
    cls_4 = col1.text_input("category 5", value="marina bay")

    cls_names = [cls_0, cls_1, cls_2, cls_3, cls_4]
    cls_names = [cls_nm for cls_nm in cls_names if len(cls_nm) > 0]

    if len(cls_names) != len(set(cls_names)):
        st.error("Please provide unique class names")
        return

    button = st.button("Submit")

    col2.header("Prediction")

    # ===== event =====

    if button:
        if model_type.startswith("BLIP"):
            text_processor = BlipCaptionProcessor(prompt="A picture of ")
            cls_prompt = [text_processor(cls_nm) for cls_nm in cls_names]

            if score_type == "Cosine":
                vis_processor = load_processor("blip_image_eval").build(image_size=224)
                img = vis_processor(raw_img).unsqueeze(0).to(device)

                feature_extractor = load_model_cache(model_type="blip", device=device)

                sample = {"image": img, "text_input": cls_prompt}

                with torch.no_grad():
                    image_features = feature_extractor.extract_features(
                        sample, mode="image"
                    ).image_embeds_proj[:, 0]
                    text_features = feature_extractor.extract_features(
                        sample, mode="text"
                    ).text_embeds_proj[:, 0]
                    sims = (image_features @ text_features.t())[
                        0
                    ] / feature_extractor.temp

            else:
                vis_processor = load_processor("blip_image_eval").build(image_size=384)
                img = vis_processor(raw_img).unsqueeze(0).to(device)

                model = load_blip_itm_model(device)

                output = model(img, cls_prompt, match_head="itm")
                sims = output[:, 1]

            sims = torch.nn.Softmax(dim=0)(sims)
            inv_sims = [sim * 100 for sim in sims.tolist()[::-1]]

        elif model_type.startswith("ALBEF"):
            vis_processor = load_processor("blip_image_eval").build(image_size=224)
            img = vis_processor(raw_img).unsqueeze(0).to(device)

            text_processor = BlipCaptionProcessor(prompt="A picture of ")
            cls_prompt = [text_processor(cls_nm) for cls_nm in cls_names]

            feature_extractor = load_model_cache(model_type="albef", device=device)

            sample = {"image": img, "text_input": cls_prompt}

            with torch.no_grad():
                image_features = feature_extractor.extract_features(
                    sample, mode="image"
                ).image_embeds_proj[:, 0]
                text_features = feature_extractor.extract_features(
                    sample, mode="text"
                ).text_embeds_proj[:, 0]

                st.write(image_features.shape)
                st.write(text_features.shape)

                sims = (image_features @ text_features.t())[0] / feature_extractor.temp

            sims = torch.nn.Softmax(dim=0)(sims)
            inv_sims = [sim * 100 for sim in sims.tolist()[::-1]]

        elif model_type.startswith("CLIP"):
            if model_type == "CLIP_ViT-B-32":
                model = load_model_cache(model_type="CLIP_ViT-B-32", device=device)
            elif model_type == "CLIP_ViT-B-16":
                model = load_model_cache(model_type="CLIP_ViT-B-16", device=device)
            elif model_type == "CLIP_ViT-L-14":
                model = load_model_cache(model_type="CLIP_ViT-L-14", device=device)
            else:
                raise ValueError(f"Unknown model type {model_type}")

            if score_type == "Cosine":
                # image_preprocess = ClipImageEvalProcessor(image_size=336)
                image_preprocess = ClipImageEvalProcessor(image_size=224)
                img = image_preprocess(raw_img).unsqueeze(0).to(device)

                sample = {"image": img, "text_input": cls_names}

                with torch.no_grad():
                    clip_features = model.extract_features(sample)

                    image_features = clip_features.image_embeds_proj
                    text_features = clip_features.text_embeds_proj

                    sims = (100.0 * image_features @ text_features.T)[0].softmax(dim=-1)
                    # scale to percentages, as in the BLIP and ALBEF branches
                    inv_sims = [sim * 100 for sim in sims.tolist()[::-1]]
            else:
                st.warning("CLIP does not support multimodal scoring.")
                return

        fig = go.Figure(
            go.Bar(
                x=inv_sims,
                y=cls_names[::-1],
                text=["{:.2f}".format(s) for s in inv_sims],
                orientation="h",
            )
        )
        fig.update_traces(
            textfont_size=12,
            textangle=0,
            textposition="outside",
            cliponaxis=False,
        )
        col2.plotly_chart(fig, use_container_width=True)


================================================
FILE: app/dataset_browser.py
================================================
"""
 # Copyright (c) 2022, salesforce.com, inc.
 # All rights reserved.
 # SPDX-License-Identifier: BSD-3-Clause
 # For full license text, see the LICENSE file in the repo root or https://opensource.org/licenses/BSD-3-Clause
"""

import random
from collections import OrderedDict
from functools import reduce

import streamlit as st
from lavis.common.registry import registry
from lavis.datasets.builders import dataset_zoo, load_dataset
from lavis.datasets.builders.base_dataset_builder import load_dataset_config
from PIL import Image

IMAGE_LAYOUT = 3, 4
VIDEO_LAYOUT = 1, 2

PREV_STR = "Prev"
NEXT_STR = "Next"


def sample_dataset(dataset, indices):
    samples = [dataset.displ_item(idx) for idx in indices]

    return samples


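def get_concat_v(im1, im2):
    # Despite the name, the two images are pasted side by side (horizontally),
    # separated by a small white margin.
    margin = 5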

    canvas_size = (im1.width + im2.width + margin, max(im1.height, im2.height))
    canvas = Image.new("RGB", canvas_size, "White")
    canvas.paste(im1, (0, 0))
    canvas.paste(im2, (im1.width + margin, 0))

    return canvas


def resize_img_w(raw_img, new_w=224):
    if isinstance(raw_img, list):
        resized_imgs = [resize_img_w(img, 196) for img in raw_img]
        # concatenate images
        resized_image = reduce(get_concat_v, resized_imgs)
    else:
        w, h = raw_img.size
        scaling_factor = new_w / w
        resized_image = raw_img.resize(
            (int(w * scaling_factor), int(h * scaling_factor))
        )

    return resized_image


def get_visual_key(dataset):
    if "image" in dataset[0]:
        return "image"
    elif "image0" in dataset[0]:  # NLVR2 dataset
        return "image"
    elif "video" in dataset[0]:
        return "video"
    else:
        raise ValueError("Visual key not found.")


def gather_items(samples, exclude=()):
    gathered = []

    for s in samples:
        ns = OrderedDict()
        for k in s.keys():
            if k not in exclude:
                ns[k] = s[k]

        gathered.append(ns)

    return gathered


@st.cache(allow_output_mutation=True)
def load_dataset_cache(name):
    return load_dataset(name)


def format_text(text):
    md = "\n\n".join([f"**{k}**: {v}" for k, v in text.items()])

    return md


def show_samples(dataset, offset=0, is_next=False):
    visual_key = get_visual_key(dataset)

    num_rows, num_cols = IMAGE_LAYOUT if visual_key == "image" else VIDEO_LAYOUT
    n_samples = num_rows * num_cols

    if not shuffle:
        if is_next:
            start = min(int(start_idx) + offset + n_samples, len(dataset) - n_samples)
        else:
            start = max(0, int(start_idx) + offset - n_samples)

        st.session_state.last_start = start
        end = min(start + n_samples, len(dataset))

        indices = list(range(start, end))
    else:
        indices = random.sample(range(len(dataset)), n_samples)
    samples = sample_dataset(dataset, indices)

    visual_info = (
        iter([resize_img_w(s[visual_key]) for s in samples])
        if visual_key == "image"
        # else iter([s[visual_key] for s in samples])
        else iter([s["file"] for s in samples])
    )
    text_info = gather_items(samples, exclude=["image", "video"])
    text_info = iter([format_text(s) for s in text_info])

    st.markdown(
        """<hr style="height:1px;border:none;color:#c7ccd4;background-color:#c7ccd4;"/> """,
        unsafe_allow_html=True,
    )
    for _ in range(num_rows):
        with st.container():
            for col in st.columns(num_cols):
                # col.text(next(text_info))
                # col.caption(next(text_info))
                try:
                    col.markdown(next(text_info))
                    if visual_key == "image":
                        col.image(next(visual_info), use_column_width=True, clamp=True)
                    elif visual_key == "video":
                        col.markdown(
                            "![Alt Text](https://media.giphy.com/media/vFKqnCdLPNOKc/giphy.gif)"
                        )
                except StopIteration:
                    break

            st.markdown(
                """<hr style="height:1px;border:none;color:#c7ccd4;background-color:#c7ccd4;"/> """,
                unsafe_allow_html=True,
            )

    st.session_state.n_display = n_samples


if __name__ == "__main__":
    st.set_page_config(
        page_title="LAVIS Dataset Explorer",
        # layout="wide",
        initial_sidebar_state="expanded",
    )

    dataset_name = st.sidebar.selectbox("Dataset:", dataset_zoo.get_names())

    function = st.sidebar.selectbox("Function:", ["Browser"], index=0)

    if function == "Browser":
        shuffle = st.sidebar.selectbox("Shuffled:", [True, False], index=0)

        dataset = load_dataset_cache(dataset_name)
        split = st.sidebar.selectbox("Split:", dataset.keys())

        dataset_len = len(dataset[split])
        st.success(
            f"Loaded {dataset_name}/{split} with **{dataset_len}** records.  **Image/video directory**: {dataset[split].vis_root}"
        )

        if "last_dataset" not in st.session_state:
            st.session_state.last_dataset = dataset_name
            st.session_state.last_split = split

        if "last_start" not in st.session_state:
            st.session_state.last_start = 0

        if "start_idx" not in st.session_state:
            st.session_state.start_idx = 0

        if "shuffle" not in st.session_state:
            st.session_state.shuffle = shuffle

        if "first_run" not in st.session_state:
            st.session_state.first_run = True
        elif (
            st.session_state.last_dataset != dataset_name
            or st.session_state.last_split != split
        ):
            st.session_state.first_run = True

            st.session_state.last_dataset = dataset_name
            st.session_state.last_split = split
        elif st.session_state.shuffle != shuffle:
            st.session_state.shuffle = shuffle
            st.session_state.first_run = True

        if not shuffle:
            n_col, p_col = st.columns([0.05, 1])

            prev_button = n_col.button(PREV_STR)
            next_button = p_col.button(NEXT_STR)

        else:
            next_button = st.button(NEXT_STR)

        if not shuffle:
            start_idx = st.sidebar.text_input(f"Begin from (total {dataset_len})", 0)

            if not start_idx.isdigit():
                st.error(f"Input to 'Begin from' must be digits, found {start_idx}.")
                # stop this run so that int(start_idx) is never called on invalid input
                st.stop()
            else:
                if int(start_idx) != st.session_state.start_idx:
                    st.session_state.start_idx = int(start_idx)
                    st.session_state.last_start = int(start_idx)

            if prev_button:
                show_samples(
                    dataset[split],
                    offset=st.session_state.last_start - st.session_state.start_idx,
                    is_next=False,
                )

        if next_button:
            show_samples(
                dataset[split],
                offset=st.session_state.last_start - st.session_state.start_idx,
                is_next=True,
            )

        if st.session_state.first_run:
            st.session_state.first_run = False

            show_samples(
                dataset[split],
                offset=st.session_state.last_start - st.session_state.start_idx,
                is_next=True,
            )


================================================
FILE: app/image_text_match.py
================================================
"""
 # Copyright (c) 2022, salesforce.com, inc.
 # All rights reserved.
 # SPDX-License-Identifier: BSD-3-Clause
 # For full license text, see the LICENSE file in the repo root or https://opensource.org/licenses/BSD-3-Clause
"""

import numpy as np
import streamlit as st
import torch
from lavis.models.blip_models.blip_image_text_matching import compute_gradcam
from lavis.processors import load_processor
from PIL import Image

from app import device, load_demo_image
from app.utils import getAttMap, init_bert_tokenizer, load_blip_itm_model


def app():
    model_type = st.sidebar.selectbox("Model:", ["BLIP_base", "BLIP_large"])

    if model_type.startswith("BLIP"):
        blip_type = model_type.split("_")[1]
        model = load_blip_itm_model(device, model_type=blip_type)

    vis_processor = load_processor("blip_image_eval").build(image_size=384)

    st.markdown(
        "<h1 style='text-align: center;'>Image Text Matching</h1>",
        unsafe_allow_html=True,
    )

    values = list(range(1, 12))
    default_layer_num = values.index(7)
    layer_num = (
        st.sidebar.selectbox("Layer number", values, index=default_layer_num) - 1
    )

    instructions = """Try the provided image or upload your own:"""
    file = st.file_uploader(instructions)

    col1, col2 = st.columns(2)
    col1.header("Image")
    col2.header("GradCam")
    if file:
        raw_img = Image.open(file).convert("RGB")
    else:
        raw_img = load_demo_image()

    w, h = raw_img.size
    scaling_factor = 720 / w
    resized_image = raw_img.resize((int(w * scaling_factor), int(h * scaling_factor)))
    col1.image(resized_image, use_column_width=True)

    col3, col4 = st.columns(2)
    col3.header("Text")
    user_question = col3.text_input(
        "Input your sentence!", "a woman sitting on the beach with a dog"
    )
    submit_button = col3.button("Submit")

    col4.header("Matching score")

    if submit_button:
        tokenizer = init_bert_tokenizer()

        img = vis_processor(raw_img).unsqueeze(0).to(device)
        text_processor = load_processor("blip_caption").build()

        qry = text_processor(user_question)

        norm_img = np.float32(resized_image) / 255

        qry_tok = tokenizer(qry, return_tensors="pt").to(device)
        gradcam, output = compute_gradcam(model, img, qry, qry_tok, block_num=layer_num)

        avg_gradcam = getAttMap(norm_img, gradcam[0][1], blur=True)

        col2.image(avg_gradcam, use_column_width=True, clamp=True)
        # output = model(img, question)
        itm_score = torch.nn.functional.softmax(output, dim=1)
        new_title = (
            '<p style="text-align: left; font-size: 25px;">\n{:.3f}%</p>'.format(
                itm_score[0][1].item() * 100
            )
        )
        col4.markdown(new_title, unsafe_allow_html=True)


================================================
FILE: app/main.py
================================================
"""
 # Copyright (c) 2022, salesforce.com, inc.
 # All rights reserved.
 # SPDX-License-Identifier: BSD-3-Clause
 # For full license text, see the LICENSE file in the repo root or https://opensource.org/licenses/BSD-3-Clause
"""

from app.multipage import MultiPage
from app import vqa, caption
from app import image_text_match as itm
from app import text_localization as tl
from app import multimodal_search as ms
from app import classification as cl


if __name__ == "__main__":
    app = MultiPage()

    app.add_page("Image Description Generation", caption.app)
    app.add_page("Multimodal Search", ms.app)
    app.add_page("Visual Question Answering", vqa.app)
    app.add_page("Image Text Matching", itm.app)
    app.add_page("Text Localization", tl.app)
    app.add_page("Classification", cl.app)
    app.run()


================================================
FILE: app/multimodal_search.py
================================================
"""
 # Copyright (c) 2022, salesforce.com, inc.
 # All rights reserved.
 # SPDX-License-Identifier: BSD-3-Clause
 # For full license text, see the LICENSE file in the repo root or https://opensource.org/licenses/BSD-3-Clause
"""

import os

import numpy as np
import streamlit as st
import torch
import torch.nn.functional as F
from app import cache_root, device
from app.utils import (
    getAttMap,
    init_bert_tokenizer,
    load_blip_itm_model,
    read_img,
    resize_img,
)
from lavis.models import load_model
from lavis.processors import load_processor


@st.cache(
    hash_funcs={
        torch.nn.parameter.Parameter: lambda parameter: parameter.data.detach()
        .cpu()
        .numpy()
    },
    allow_output_mutation=True,
)
def load_feat():
    from lavis.common.utils import download_url

    dirname = os.path.join(os.path.dirname(__file__), "assets")
    filename = "path2feat_coco_train2014.pth"
    filepath = os.path.join(dirname, filename)
    url = "https://storage.googleapis.com/sfr-vision-language-research/LAVIS/assets/path2feat_coco_train2014.pth"

    if not os.path.exists(filepath):
        download_url(url=url, root=dirname, filename="path2feat_coco_train2014.pth")

    path2feat = torch.load(filepath)
    paths = sorted(path2feat.keys())

    all_img_feats = torch.stack([path2feat[k] for k in paths], dim=0).to(device)

    return path2feat, paths, all_img_feats


@st.cache(
    hash_funcs={
        torch.nn.parameter.Parameter: lambda parameter: parameter.data.detach()
        .cpu()
        .numpy()
    },
    allow_output_mutation=True,
)
def load_feature_extractor_model(device):
    model_url = "https://storage.googleapis.com/sfr-vision-language-research/BLIP/models/model_base.pth"

    model = load_model(
        "blip_feature_extractor", model_type="base", is_eval=True, device=device
    )
    model.load_from_pretrained(model_url)

    return model


def app():
    # === layout ===
    model_type = st.sidebar.selectbox("Model:", ["BLIP_base", "BLIP_large"])
    file_root = os.path.join(cache_root, "coco/images/train2014/")

    values = [12, 24, 48]
    default_num_display_idx = values.index(24)
    num_display = st.sidebar.selectbox(
        "Number of images:", values, index=default_num_display_idx
    )
    show_gradcam = st.sidebar.selectbox("Show GradCam:", [True, False], index=1)
    itm_ranking = st.sidebar.selectbox("Multimodal re-ranking:", [True, False], index=0)

    # st.title('Multimodal Search')
    st.markdown(
        "<h1 style='text-align: center;'>Multimodal Search</h1>", unsafe_allow_html=True
    )

    # === event ===
    vis_processor = load_processor("blip_image_eval").build(image_size=384)
    text_processor = load_processor("blip_caption")

    user_question = st.text_input(
        "Search query", "A dog running on the grass.", help="Type something to search."
    )
    user_question = text_processor(user_question)
    feature_extractor = load_feature_extractor_model(device)

    # ======= ITC =========
    sample = {"text_input": user_question}

    with torch.no_grad():
        text_feature = feature_extractor.extract_features(
            sample, mode="text"
        ).text_embeds_proj[0, 0]

        path2feat, paths, all_img_feats = load_feat()
        # Tensor.to() is not in-place; keep the returned tensor
        all_img_feats = all_img_feats.to(device)
        all_img_feats = F.normalize(all_img_feats, dim=1)

        num_cols = 4
        num_rows = int(num_display / num_cols)

        similarities = text_feature @ all_img_feats.T
        indices = torch.argsort(similarities, descending=True)[:num_display]

    top_paths = [paths[ind.detach().cpu().item()] for ind in indices]
    sorted_similarities = [similarities[idx] for idx in indices]
    filenames = [os.path.join(file_root, p) for p in top_paths]

    # ========= ITM and GradCam ==========
    bsz = 4  # max number of images to avoid cuda oom
    if model_type.startswith("BLIP"):
        blip_type = model_type.split("_")[1]

    itm_model = load_blip_itm_model(device, model_type=blip_type)

    tokenizer = init_bert_tokenizer()
    queries_batch = [user_question] * bsz
    queries_tok_batch = tokenizer(queries_batch, return_tensors="pt").to(device)

    num_batches = int(num_display / bsz)

    avg_gradcams = []
    all_raw_images = []
    itm_scores = []

    for i in range(num_batches):
        filenames_in_batch = filenames[i * bsz : (i + 1) * bsz]
        raw_images, images = read_and_process_images(filenames_in_batch, vis_processor)
        gradcam, itm_output = compute_gradcam_batch(
            itm_model, images, queries_batch, queries_tok_batch
        )

        all_raw_images.extend([resize_img(r_img) for r_img in raw_images])
        norm_imgs = [np.float32(r_img) / 255 for r_img in raw_images]

        for norm_img, grad_cam in zip(norm_imgs, gradcam):
            avg_gradcam = getAttMap(norm_img, grad_cam[0], blur=True)
            avg_gradcams.append(avg_gradcam)

        with torch.no_grad():
            itm_score = torch.nn.functional.softmax(itm_output, dim=1)

        itm_scores.append(itm_score)

    # ========= ITM re-ranking =========
    itm_scores = torch.cat(itm_scores)[:, 1]
    if itm_ranking:
        itm_scores_sorted, indices = torch.sort(itm_scores, descending=True)

        avg_gradcams_sorted = []
        all_raw_images_sorted = []
        for idx in indices:
            avg_gradcams_sorted.append(avg_gradcams[idx])
            all_raw_images_sorted.append(all_raw_images[idx])

        avg_gradcams = avg_gradcams_sorted
        all_raw_images = all_raw_images_sorted

    if show_gradcam:
        images_to_show = iter(avg_gradcams)
    else:
        images_to_show = iter(all_raw_images)

    for _ in range(num_rows):
        with st.container():
            for col in st.columns(num_cols):
                col.image(next(images_to_show), use_column_width=True, clamp=True)


def read_and_process_images(image_paths, vis_processor):
    raw_images = [read_img(path) for path in image_paths]
    images = [vis_processor(r_img) for r_img in raw_images]
    images_tensors = torch.stack(images).to(device)

    return raw_images, images_tensors


def compute_gradcam_batch(model, visual_input, text_input, tokenized_text, block_num=6):
    model.text_encoder.base_model.base_model.encoder.layer[
        block_num
    ].crossattention.self.save_attention = True

    output = model({"image": visual_input, "text_input": text_input}, match_head="itm")
    loss = output[:, 1].sum()

    model.zero_grad()
    loss.backward()
    with torch.no_grad():
        mask = tokenized_text.attention_mask.view(
            tokenized_text.attention_mask.size(0), 1, -1, 1, 1
        )  # (bsz,1,token_len, 1,1)
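        # per-query token count without [CLS]/[SEP]; assumes every query in the
        # batch has the same length (the caller repeats a single query)
        token_length = mask[0].sum() - 2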
        token_length = token_length.cpu()
        # grads and cams [bsz, num_head, seq_len, image_patch]
        grads = model.text_encoder.base_model.base_model.encoder.layer[
            block_num
        ].crossattention.self.get_attn_gradients()
        cams = model.text_encoder.base_model.base_model.encoder.layer[
            block_num
        ].crossattention.self.get_attention_map()

        # assume using vit large with 576 num image patch
        cams = cams[:, :, :, 1:].reshape(visual_input.size(0), 12, -1, 24, 24) * mask
        grads = (
            grads[:, :, :, 1:].clamp(0).reshape(visual_input.size(0), 12, -1, 24, 24)
            * mask
        )

        gradcam = cams * grads
        # [enc token gradcam, average gradcam across token, gradcam for individual token]
        # gradcam = torch.cat((gradcam[0:1,:], gradcam[1:token_length+1, :].sum(dim=0, keepdim=True)/token_length, gradcam[1:, :]))
        gradcam = gradcam.mean(1).cpu().detach()
        gradcam = (
            gradcam[:, 1 : token_length + 1, :].sum(dim=1, keepdim=True) / token_length
        )

    return gradcam, output


================================================
FILE: app/multipage.py
================================================
"""
 # Copyright (c) 2022, salesforce.com, inc.
 # All rights reserved.
 # SPDX-License-Identifier: BSD-3-Clause
 # For full license text, see the LICENSE file in the repo root or https://opensource.org/licenses/BSD-3-Clause
"""

"""
This file is the framework for generating multiple Streamlit applications
through an object oriented framework.
"""

# Import necessary libraries
import streamlit as st

# Define the multipage class to manage the multiple apps in our program
class MultiPage:
    """Framework for combining multiple streamlit applications."""

    def __init__(self) -> None:
        """Initialize the list that stores all registered pages."""
        self.pages = []

    def add_page(self, title, func) -> None:
        """Add a page to the project.

        Args:
            title (str): The title of the page being added to the list of apps.
            func: Python function that renders this page in Streamlit.
        """

        self.pages.append({"title": title, "function": func})

    def run(self):
        # Dropdown to select the page to run
        page = st.sidebar.selectbox(
            "Navigation", self.pages, format_func=lambda page: page["title"]
        )

        # run the app function
        page["function"]()


================================================
FILE: app/text_localization.py
================================================
"""
 # Copyright (c) 2022, salesforce.com, inc.
 # All rights reserved.
 # SPDX-License-Identifier: BSD-3-Clause
 # For full license text, see the LICENSE file in the repo root or https://opensource.org/licenses/BSD-3-Clause
"""

import math

import numpy as np
import streamlit as st
from lavis.models.blip_models.blip_image_text_matching import compute_gradcam
from lavis.processors import load_processor
from PIL import Image

from app import device, load_demo_image
from app.utils import getAttMap, init_bert_tokenizer, load_blip_itm_model


def app():
    model_type = st.sidebar.selectbox("Model:", ["BLIP_base", "BLIP_large"])

    values = list(range(1, 12))
    default_layer_num = values.index(7)
    layer_num = (
        st.sidebar.selectbox("Layer number", values, index=default_layer_num) - 1
    )

    st.markdown(
        "<h1 style='text-align: center;'>Text Localization</h1>", unsafe_allow_html=True
    )

    vis_processor = load_processor("blip_image_eval").build(image_size=384)
    text_processor = load_processor("blip_caption")

    tokenizer = init_bert_tokenizer()

    instructions = "Try the provided image and text or use your own ones."
    file = st.file_uploader(instructions)

    query = st.text_input(
        "Try a different input.", "A girl playing with her dog on the beach."
    )

    submit_button = st.button("Submit")

    col1, col2 = st.columns(2)

    if file:
        raw_img = Image.open(file).convert("RGB")
    else:
        raw_img = load_demo_image()

    col1.header("Image")
    w, h = raw_img.size
    scaling_factor = 720 / w
    resized_image = raw_img.resize((int(w * scaling_factor), int(h * scaling_factor)))
    col1.image(resized_image, use_column_width=True)

    col2.header("GradCam")

    if submit_button:
        if model_type.startswith("BLIP"):
            blip_type = model_type.split("_")[1]
            model = load_blip_itm_model(device, model_type=blip_type)

        img = vis_processor(raw_img).unsqueeze(0).to(device)
        qry = text_processor(query)

        qry_tok = tokenizer(qry, return_tensors="pt").to(device)

        norm_img = np.float32(resized_image) / 255

        gradcam, _ = compute_gradcam(model, img, qry, qry_tok, block_num=layer_num)

        avg_gradcam = getAttMap(norm_img, gradcam[0][1], blur=True)
        col2.image(avg_gradcam, use_column_width=True, clamp=True)

        num_cols = 4.0
        num_tokens = len(qry_tok.input_ids[0]) - 2

        num_rows = int(math.ceil(num_tokens / num_cols))

        gradcam_iter = iter(gradcam[0][2:-1])
        token_id_iter = iter(qry_tok.input_ids[0][1:-1])

        for _ in range(num_rows):
            with st.container():
                for col in st.columns(int(num_cols)):
                    token_id = next(token_id_iter, None)
                    if not token_id:
                        break
                    gradcam_img = next(gradcam_iter)

                    word = tokenizer.decode([token_id])
                    gradcam_todraw = getAttMap(norm_img, gradcam_img, blur=True)

                    new_title = (
                        '<p style="text-align: center; font-size: 25px;">{}</p>'.format(
                            word
                        )
                    )
                    col.markdown(new_title, unsafe_allow_html=True)
                    # st.image(image, channels="BGR")
                    col.image(gradcam_todraw, use_column_width=True, clamp=True)


================================================
FILE: app/utils.py
================================================
"""
 # Copyright (c) 2022, salesforce.com, inc.
 # All rights reserved.
 # SPDX-License-Identifier: BSD-3-Clause
 # For full license text, see the LICENSE file in the repo root or https://opensource.org/licenses/BSD-3-Clause
"""

import numpy as np
import streamlit as st
import torch
from lavis.models import BlipBase, load_model
from matplotlib import pyplot as plt
from PIL import Image
from scipy.ndimage import filters
from skimage import transform as skimage_transform


def resize_img(raw_img):
    w, h = raw_img.size
    scaling_factor = 240 / w
    resized_image = raw_img.resize((int(w * scaling_factor), int(h * scaling_factor)))
    return resized_image


def read_img(filepath):
    raw_image = Image.open(filepath).convert("RGB")

    return raw_image


@st.cache(
    hash_funcs={
        torch.nn.parameter.Parameter: lambda parameter: parameter.data.detach()
        .cpu()
        .numpy()
    },
    allow_output_mutation=True,
)
def load_model_cache(name, model_type, is_eval, device):
    return load_model(name, model_type, is_eval, device)


@st.cache(allow_output_mutation=True)
def init_bert_tokenizer():
    tokenizer = BlipBase.init_tokenizer()
    return tokenizer


def getAttMap(img, attMap, blur=True, overlap=True):
    attMap -= attMap.min()
    if attMap.max() > 0:
        attMap /= attMap.max()
    attMap = skimage_transform.resize(attMap, (img.shape[:2]), order=3, mode="constant")
    if blur:
        attMap = filters.gaussian_filter(attMap, 0.02 * max(img.shape[:2]))
        attMap -= attMap.min()
        attMap /= attMap.max()
    cmap = plt.get_cmap("jet")
    attMapV = cmap(attMap)
    attMapV = np.delete(attMapV, 3, 2)
    if overlap:
        attMap = (
            1 * (1 - attMap**0.7).reshape(attMap.shape + (1,)) * img
            + (attMap**0.7).reshape(attMap.shape + (1,)) * attMapV
        )
    return attMap


@st.cache(
    hash_funcs={
        torch.nn.parameter.Parameter: lambda parameter: parameter.data.detach()
        .cpu()
        .numpy()
    },
    allow_output_mutation=True,
)
def load_blip_itm_model(device, model_type="base"):
    model = load_model(
        "blip_image_text_matching", model_type, is_eval=True, device=device
    )
    return model


================================================
FILE: app/vqa.py
================================================
"""
 # Copyright (c) 2022, salesforce.com, inc.
 # All rights reserved.
 # SPDX-License-Identifier: BSD-3-Clause
 # For full license text, see the LICENSE file in the repo root or https://opensource.org/licenses/BSD-3-Clause
"""

import streamlit as st
from app import load_demo_image, device
from app.utils import load_model_cache
from lavis.processors import load_processor
from PIL import Image


def app():
    model_type = st.sidebar.selectbox("Model:", ["BLIP"])

    # ===== layout =====
    st.markdown(
        "<h1 style='text-align: center;'>Visual Question Answering</h1>",
        unsafe_allow_html=True,
    )

    instructions = """Try the provided image or upload your own:"""
    file = st.file_uploader(instructions)

    col1, col2 = st.columns(2)

    col1.header("Image")
    if file:
        raw_img = Image.open(file).convert("RGB")
    else:
        raw_img = load_demo_image()

    w, h = raw_img.size
    scaling_factor = 720 / w
    resized_image = raw_img.resize((int(w * scaling_factor), int(h * scaling_factor)))

    col1.image(resized_image, use_column_width=True)
    col2.header("Question")

    user_question = col2.text_input("Input your question!", "What are objects there?")
    qa_button = st.button("Submit")

    col2.header("Answer")

    # ===== event =====
    vis_processor = load_processor("blip_image_eval").build(image_size=480)
    text_processor = load_processor("blip_question").build()

    if qa_button:
        if model_type.startswith("BLIP"):
            model = load_model_cache(
                "blip_vqa", model_type="vqav2", is_eval=True, device=device
            )

            img = vis_processor(raw_img).unsqueeze(0).to(device)
            question = text_processor(user_question)

            vqa_samples = {"image": img, "text_input": [question]}
            answers = model.predict_answers(vqa_samples, inference_method="generate")

            col2.write("\n".join(answers), use_column_width=True)


================================================
FILE: dataset_card/avsd_dialogue.md
================================================
![Samples from the AVSD dataset (Image credit: "https://arxiv.org/pdf/1901.09107.pdf").](imgs/avsd_dialogue.png)(Samples from the AVSD dataset. Image credit: "https://arxiv.org/pdf/1901.09107.pdf")

# Audio-Visual Scene-Aware Dialogues (AVSD) 

## Description
[Audio-Visual Scene-Aware Dialogues (AVSD)](https://github.com/hudaAlamri/DSTC7-Audio-Visual-Scene-Aware-Dialog-AVSD-Challenge) contains more than 10,000 dialogues, each of which is grounded on a unique video. In the test split, for each test sample, 6 reference dialogue responses are provided. 


## Task

(https://github.com/hudaAlamri/DSTC7-Audio-Visual-Scene-Aware-Dialog-AVSD-Challenge)

In a **video-grounded dialogue task**, the system must generate responses to a user input in the context of a given dialog.
This context consists of a dialog history (previous utterances by both user and system) in addition to the video and audio information that make up the scene. The quality of a system's automatically generated sentences is evaluated using objective measures to determine whether the generated responses are natural and informative.

## Metrics
Models are typically evaluated according to [BLEU](https://aclanthology.org/P02-1040/), [CIDEr](https://www.cv-foundation.org/openaccess/content_cvpr_2015/papers/Vedantam_CIDEr_Consensus-Based_Image_2015_CVPR_paper.pdf), [METEOR](https://aclanthology.org/W05-0909/), and [ROUGE-L](https://aclanthology.org/W04-1013/) metrics.

## Leaderboard

TBD


## Auto-Downloading

Please refer to the [benchmark website](https://github.com/hudaAlamri/DSTC7-Audio-Visual-Scene-Aware-Dialog-AVSD-Challenge) for instructions on downloading the dataset.


## References
"Audio Visual Scene-Aware Dialog", Huda Alamri, Vincent Cartillier, Abhishek Das, Jue Wang, Anoop Cherian, Irfan Essa, Dhruv Batra, Tim K. Marks, Chiori Hori, Peter Anderson, Stefan Lee, Devi Parikh




================================================
FILE: dataset_card/coco_caption.md
================================================
![Samples from the COCO Caption dataset (Image credit: "https://arxiv.org/pdf/1504.00325.pdf").](imgs/coco_caption.png)(Samples from the COCO Caption dataset. Image credit: "https://arxiv.org/pdf/1504.00325.pdf")

# Microsoft COCO Dataset (Captioning)

## Description
The [Microsoft COCO Captions dataset](https://github.com/tylin/coco-caption) contains over one and a half million captions describing over 330,000 images. For the training and validation images, five independent human-generated captions are provided for each image.

## Task

(from https://paperswithcode.com/task/image-captioning)

**Image captioning** is the task of describing the content of an image in words. This task lies at the intersection of computer vision and natural language processing. Most image captioning systems use an encoder-decoder framework, where an input image is encoded into an intermediate representation of the information in the image, and then decoded into a descriptive text sequence.

## Metrics
Models are typically evaluated according to a [BLEU](https://aclanthology.org/P02-1040/) or [CIDEr](https://www.cv-foundation.org/openaccess/content_cvpr_2015/papers/Vedantam_CIDEr_Consensus-Based_Image_2015_CVPR_paper.pdf) metric.
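
BLEU is built on clipped (modified) n-gram precision between a candidate caption and its reference captions. The snippet below is a minimal sketch of that one ingredient only; it omits the brevity penalty and the geometric mean over n-gram orders, and it is not the evaluation code used for the leaderboard. The function names `ngrams` and `modified_precision` and the example sentences are illustrative assumptions.

```python
from collections import Counter


def ngrams(tokens, n):
    """Return a Counter of all n-grams (as tuples) in a token list."""
    return Counter(tuple(tokens[i : i + n]) for i in range(len(tokens) - n + 1))


def modified_precision(candidate, references, n):
    """Clipped n-gram precision of one candidate against several references."""
    cand_counts = ngrams(candidate, n)
    if not cand_counts:
        return 0.0
    # The clip limit of an n-gram is its maximum count in any single reference.
    max_ref_counts = Counter()
    for ref in references:
        for gram, count in ngrams(ref, n).items():
            max_ref_counts[gram] = max(max_ref_counts[gram], count)
    clipped = sum(
        min(count, max_ref_counts[gram]) for gram, count in cand_counts.items()
    )
    return clipped / sum(cand_counts.values())


# Hypothetical candidate caption and two reference captions (whitespace-tokenized).
candidate = "a dog runs on the grass".split()
references = [
    "a dog is running on the grass".split(),
    "the dog runs across the lawn".split(),
]
p1 = modified_precision(candidate, references, 1)  # unigram precision
p2 = modified_precision(candidate, references, 2)  # bigram precision
print(p1, p2)
```

Clipping against the per-reference maximum keeps a candidate from being rewarded for repeating a matching word more often than any reference does.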

## Leaderboard

(Ranked by CIDEr)

| Rank |  Model  | BLEU-4 | CIDEr | METEOR | SPICE |                                                                    Resources                                                                     |
| ---- | :-----: | :----: | :---: | :----: | :---: | :----------------------------------------------------------------------------------------------------------------------------------------------: |
| 1    |   OFA   |  44.9  | 154.9 |  32.5  | 26.6  |                                [paper](https://arxiv.org/abs/2202.03052), [code](https://github.com/OFA-Sys/OFA)                                 |
| 2    |  LEMON  |  42.6  | 145.5 |  31.4  | 25.5  |                                                                    [paper]()                                                                     |
| 3    |  CoCa  |   40.9   |  143.6  | 33.9 | 24.7 | [paper](https://arxiv.org/pdf/2205.01917.pdf) |
| 4    | SimVLM  |  40.6  | 143.3 |  33.7  | 25.4  |                                                [paper](https://openreview.net/pdf?id=GUrhfTuf_3)                                                 |
| 5    |  VinVL  |  41.0  | 140.9 |  31.1  | 25.2  |                           [paper](https://arxiv.org/pdf/2101.00529v2.pdf), [code](https://github.com/microsoft/Oscar)                            |
| 6    |  OSCAR  |  40.7  | 140.0 |  30.6  | 24.5  |                           [paper](https://arxiv.org/pdf/2004.06165v5.pdf), [code](https://github.com/microsoft/Oscar)                            |
| 7    |  BLIP   |  40.4  | 136.7 |  31.4  | 24.3  | [paper](https://arxiv.org/pdf/2201.12086.pdf), [code](https://github.com/salesforce/BLIP), [demo](https://huggingface.co/spaces/Salesforce/BLIP) |
| 8    |   M^2   |  39.1  | 131.2 |  29.2  | 22.6  |                 [paper](https://arxiv.org/pdf/1912.08226v2.pdf), [code](https://github.com/aimagelab/meshed-memory-transformer)                  |
| 9    |  BUTD   |  36.5  | 113.5 |  27.0  | 20.3  |               [paper](https://arxiv.org/abs/1707.07998?context=cs), [code](https://github.com/peteanderson80/bottom-up-attention)                |
| 10    | ClipCap |  32.2  | 108.4 |  27.1  | 20.1  |                     [paper](https://arxiv.org/pdf/2111.09734v1.pdf), [code](https://github.com/rmokady/clip_prefix_caption)                      |

## Auto-Downloading

```
cd lavis/datasets/download_scripts && python download_coco.py
```

## References
"Microsoft COCO Captions: Data Collection and Evaluation Server", Xinlei Chen, Hao Fang, Tsung-Yi Lin, Ramakrishna Vedantam, Saurabh Gupta, Piotr Dollar, C. Lawrence Zitnick


================================================
FILE: dataset_card/coco_retrieval.md
================================================
![Samples from the COCO Caption dataset (Image credit: "https://arxiv.org/pdf/1504.00325.pdf").](imgs/coco_caption.png)(Samples from the COCO Caption dataset. Image credit: "https://arxiv.org/pdf/1504.00325.pdf")

# Microsoft COCO Dataset (Retrieval)

## Description
The [Microsoft COCO dataset](https://github.com/tylin/coco-caption) contains over one and a half million captions describing over 330,000 images. For the training and validation images, five independent human-generated captions are provided for each image.

## Task
Cross-modal retrieval: (1) **image-text**: given an image as query, retrieve texts from a gallery; (2) **text-image**: given a text as query, retrieve images from a gallery.


## Metrics
The common metric is recall@k, which denotes the [recall score](https://en.wikipedia.org/wiki/Precision_and_recall) over the top k retrieved results.

We use TR to denote the image-text retrieval recall score and IR to denote the text-image retrieval recall score.
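
As a rough illustration of how TR@k and IR@k can be computed from an image-text similarity matrix, here is a minimal NumPy sketch. It counts a query as a hit when at least one correct item appears in its top k results. The helper name `recall_at_k`, the toy similarity values, and the ground-truth mappings are assumptions for illustration, not the evaluation code behind the leaderboard.

```python
import numpy as np


def recall_at_k(sim, positives, k):
    """Fraction of queries with at least one positive among the top-k gallery items.

    sim:       (num_queries, num_gallery) similarity matrix, higher is better.
    positives: positives[q] is the set of correct gallery indices for query q.
    """
    # Indices of the k highest-scoring gallery items for every query.
    topk = np.argsort(-sim, axis=1)[:, :k]
    hits = [
        len(positives[q].intersection(topk[q].tolist())) > 0
        for q in range(sim.shape[0])
    ]
    return float(np.mean(hits))


# Toy setup: 2 images, 4 captions (captions 0-1 describe image 0, captions 2-3 image 1).
sim_i2t = np.array(
    [
        [0.9, 0.2, 0.8, 0.1],  # image 0 vs. each caption
        [0.3, 0.7, 0.6, 0.5],  # image 1 vs. each caption
    ]
)
img2txt = [{0, 1}, {2, 3}]  # correct captions per image query (TR)
txt2img = [{0}, {0}, {1}, {1}]  # correct image per text query (IR)

tr1 = recall_at_k(sim_i2t, img2txt, k=1)
tr2 = recall_at_k(sim_i2t, img2txt, k=2)
ir1 = recall_at_k(sim_i2t.T, txt2img, k=1)  # transpose: captions become queries
print(tr1, tr2, ir1)
```

The same matrix serves both directions: rows are queries for TR, and the transpose makes captions the queries for IR.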

## Leaderboard
(Ranked by TR@1.)
| Rank | Model  | TR@1  | TR@5  | TR@10 | IR@1  | IR@5  | IR@10 |                                                                                                                   Resources                                                                                                                    |
| ---- | :----: | :---: | :---: | :---: | :---: | :---: | :---: | :--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------: |
| 1    |  BLIP  | 82.4  | 95.4  | 97.9  | 65.1  | 86.3  | 91.8  | [paper](https://arxiv.org/pdf/2201.12086.pdf), [code](https://github.com/salesforce/BLIP), [demo](https://huggingface.co/spaces/Salesforce/BLIP), [blog](https://blog.salesforceairesearch.com/blip-bootstrapping-language-image-pretraining/) |
| 2    | X-VLM  | 81.2  | 95.6  | 98.2  | 63.4  | 85.8  | 91.5  |                                                                          [paper](https://arxiv.org/pdf/2111.08276v3.pdf), [code](https://github.com/zengyan-97/X-VLM)                                                                          |
| 3    | ALBEF  | 77.6  | 94.3  | 97.2  | 60.7  | 84.3  | 90.5  |                                            [paper](https://arxiv.org/abs/2107.07651), [code](https://github.com/salesforce/ALBEF), [blog](https://blog.salesforceairesearch.com/align-before-fuse/)                                            |
| 4    | ALIGN  | 77.0  | 93.5  | 96.9  | 59.9  | 83.3  | 89.8  |                                                                                                   [paper](https://arxiv.org/abs/2102.05918)                                                                                                    |
| 5    | VinVL  | 75.4  | 92.9  | 96.2  | 58.8  | 83.5  | 90.3  |                                                                          [paper](https://arxiv.org/pdf/2101.00529v2.pdf), [code](https://github.com/microsoft/Oscar)                                                                           |
| 6    | OSCAR  | 73.5  | 92.2  | 96.0  | 57.5  | 82.8  | 89.8  |                                                                          [paper](https://arxiv.org/pdf/2004.06165v5.pdf), [code](https://github.com/microsoft/Oscar)                                                                           |
| 7    | UNITER | 65.7  | 88.6  | 93.8  | 52.9  | 79.9  | 88.0  |                                                          [paper](https://www.ecva.net/papers/eccv_2020/papers_ECCV/papers/123750103.pdf), [code](https://github.com/ChenRocks/UNITER)                                                          |

## Auto-Downloading

```
cd lavis/datasets/download_scripts && python download_coco.py
```

## References
"Microsoft COCO Captions: Data Collection and Evaluation Server", Xinlei Chen, Hao Fang, Tsung-Yi Lin, Ramakrishna Vedantam, Saurabh Gupta, Piotr Dollar, C. Lawrence Zitnick


================================================
FILE: dataset_card/conceptual_captions.md
================================================
![From https://arxiv.org/pdf/1505.00468.pdf.](imgs/conceptual_captions.png)
(image credit: https://ai.google.com/research/ConceptualCaptions/download)

# Conceptual Captions Dataset

## Description
(from https://huggingface.co/datasets/conceptual_captions)

Conceptual Captions 3M (CC3M) is a dataset consisting of ~3.3M images annotated with captions. In contrast with the curated style of other image caption annotations, Conceptual Caption images and their raw descriptions are harvested from the web, and therefore represent a wider variety of styles. More precisely, the raw descriptions are harvested from the Alt-text HTML attribute associated with web images. To arrive at the current version of the captions, we have developed an automatic pipeline that extracts, filters, and transforms candidate image/caption pairs, with the goal of achieving a balance of cleanliness, informativeness, fluency, and learnability of the resulting captions.

Conceptual Captions 12M (CC12M) is a dataset with 12 million image-text pairs specifically meant to be used for vision-and-language pre-training. Its data collection pipeline is a relaxed version of the one used in Conceptual Captions 3M (CC3M).

## Task

Image-language pre-training; image captioning.

## Auto-Downloading
**Warning**: images of this dataset are downloaded by requesting URLs. Since URLs may disappear over time, expect the downloaded dataset to be incomplete.
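
Because some URLs will no longer resolve, a download loop has to tolerate failures rather than abort. The following is a minimal sketch of that idea, not the logic of the download scripts referenced below; `fetch_url`, `download_all`, `fake_fetch`, and the example URLs are hypothetical names introduced for illustration.

```python
import urllib.request


def fetch_url(url, timeout=10):
    """Default fetcher: return the raw bytes behind a URL (raises on failure)."""
    with urllib.request.urlopen(url, timeout=timeout) as resp:
        return resp.read()


def download_all(urls, fetch=fetch_url):
    """Try every URL; collect successes and record failures instead of aborting."""
    images, failed = {}, []
    for url in urls:
        try:
            images[url] = fetch(url)
        except Exception:  # dead link, timeout, HTTP error, ...
            failed.append(url)
    return images, failed


# Offline demonstration with a stand-in fetcher that simulates one dead link.
def fake_fetch(url):
    if "gone" in url:
        raise OSError("simulated 404")
    return b"image-bytes"


urls = [
    "http://example.com/a.jpg",
    "http://example.com/gone.jpg",
    "http://example.com/b.jpg",
]
images, failed = download_all(urls, fetch=fake_fetch)
print(f"downloaded {len(images)} of {len(urls)}; failed: {failed}")
```

Keeping the list of failed URLs makes it possible to report how partial the resulting dataset is, or to retry later.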

### Conceptual Captions 3M
- Download images
```
cd lavis/datasets/download_scripts/DownloadConceptualCaptions && python download_data_cc3m.py
```
- Create annotations by running the notebook
```lavis/datasets/download_scripts/DownloadConceptualCaptions/create_annotation_3m.ipynb```

### Conceptual Captions 12M
- Download images
```
cd lavis/datasets/download_scripts/DownloadConceptualCaptions && python download_data_cc12m.py
```
- Create annotations by running the notebook
```lavis/datasets/download_scripts/DownloadConceptualCaptions/create_annotation_12m.ipynb```

## References
Edwin G. Ng, Bo Pang, Piyush Sharma and Radu Soricut. 2020. Understanding Guided Image Captioning Performance Across Domains. arXiv preprint arXiv:2012.02339.


================================================
FILE: dataset_card/didemo_retrieval.md
================================================
![Samples from the DiDeMo dataset.](imgs/didemo.png)(Samples from the DiDeMo dataset. Image credit: "https://www.di.ens.fr/~miech/datasetviz/")

# DiDeMo Dataset (Retrieval)

## Description
The Distinct Describable Moments (DiDeMo) dataset contains over 10,000 unedited personal videos collected from Flickr, annotated with over 40,000 localized natural-language descriptions of video moments.

## Task
Cross-modal retrieval: (1) **video-text**: given a video as query, retrieve texts from a gallery; (2) **text-video**: given a text as query, retrieve videos from a gallery.


## Metrics
The common metric is recall@k, which denotes the [recall score](https://en.wikipedia.org/wiki/Precision_and_recall) over the top k retrieved results.

We use TR to denote the video-text retrieval recall score and IR to denote the text-video retrieval recall score.

## Leaderboard
(Ranked by TR@1.)
<!-- | Rank | Model  | TR@1  | TR@5  | TR@10 | IR@1  | IR@5  | IR@10 |                                                                                                                   Resources                                                                                                                    |
| ---- | :----: | :---: | :---: | :---: | :---: | :---: | :---: | :--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------: |
| 1    |  BLIP  | 82.4  | 95.4  | 97.9  | 65.1  | 86.3  | 91.8  | [paper](https://arxiv.org/pdf/2201.12086.pdf), [code](https://github.com/salesforce/BLIP), [demo](https://huggingface.co/spaces/Salesforce/BLIP), [blog](https://blog.salesforceairesearch.com/blip-bootstrapping-language-image-pretraining/) |
| 2    | X-VLM  | 81.2  | 95.6  | 98.2  | 63.4  | 85.8  | 91.5  |                                                                          [paper](https://arxiv.org/pdf/2111.08276v3.pdf), [code](https://github.com/zengyan-97/X-VLM)                                                                          |
| 3    | ALBEF  | 77.6  | 94.3  | 97.2  | 60.7  | 84.3  | 90.5  |                                            [paper](https://arxiv.org/abs/2107.07651), [code](https://github.com/salesforce/ALBEF), [blog](https://blog.salesforceairesearch.com/align-before-fuse/)                                            |
| 3    | ALIGN  | 77.0  | 93.5  | 96.9  | 59.9  | 83.3  | 89.8  |                                                                                                   [paper](https://arxiv.org/abs/2102.05918)                                                                                                    |
| 4    | VinVL  | 75.4  | 92.9  | 96.2  | 58.8  | 83.5  | 90.3  |                                                                          [paper](https://arxiv.org/pdf/2101.00529v2.pdf), [code](https://github.com/microsoft/Oscar)                                                                           |
| 5    | OSCAR  | 73.5  | 92.2  | 96.0  | 57.5  | 82.8  | 89.8  |                                                                          [paper](https://arxiv.org/pdf/2004.06165v5.pdf), [code](https://github.com/microsoft/Oscar)                                                                           |
| 6    | UNITER | 65.7  | 88.6  | 93.8  | 52.9  | 79.9  | 88.0  |                                                          [paper](https://www.ecva.net/papers/eccv_2020/papers_ECCV/papers/123750103.pdf), [code](https://github.com/ChenRocks/UNITER)                                                          | -->

## Auto-Downloading

```
cd lavis/datasets/download_scripts && python download_didemo.py
```

## References
Anne Hendricks, Lisa, Oliver Wang, Eli Shechtman, Josef Sivic, Trevor Darrell, and Bryan Russell. "Localizing moments in video with natural language." In Proceedings of the IEEE international conference on computer vision, pp. 5803-5812. 2017.


================================================
FILE: dataset_card/flickr_retrieval.md
================================================
![Samples from Flickr30k dataset (Image credit: "https://bryanplummer.com/Flickr30kEntities/").](imgs/flickr30k.png)Samples from Flickr30k dataset (Image credit: "https://bryanplummer.com/Flickr30kEntities/")

# Flickr30K Dataset (Retrieval)

## Description
The [Flickr30k](https://bryanplummer.com/Flickr30kEntities/) dataset contains 31k+ images collected from Flickr, each paired with 5 reference sentences provided by human annotators.

## Task
Cross-modal retrieval: (1) **image-text**: given an image as query, retrieve texts from a gallery; (2) **text-image**: given a text as query, retrieve images from a gallery.


## Metrics
The common metric is recall@k, which denotes the [recall score](https://en.wikipedia.org/wiki/Precision_and_recall) over the top k retrieved results.

We use TR to denote the image-text retrieval recall score and IR to denote the text-image retrieval recall score.

## Leaderboard
(Ranked by TR@1.)
| Rank | Model  | TR@1  | TR@5  | TR@10 | IR@1  | IR@5  | IR@10 |                                                                                                                   Resources                                                                                                                    |
| ---- | :----: | :---: | :---: | :---: | :---: | :---: | :---: | :--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------: |
| 1    |  BLIP  | 97.2  | 99.9  | 100.0  | 87.5  | 97.7  | 98.9  | [paper](https://arxiv.org/pdf/2201.12086.pdf), [code](https://github.com/salesforce/BLIP), [demo](https://huggingface.co/spaces/Salesforce/BLIP), [blog](https://blog.salesforceairesearch.com/blip-bootstrapping-language-image-pretraining/) |
| 2    | X-VLM  | 97.1  | 100.0  | 100.0  | 86.9  | 97.3  | 98.7  |                                                                          [paper](https://arxiv.org/pdf/2111.08276v3.pdf), [code](https://github.com/zengyan-97/X-VLM)                                                                          |
| 3    | ALBEF  | 95.9  | 99.8  | 100.0  | 85.6  | 97.5  | 98.9  |                                            [paper](https://arxiv.org/abs/2107.07651), [code](https://github.com/salesforce/ALBEF), [blog](https://blog.salesforceairesearch.com/align-before-fuse/)                                            |
| 4    | ALIGN  | 95.3  | 99.8  | 100.0  | 84.9  | 97.4  | 98.6  |                                                                                                   [paper](https://arxiv.org/abs/2102.05918)                                                                                                    |
| 5    | VILLA  | 87.9  | 97.5  | 98.8  | 76.3  | 94.2  | 96.8  |                                                                          [paper](https://arxiv.org/pdf/2004.06165v5.pdf), [code](https://github.com/microsoft/Oscar)                                                                           |
| 6    | UNITER | 87.3  | 98.0  | 99.2  | 75.6  | 94.1  | 96.8  |                                                          [paper](https://www.ecva.net/papers/eccv_2020/papers_ECCV/papers/123750103.pdf), [code](https://github.com/ChenRocks/UNITER)                                                          |

## Auto-Downloading
```
cd lavis/datasets/download_scripts && python download_flickr.py
```

## References
Bryan A. Plummer, Liwei Wang, Christopher M. Cervantes, Juan C. Caicedo, Julia Hockenmaier, and Svetlana Lazebnik, Flickr30K Entities: Collecting Region-to-Phrase Correspondences for Richer Image-to-Sentence Models, IJCV, 123(1):74-93, 2017. [paper]


================================================
FILE: dataset_card/gqa.md
================================================
![From https://arxiv.org/abs/1902.09506.pdf.](imgs/gqa.png)

# GQA Dataset

## Description
(from https://cs.stanford.edu/people/dorarad/gqa/about.html)

GQA is a VQA dataset of real-world images that requires visual, spatial and compositional reasoning.
It consists of 22M questions and 110K images.

## Task
(from https://arxiv.org/abs/1902.09506)

Given an image and a question, the model is required to output a correct answer. 
GQA questions require spatial understanding, multiple reasoning skills and multiple-step inference. 

## Metrics

The metrics are accuracy, consistency, validity, and plausibility. Accuracy is the most commonly reported.

## Leaderboard

TBD

## Auto-Downloading

```
cd lavis/datasets/download_scripts && python download_gqa.py
```

## References
"GQA: A New Dataset for Real-World Visual Reasoning and Compositional Question Answering", Drew A. Hudson, Christopher D. Manning

================================================
FILE: dataset_card/msrvtt_qa.md
================================================
![Samples from MSRVTT-QA dataset.](imgs/msrvtt_qa.png)(Samples from MSRVTT-QA dataset, image credit: http://staff.ustc.edu.cn/~hexn/papers/mm17-videoQA.pdf)

# MSRVTT Dataset (Video Question Answering)

## Description
[MSRVTT](https://www.microsoft.com/en-us/research/publication/msr-vtt-a-large-video-description-dataset-for-bridging-video-and-language/) dataset is a large-scale video benchmark for video understanding, especially the emerging task of translating video to text. This is achieved by collecting 257 popular queries from a commercial video search engine, with 118 videos for each query. In its current version, MSR-VTT provides 10K web video clips with 41.2 hours and 200K clip-sentence pairs in total. Each clip is annotated with about 20 natural sentences by 1,327 AMT workers.

[MSRVTT-QA](http://staff.ustc.edu.cn/~hexn/papers/mm17-videoQA.pdf) dataset is based on the MSR-VTT dataset, which is larger and has more complex scenes. The dataset
contains 10K video clips and 243k question answer pairs.
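
As a quick sanity check on the figures quoted above, the following sketch derives per-clip averages from them. The variable names are illustrative, and the inputs are the rounded numbers from the description, so the results are approximate.

```python
# Rounded figures quoted in the description above (approximate by construction).
NUM_CLIPS = 10_000
TOTAL_HOURS = 41.2
CLIP_SENTENCE_PAIRS = 200_000
QA_PAIRS = 243_000
QUERIES, VIDEOS_PER_QUERY = 257, 118

avg_clip_seconds = TOTAL_HOURS * 3600 / NUM_CLIPS     # average clip duration
sentences_per_clip = CLIP_SENTENCE_PAIRS / NUM_CLIPS  # captions per clip
qa_pairs_per_clip = QA_PAIRS / NUM_CLIPS              # MSRVTT-QA pairs per clip
candidate_videos = QUERIES * VIDEOS_PER_QUERY         # videos gathered from queries

print(f"average clip length: {avg_clip_seconds:.1f} s")
print(f"sentences per clip:  {sentences_per_clip:.0f}")
print(f"QA pairs per clip:   {qa_pairs_per_clip:.1f}")
print(f"candidate videos:    {candidate_videos}")
```

With these inputs the average clip is roughly 15 seconds long, with about 20 sentences and about 24 question-answer pairs per clip.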

## Task
Video question answering (VideoQA) is the task where
a video and a natural language question are provided and the model
needs to give the right answer (from [paper](http://staff.ustc.edu.cn/~hexn/papers/mm17-videoQA.pdf)).


## Metrics
Accuracy.
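
A minimal sketch of top-1 accuracy for VideoQA, assuming each question has a single ground-truth answer string and that a prediction is correct on an exact match. The lowercasing and whitespace stripping are an assumption made for illustration; the function name and sample answers are hypothetical.

```python
def exact_match_accuracy(predictions, answers):
    """Percentage of predictions that exactly match the ground-truth answer."""
    if len(predictions) != len(answers):
        raise ValueError("predictions and answers must have the same length")
    correct = sum(
        p.strip().lower() == a.strip().lower()
        for p, a in zip(predictions, answers)
    )
    return 100.0 * correct / len(answers)


# Hypothetical model outputs and ground-truth answers for four questions.
preds = ["dog", "Two", "running", "man"]
golds = ["dog", "two", "walking", "woman"]

acc = exact_match_accuracy(preds, golds)
print(f"Acc. = {acc:.1f}")
```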

## Leaderboard
(Ranked by accuracy on test-dev.)
| Rank | Model  | Acc. | Resources |
| ---- | :----: | :-------: | :-------: |
| 1    |  ALPro  |  42.1 |  [paper](https://arxiv.org/abs/2112.09583), [code](https://github.com/salesforce/ALPRO), [blog](https://blog.salesforceairesearch.com/alpro/) |
| 2   |  VQA-T  |  41.5 | [paper](https://openaccess.thecvf.com/content/ICCV2021/papers/Yang_Just_Ask_Learning_To_Answer_Questions_From_Millions_of_Narrated_ICCV_2021_paper.pdf), [code](https://github.com/antoyang/just-ask), [demo](http://videoqa.paris.inria.fr/) |
| 3   |  CoMVT | 39.5 | [paper](https://openaccess.thecvf.com/content/CVPR2021/papers/Seo_Look_Before_You_Speak_Visually_Contextualized_Utterances_CVPR_2021_paper.pdf) |
| 4   |  ClipBERT | 37.4 | [paper](https://arxiv.org/abs/2102.06183), [code](https://github.com/jayleicn/ClipBERT) |
| 5   |  HCRN | 35.6 | [paper](https://arxiv.org/abs/2002.10698), [code](https://github.com/thaolmk54/hcrn-videoqa) |
| 6   |  HGA | 35.5 | [paper](https://ojs.aaai.org/index.php/AAAI/article/view/6767), [code](https://github.com/Jumpin2/HGA) |
| 7   |  DualVGR | 35.5 | [paper](https://arxiv.org/pdf/2107.04768v1.pdf), [code](https://github.com/NJUPT-MCC/DualVGR-VideoQA) |
| 8   |  SSML | 35.1 | [paper](https://arxiv.org/abs/2003.03186) |
| 9   |  HME | 33.0 | [paper](https://arxiv.org/pdf/1904.04357.pdf), [code](https://github.com/fanchenyou/HME-VideoQA) |
| 10   |  AMU | 32.5 | [paper](http://staff.ustc.edu.cn/~hexn/papers/mm17-videoQA.pdf), [code](https://github.com/xudejing/video-question-answering) |


## Auto-Downloading
```
cd lavis/datasets/download_scripts && python download_msrvtt.py
```

## References
Xu, Jun, Tao Mei, Ting Yao, and Yong Rui. "Msr-vtt: A large video description dataset for bridging video and language." In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 5288-5296. 2016.

Xu, Dejing, Zhou Zhao, Jun Xiao, Fei Wu, Hanwang Zhang, Xiangnan He, and Yueting Zhuang. "Video question answering via gradually refined attention over appearance and motion." In Proceedings of the 25th ACM international conference on Multimedia, pp. 1645-1653. 2017.


================================================
FILE: dataset_card/msrvtt_retrieval.md
================================================
![Samples from MSRVTT dataset.](imgs/msrvtt.png)

# MSRVTT Dataset (Retrieval)

## Description
[MSRVTT](https://www.microsoft.com/en-us/research/publication/msr-vtt-a-large-video-description-dataset-for-bridging-video-and-language/) dataset is a large-scale video benchmark for video understanding, especially the emerging task of translating video to text. This is achieved by collecting 257 popular queries from a commercial video search engine, with 118 videos for each query. In its current version, MSR-VTT provides 10K web video clips with 41.2 hours and 200K clip-sentence pairs in total. Each clip is annotated with about 20 natural sentences by 1,327 AMT workers.

## Task
Cross-modal retrieval: (1) **video-text**: given a video as query, retrieve texts from a gallery; (2) **text-video**: given a text as query, retrieve videos from a gallery.


## Metrics
The common metric is recall@k, which denotes the [recall score](https://en.wikipedia.org/wiki/Precision_and_recall) computed over the top k retrieved results.

We use TR to denote the video-text retrieval recall score and VR to denote text-video retrieval score.

## Leaderboard
(Ranked by TR@1.)
<!-- | Rank | Model  | TR@1  | TR@5  | TR@10 | IR@1  | IR@5  | IR@10 |                                                                                                                   Resources                                                                                                                    |
| ---- | :----: | :---: | :---: | :---: | :---: | :---: | :---: | :--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------: |
| 1    |  BLIP  | 82.4  | 95.4  | 97.9  | 65.1  | 86.3  | 91.8  | [paper](https://arxiv.org/pdf/2201.12086.pdf), [code](https://github.com/salesforce/BLIP), [demo](https://huggingface.co/spaces/Salesforce/BLIP), [blog](https://blog.salesforceairesearch.com/blip-bootstrapping-language-image-pretraining/) | -->

## References
Xu, Jun, Tao Mei, Ting Yao, and Yong Rui. "Msr-vtt: A large video description dataset for bridging video and language." In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 5288-5296. 2016.


================================================
FILE: dataset_card/msvd_qa.md
================================================
![Samples from MSVD-QA dataset.](imgs/msvd_qa.png)(Samples from MSVD-QA dataset, image credit: http://staff.ustc.edu.cn/~hexn/papers/mm17-videoQA.pdf)

# MSVD Dataset (Video Question Answering)

## Description
[MSVD-QA](http://staff.ustc.edu.cn/~hexn/papers/mm17-videoQA.pdf) dataset is based on Microsoft Research Video
Description Corpus (https://www.cs.utexas.edu/users/ml/clamp/videoDescription/) which is used in many video captioning
experiments. The MSVD-QA dataset has a total number of 1,970
video clips and 50,505 question answer pairs.


## Task
Video question answering (VideoQA) is the task where
a video and a natural language question are provided and the model
needs to give the right answer (from [paper](http://staff.ustc.edu.cn/~hexn/papers/mm17-videoQA.pdf)).


## Metrics
Accuracy.

## Leaderboard
(Ranked by accuracy on test-dev.)
| Rank | Model  | Acc. | Resources |
| ---- | :----: | :-------: | :-------: |
| 1   |  VQA-T  |  46.3 | [paper](https://openaccess.thecvf.com/content/ICCV2021/papers/Yang_Just_Ask_Learning_To_Answer_Questions_From_Millions_of_Narrated_ICCV_2021_paper.pdf), [code](https://github.com/antoyang/just-ask), [demo](http://videoqa.paris.inria.fr/) |
| 2    |  ALPro  |  45.9 |  [paper](https://arxiv.org/abs/2112.09583), [code](https://github.com/salesforce/ALPRO), [blog](https://blog.salesforceairesearch.com/alpro/) |
| 3   |  CoMVT | 42.6 | [paper](https://openaccess.thecvf.com/content/CVPR2021/papers/Seo_Look_Before_You_Speak_Visually_Contextualized_Utterances_CVPR_2021_paper.pdf) |
| 4   |  DualVGR | 39.0 | [paper](https://arxiv.org/pdf/2107.04768v1.pdf) [code](https://github.com/NJUPT-MCC/DualVGR-VideoQA) |
| 5   |  HCRN | 36.1 | [paper](https://arxiv.org/abs/2002.10698) [code](https://github.com/thaolmk54/hcrn-videoqa) |
| 6   |  SSML | 35.1 | [paper](https://arxiv.org/abs/2003.03186) |
| 7   |  HGA | 34.7 | [paper](https://ojs.aaai.org/index.php/AAAI/article/view/6767) [code](https://github.com/Jumpin2/HGA) |
| 8   |  HME | 33.7 | [paper](https://arxiv.org/pdf/1904.04357.pdf), [code](https://github.com/fanchenyou/HME-VideoQA) |
| 9   |  AMU | 32.0 | [paper](http://staff.ustc.edu.cn/~hexn/papers/mm17-videoQA.pdf), [code](https://github.com/xudejing/video-question-answering) |
| 10   |  ST-VQA | 31.3 | [paper](https://arxiv.org/pdf/1704.04497.pdf), [code](https://github.com/YunseokJANG/tgif-qa) |


## Auto-Downloading
```
cd lavis/datasets/download_scripts && python download_msvd.py
```

## References
Chen, David, and William B. Dolan. "Collecting highly parallel data for paraphrase evaluation." In Proceedings of the 49th annual meeting of the association for computational linguistics: human language technologies, pp. 190-200. 2011.

Xu, Dejing, Zhou Zhao, Jun Xiao, Fei Wu, Hanwang Zhang, Xiangnan He, and Yueting Zhuang. "Video question answering via gradually refined attention over appearance and motion." In Proceedings of the 25th ACM international conference on Multimedia, pp. 1645-1653. 2017.


================================================
FILE: dataset_card/nlvr2.md
================================================
![From https://arxiv.org/pdf/1505.00468.pdf.](imgs/NLVR2.png)

# Natural Language for Visual Reasoning for Real (NLVR2)

## Description
(from https://lil.nlp.cornell.edu/nlvr/)

NLVR2 contains 107,292 examples of human-written English sentences grounded in pairs of photographs. NLVR2 retains the linguistic diversity of NLVR, while including much more visually complex images.

We only publicly release the sentence annotations and original image URLs, and scripts that download the images from the URLs. If you would like direct access to the images, please fill out this Google Form. This form asks for your basic information and asks you to agree to our Terms of Service.


## Task
(from https://lil.nlp.cornell.edu/nlvr/)
The Natural Language for Visual Reasoning (NLVR) task is to determine whether a sentence is true about a visual input. The data was collected through crowdsourcing, and solving the task requires reasoning about sets of objects, comparisons, and spatial relations. This includes two corpora: NLVR, with synthetically generated images, and NLVR2, which includes natural photographs.


## Metrics
Accuracy.

## Leaderboard
(Ranked by accuracy on dev.)
| Rank | Model  | dev | test | Resources |
| ---- | :----: | :------: | :------: | :-------: |
| 1    |  VLMo  |   88.6   |   89.5   |  [paper](https://arxiv.org/pdf/2111.02358.pdf) |
| 2    |  CoCa  |   86.1   |   87.0   |  [paper](https://arxiv.org/pdf/2205.01917.pdf) |
| 3    | SimVLM  |   84.5   |   85.2   | [paper](https://openreview.net/pdf?id=GUrhfTuf_3) |
| 4    | X-VLM  | 84.4  | 84.8  |  [paper](https://arxiv.org/pdf/2111.08276v3.pdf), [code](https://github.com/zengyan-97/X-VLM) |
| 5    | VinVL  | 82.7 | 84.0 |                                                                          [paper](https://arxiv.org/pdf/2101.00529.pdf), [code](https://github.com/pzzhang/VinVL)                                                                           |
| 6    | ALBEF  |   82.6   |   83.1   |  [paper](https://arxiv.org/abs/2107.07651), [code](https://github.com/salesforce/ALBEF), [blog](https://blog.salesforceairesearch.com/align-before-fuse/)                                                 |
| 7    | BLIP  |   82.2   |   82.2   | [paper](https://arxiv.org/pdf/2201.12086.pdf), [code](https://github.com/salesforce/BLIP), [demo](https://huggingface.co/spaces/Salesforce/BLIP), [blog](https://blog.salesforceairesearch.com/blip-bootstrapping-language-image-pretraining/)|
| 8    |  OSCAR  |  78.1  | 78.4 |                           [paper](https://arxiv.org/pdf/2004.06165v5.pdf), [code](https://github.com/microsoft/Oscar)                            |
| 9    | UNITER | 77.2  | 77.9 |                                                          [paper](https://www.ecva.net/papers/eccv_2020/papers_ECCV/papers/123750103.pdf), [code](https://github.com/ChenRocks/UNITER)                                                          |
| 10    | SOHO  |   76.4   |   77.3  | [paper](https://arxiv.org/pdf/2104.03135.pdf), [code](https://github.com/researchmm/soho) |


## Downloading
Auto-downloading is not supported for this dataset. Please refer to https://lil.nlp.cornell.edu/nlvr/ and fill in the Google form to download the original images.


## References
Suhr, Alane, Stephanie Zhou, Ally Zhang, Iris Zhang, Huajun Bai, and Yoav Artzi. "A corpus for reasoning about natural language grounded in photographs." arXiv preprint arXiv:1811.00491 (2018).


================================================
FILE: dataset_card/nocaps.md
================================================
![Samples from the COCO Caption dataset (Image credit: "https://arxiv.org/pdf/1504.00325.pdf").](imgs/nocaps.png)

# Nocaps

## Description

The nocaps benchmark consists of 166,100 human-generated captions describing 15,100 images from the Open Images validation and test sets. The associated training data consists of COCO image-caption pairs, plus Open Images image-level labels and object bounding boxes. Since Open Images contains many more classes than COCO, nearly 400 object classes seen in test images have no or very few associated training captions (hence, nocaps).


## Task: Novel object captioning

(from https://nocaps.org/)

Image captioning models have achieved impressive results on datasets containing limited visual concepts and large amounts of paired image-caption training data. However, if these models are to ever function in the wild, a much larger variety of visual concepts must be learned, ideally from less supervision. To encourage the development of image captioning models that can learn visual concepts from alternative data sources, such as object detection datasets, we present the first large-scale benchmark for this task, dubbed nocaps, for novel object captioning at scale.


## Metrics
Models are typically evaluated according to the [CIDEr](https://www.cv-foundation.org/openaccess/content_cvpr_2015/papers/Vedantam_CIDEr_Consensus-Based_Image_2015_CVPR_paper.pdf) or [SPICE](https://arxiv.org/abs/1607.08822) metric.

## Leaderboard

(Ranked by CIDEr)

| Rank |  Model  | val. CIDEr | val. SPICE | test CIDEr | test SPICE | Resources |
| ---- | :-----: | :--------: | :--------: | :--------: | :--------: | :-------: |
| 1    |  CoCa  |   122.4   |  15.5  | 120.6 | 15.5| [paper](https://arxiv.org/pdf/2205.01917.pdf) |
| 2    |  LEMON  |  117.3 | 15.0  |114.3 | 14.9           |                                                         [paper]()                                                                     |
| 3    |  BLIP   |  113.2 | 14.8 |  -  | -  | [paper](https://arxiv.org/pdf/2201.12086.pdf), [code](https://github.com/salesforce/BLIP), [demo](https://huggingface.co/spaces/Salesforce/BLIP) |
| 4    | SimVLM  |  112.2  | - | 110.3  | 14.5  |                                                [paper](https://openreview.net/pdf?id=GUrhfTuf_3)                                                 |
| 5    |  VinVL  |  105.1  | 14.4 |  103.7  | 14.4  |                           [paper](https://arxiv.org/pdf/2101.00529v2.pdf), [code](https://github.com/microsoft/Oscar)                            |

## Auto-Downloading
```
cd lavis/datasets/download_scripts && python download_nocaps.py
```

## References
Agrawal, Harsh, Karan Desai, Yufei Wang, Xinlei Chen, Rishabh Jain, Mark Johnson, Dhruv Batra, Devi Parikh, Stefan Lee, and Peter Anderson. "Nocaps: Novel object captioning at scale." In Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 8948-8957. 2019.


================================================
FILE: dataset_card/sbu_caption.md
================================================
![sbu caption](imgs/sbu_caption.png)
(image credit: http://tamaraberg.com/papers/generation_nips2011.pdf)

# SBU Caption Dataset
(from http://tamaraberg.com/papers/generation_nips2011.pdf)

SBU caption dataset is a new dataset, collected by performing Flickr queries and
then filtering the noisy results down to 1 million images with associated visually
relevant captions.

## Auto-Downloading
```
cd lavis/datasets/download_scripts && python download_sbu.py
```
## References
```bibtex
@inproceedings{Ordonez:2011:im2text,
  Author    = {Vicente Ordonez and Girish Kulkarni and Tamara L. Berg},
  Title     = {Im2Text: Describing Images Using 1 Million Captioned Photographs},
  Booktitle = {Neural Information Processing Systems ({NIPS})},
  Year      = {2011},
}
```


================================================
FILE: dataset_card/snli_visual_entailment.md
================================================
![From https://github.com/necla-ml/SNLI-VE.](imgs/snli_ve.png)

# SNLI-VE: Visual Entailment Dataset

## Description
(from https://arxiv.org/abs/1811.10582)

**The SNLI_VE dataset is built on top of Flickr30k. See downloading scripts below.**

Distribution by Split

The details of the train, dev and test splits are shown below. Instances of the three labels (entailment, neutral and contradiction) are evenly distributed within each split.

|  | Train | Dev | Test |
| ---- | :----: | :------: | :------: |
| #Image | 29783 | 1000 | 1000 |
| #Entailment | 176932 | 5959 | 5973 |
| #Neutral | 176045 | 5960 | 5964 |
| #Contradiction | 176550 | 5939 | 5964 |
| Vocabulary Size | 29550 | 6576 | 6592 |
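
The claim that labels are evenly distributed can be checked directly from the counts in the table above. The sketch below is only an arithmetic check; the dictionary layout and function name are illustrative.

```python
# Label counts per split, copied from the table above.
counts = {
    "train": {"entailment": 176932, "neutral": 176045, "contradiction": 176550},
    "dev": {"entailment": 5959, "neutral": 5960, "contradiction": 5939},
    "test": {"entailment": 5973, "neutral": 5964, "contradiction": 5964},
}


def label_shares(split_counts):
    """Return each label's fraction of the instances in one split."""
    total = sum(split_counts.values())
    return {label: n / total for label, n in split_counts.items()}


shares = {split: label_shares(c) for split, c in counts.items()}
for split, split_shares in shares.items():
    print(split, {label: round(v, 4) for label, v in split_shares.items()})
```

Each label accounts for close to one third of the instances in every split.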

## Task
(from https://github.com/necla-ml/SNLI-VE)

The problem that Visual Entailment (VE) is trying to solve is to reason about the relationship between an image premise P_{image} and a text hypothesis H_{text}.

Specifically, given an image as premise and a natural language sentence as hypothesis, one of three labels (entailment, neutral and contradiction) is assigned based on the relationship conveyed by the pair (P_{image}, H_{text}):

- Entailment holds if there is enough evidence in P_{image} to conclude that H_{text} is true.
- Contradiction holds if there is enough evidence in P_{image} to conclude that H_{text} is false.
- Otherwise, the relationship is neutral, implying the evidence in P_{image} is insufficient to draw a conclusion about H_{text}.


## Metrics
Accuracy.

## Leaderboard
(Ranked by accuracy on dev.)
| Rank | Model  | dev | test | Resources |
| ---- | :----: | :------: | :------: | :-------: |
| 1    |  CoCa  |   87.0   |   87.1   |  [paper](https://arxiv.org/pdf/2205.01917.pdf) |
| 2    | SimVLM  |   86.2   |   86.3   | [paper](https://openreview.net/pdf?id=GUrhfTuf_3) |
| 3    | SOHO  |   85.0   |  85.0  | [paper](https://arxiv.org/pdf/2104.03135.pdf), [code](https://github.com/researchmm/soho) |
| 4    | ALBEF  |   80.8   |   80.9   |  [paper](https://arxiv.org/abs/2107.07651), [code](https://github.com/salesforce/ALBEF), [blog](https://blog.salesforceairesearch.com/align-before-fuse/)                                                 |
| 5    | VILLA  | 80.2  | 80.0  |                                                                          [paper](https://arxiv.org/pdf/2004.06165v5.pdf), [code](https://github.com/microsoft/Oscar)                                                                           |
| 6    | UNITER | 79.4  | 79.4 |                                                          [paper](https://www.ecva.net/papers/eccv_2020/papers_ECCV/papers/123750103.pdf), [code](https://github.com/ChenRocks/UNITER)                                                          |
| 7    | LXMERT | 72.4  | 72.5 |                                                          [paper](https://aclanthology.org/D19-1514.pdf), [code](https://github.com/airsplay/lxmert)                                                          |
| 8    |  BUTD   |  65.3  | 65.7 |   [paper](https://arxiv.org/abs/1707.07998?context=cs), [code](https://github.com/peteanderson80/bottom-up-attention)                |

## Auto-Downloading
```
cd lavis/datasets/download_scripts && python download_flickr.py
```

## References
Xie, Ning, Farley Lai, Derek Doran, and Asim Kadav. "Visual entailment task for visually-grounded language learning." arXiv preprint arXiv:1811.10582 (2018).


================================================
FILE: dataset_card/vqav2.md
================================================
![From https://arxiv.org/pdf/1505.00468.pdf.](imgs/vqav2.png)

# Microsoft COCO Dataset (VQAv2)

## Description
(from https://visualqa.org/index.html)

Visual Question Answering (VQA) v2.0 is a dataset containing open-ended questions about images. These questions require an understanding of vision, language and commonsense knowledge to answer. It is the second version of the VQA dataset.

- 265,016 images (COCO and abstract scenes)
- At least 3 questions (5.4 questions on average) per image
- 10 ground truth answers per question
- 3 plausible (but likely incorrect) answers per question
- Automatic evaluation metric

## Task
(from https://arxiv.org/pdf/1505.00468.pdf)

The task of free-form and open-ended Visual Question Answering (VQA): given an image and a natural
language question about the image, the task is to provide an accurate natural language answer. Mirroring real-world scenarios, such
as helping the visually impaired, both the questions and answers are open-ended.

## Metrics
Accuracies computed by evaluation server: https://eval.ai/web/challenges/challenge-page/830/leaderboard/2278
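
For intuition, the sketch below implements the simplified form of the VQA accuracy described in the VQA paper: an answer is scored as min(number of annotators who gave that answer / 3, 1). This is an approximation for illustration only. The official evaluation additionally normalizes answer strings and averages over subsets of the 10 annotators, which is omitted here; the function name and sample answers are hypothetical.

```python
def vqa_accuracy(predicted, human_answers):
    """Simplified VQA accuracy: full credit once 3 annotators agree."""
    matches = sum(1 for answer in human_answers if answer == predicted)
    return min(matches / 3.0, 1.0)


# Ten hypothetical ground-truth answers for one question.
human_answers = ["yes"] * 2 + ["no"] * 8

score_yes = vqa_accuracy("yes", human_answers)      # 2 matches: partial credit
score_no = vqa_accuracy("no", human_answers)        # 8 matches: capped at 1.0
score_maybe = vqa_accuracy("maybe", human_answers)  # no matches
print(score_yes, score_no, score_maybe)
```

The reported test-dev and test-std numbers are this per-question score averaged over all questions, expressed as a percentage.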

## Leaderboard
(Ranked by accuracy on test-dev.)
| Rank | Model  | test-dev | test-std | Resources |
| ---- | :----: | :------: | :------: | :-------: |
| 1    |  VLMo  |   82.8   |   82.8   |  [paper](https://arxiv.org/pdf/2111.02358.pdf) |
| 2    |  CoCa  |   82.3   |   82.3   |  [paper](https://arxiv.org/pdf/2205.01917.pdf) |
| 3    |  OFA  |   82.0   |   82.0   |   [paper](https://arxiv.org/abs/2202.03052), [code](https://github.com/OFA-Sys/OFA)  |
| 4    |  Florence  |   80.2   |   80.4   |   [paper](https://arxiv.org/abs/2111.11432), [code](https://github.com/OFA-Sys/OFA)  |
| 5    | SimVLM  |   80.0   |   80.3   | [paper](https://openreview.net/pdf?id=GUrhfTuf_3)                                                                                                                  |
| 6    | BLIP  |   78.3   |   78.3   | [paper](https://arxiv.org/pdf/2201.12086.pdf), [code](https://github.com/salesforce/BLIP), [demo](https://huggingface.co/spaces/Salesforce/BLIP) |
| 7    | X-VLM  | 78.2  | 78.4  |  [paper](https://arxiv.org/pdf/2111.08276v3.pdf), [code](https://github.com/zengyan-97/X-VLM)                                                                          |
| 8    | VinVL  |   76.6   |   76.6  | [paper](https://arxiv.org/pdf/2101.00529v2.pdf), [code](https://github.com/microsoft/Oscar) |
| 9    | ALBEF  |   75.8   |   76.0   |  [paper](https://arxiv.org/abs/2107.07651), [code](https://github.com/salesforce/ALBEF), [blog](https://blog.salesforceairesearch.com/align-before-fuse/)                                                                                       |
| 10    | UNITER | 73.8  | 74.0 |                                                          [paper](https://www.ecva.net/papers/eccv_2020/papers_ECCV/papers/123750103.pdf), [code](https://github.com/ChenRocks/UNITER)                                                          |

## Auto-Downloading

```
cd lavis/datasets/download_scripts && python download_coco.py
```

## References
"Microsoft COCO Captions: Data Collection and Evaluation Server", Xinlei Chen, Hao Fang, Tsung-Yi Lin, Ramakrishna Vedantam, Saurabh Gupta, Piotr Dollar, C. Lawrence Zitnick

"Vqa: Visual question answering." Antol, Stanislaw, Aishwarya Agrawal, Jiasen Lu, Margaret Mitchell, Dhruv Batra, C. Lawrence Zitnick, and Devi Parikh.


================================================
FILE: docs/Makefile
================================================
# Minimal makefile for Sphinx documentation
#

# You can set these variables from the command line, and also
# from the environment for the first two.
SPHINXOPTS    ?=
SPHINXBUILD   ?= sphinx-build
SOURCEDIR     = source
BUILDDIR      = build

# Put it first so that "make" without argument is like "make help".
help:
	@$(SPHINXBUILD) -M help "$(SOURCEDIR)" "$(BUILDDIR)" $(SPHINXOPTS) $(O)

.PHONY: help Makefile

# Catch-all target: route all unknown targets to Sphinx using the new
# "make mode" option.  $(O) is meant as a shortcut for $(SPHINXOPTS).
%: Makefile
	@$(SPHINXBUILD) -M $@ "$(SOURCEDIR)" "$(BUILDDIR)" $(SPHINXOPTS) $(O)


================================================
FILE: docs/benchmark.rst
================================================
Benchmark
############

We provide scripts for evaluating and training models on task datasets. The following benchmark results are included for reference.


ALBEF
*******
.. list-table::
   :widths: 30 80 20

   * - **Pretraining**
     - COCO (`download <https://github.com/salesforce/LAVIS/blob/main/lavis/datasets/download_scripts/download_coco.py>`__)
     - `script <https://github.com/salesforce/LAVIS/blob/main/run_scripts/albef/train/pretrain.sh>`__
   * -
     - Visual Genome (`download <https://github.com/salesforce/LAVIS/blob/main/lavis/datasets/download_scripts/download_vg.py>`__)
     -
   * -
     - SBU (`download <https://github.com/salesforce/LAVIS/blob/main/lavis/datasets/download_scripts/download_sbu.py>`__)
     -
   * -
     - CC3M (`download <https://github.com/salesforce/LAVIS/blob/main/lavis/datasets/download_scripts/DownloadConceptualCaptions/download_data_cc3m.py>`__)
     -
   * -
     - CC12M (`download <https://github.com/salesforce/LAVIS/blob/main/lavis/datasets/download_scripts/DownloadConceptualCaptions/download_data_cc12m.py>`__)
     -

.. list-table::
   :widths: 30 40 20 20 20 30 30
   :header-rows: 1

   * -
     - **Retrieval**
     - **R1**
     - **R5**
     - **R10**
     - **Training**
     - **Evaluation**
   * - TR
     - COCO (`download <https://github.com/salesforce/LAVIS/blob/main/lavis/datasets/download_scripts/download_coco.py>`__)
     - 77.6
     - 94.1
     - 97.2
     - `script <https://github.com/salesforce/LAVIS/blob/main/run_scripts/albef/train/train_coco_retrieval_albef.sh>`__
     - `script <https://github.com/salesforce/LAVIS/blob/main/run_scripts/albef/eval/eval_coco_retrieval.sh>`__
   * - IR
     - COCO (`download <https://github.com/salesforce/LAVIS/blob/main/lavis/datasets/download_scripts/download_coco.py>`__)
     - 61.0
     - 84.5
     - 90.7
     - `script <https://github.com/salesforce/LAVIS/blob/main/run_scripts/albef/train/train_coco_retrieval_albef.sh>`__
     - `script <https://github.com/salesforce/LAVIS/blob/main/run_scripts/albef/eval/eval_coco_retrieval.sh>`__
   * - TR
     - Flickr30k (`download <https://github.com/salesforce/LAVIS/blob/main/lavis/datasets/download_scripts/download_flickr.py>`__)
     - 77.6
     - 94.1
     - 97.2
     - `script <https://github.com/salesforce/LAVIS/blob/main/run_scripts/albef/train/train_flickr30k_retrieval_albef.sh>`__
     - `script <https://github.com/salesforce/LAVIS/blob/main/run_scripts/albef/eval/eval_flickr30k_retrieval.sh>`__
   * - IR
     - Flickr30k (`download <https://github.com/salesforce/LAVIS/blob/main/lavis/datasets/download_scripts/download_flickr.py>`__)
     - 61.0
     - 84.5
     - 90.7
     - `script <https://github.com/salesforce/LAVIS/blob/main/run_scripts/albef/train/train_flickr30k_retrieval_albef.sh>`__
     - `script <https://github.com/salesforce/LAVIS/blob/main/run_scripts/albef/eval/eval_flickr30k_retrieval.sh>`__


.. list-table::
   :widths: 20 20 20 20 20
   :header-rows: 1

   * - **VQA**
     - **test-dev**
     - **test-std/test**
     - **Training**
     - **Evaluation**
   * - VQAv2 (`download <https://github.com/salesforce/LAVIS/blob/main/lavis/datasets/download_scripts/download_coco.py>`__)
     - 76.35
     - 76.54
     - `script <https://github.com/salesforce/LAVIS/blob/main/run_scripts/albef/train/train_vqa_albef.sh>`__
     - `script <https://github.com/salesforce/LAVIS/blob/main/run_scripts/albef/eval/test_albef_vqa.sh>`__
   * - OKVQA (`download <https://github.com/salesforce/LAVIS/blob/main/lavis/datasets/download_scripts/download_coco.py>`__)
     - NA
     - 54.7 
     - `script <https://github.com/salesforce/LAVIS/blob/main/run_scripts/albef/train/train_okvqa_albef.sh>`__
     - NA
   * - AOKVQA (`download <https://github.com/salesforce/LAVIS/blob/main/lavis/datasets/download_scripts/download_coco.py>`__)
     - 54.5
     - NA
     - `script <https://github.com/salesforce/LAVIS/blob/main/run_scripts/albef/train/train_aokvqa_albef.sh>`__
     - NA

  
.. list-table::
   :widths: 20 20 20 20 20
   :header-rows: 1

   * - **Multimodal Classification**
     - **val**
     - **test**
     - **Training**
     - **Evaluation**
   * - SNLI-VE (`download <https://github.com/salesforce/LAVIS/blob/main/lavis/datasets/download_scripts/download_coco.py>`__)
     - 80.60
     - 81.04
     - `script <https://github.com/salesforce/LAVIS/blob/main/run_scripts/albef/train/train_ve_albef.sh>`__
     - `script <https://github.com/salesforce/LAVIS/blob/main/run_scripts/albef/eval/eval_albef_ve.sh>`__
   * - NLVR2 (`download <https://github.com/salesforce/LAVIS/blob/main/lavis/datasets/download_scripts/download_coco.py>`__)
     - 82.47 
     - 82.91 
     - `script <https://github.com/salesforce/LAVIS/blob/main/run_scripts/albef/train/train_nlvr_albef.sh>`__
     - `script <https://github.com/salesforce/LAVIS/blob/main/run_scripts/albef/eval/eval_albef_nlvr.sh>`__
  
BLIP
*******
.. list-table::
   :widths: 30 80 20

   * - **Pretraining (14M)**
     - COCO (`download <https://github.com/salesforce/LAVIS/blob/main/lavis/datasets/download_scripts/download_coco.py>`__)
     - `script <https://github.com/salesforce/LAVIS/blob/main/run_scripts/blip/train/pretrain.sh>`__
   * -
     - Visual Genome (`download <https://github.com/salesforce/LAVIS/blob/main/lavis/datasets/download_scripts/download_vg.py>`__)
     -
   * -
     - SBU (`download <https://github.com/salesforce/LAVIS/blob/main/lavis/datasets/download_scripts/download_sbu.py>`__)
     -
   * -
     - CC3M (`download <https://github.com/salesforce/LAVIS/blob/main/lavis/datasets/download_scripts/DownloadConceptualCaptions/download_data_cc3m.py>`__)
     -
   * -
     - CC12M (`download <https://github.com/salesforce/LAVIS/blob/main/lavis/datasets/download_scripts/DownloadConceptualCaptions/download_data_cc12m.py>`__)
     -

.. list-table::
   :widths: 30 40 20 20 20 30 30
   :header-rows: 1

   * - **Tasks**
     - **Retrieval**
     - **R1**
     - **R5**
     - **R10**
     - **Training**
     - **Evaluation**
   * - TR
     - COCO (`download <https://github.com/salesforce/LAVIS/blob/main/lavis/datasets/download_scripts/download_coco.py>`__)
     - 82.0
     - 95.8
     - 98.1
     - `script <https://github.com/salesforce/LAVIS/blob/main/run_scripts/blip/train/train_retrieval_coco.sh>`__
     - `script <https://github.com/salesforce/LAVIS/blob/main/run_scripts/blip/eval/eval_ret_coco.sh>`__
   * - IR
     - COCO (`download <https://github.com/salesforce/LAVIS/blob/main/lavis/datasets/download_scripts/download_coco.py>`__)
     - 64.5
     - 86.0
     - 91.7
     - `script <https://github.com/salesforce/LAVIS/blob/main/run_scripts/blip/train/train_retrieval_coco.sh>`__
     - `script <https://github.com/salesforce/LAVIS/blob/main/run_scripts/blip/eval/eval_ret_coco.sh>`__
   * - TR
     - Flickr30k (`download <https://github.com/salesforce/LAVIS/blob/main/lavis/datasets/download_scripts/download_flickr.py>`__)
     - 96.9
     - 99.9
     - 100.0
     - `script <https://github.com/salesforce/LAVIS/blob/main/run_scripts/blip/train/train_retrieval_flickr.sh>`__
     - `script <https://github.com/salesforce/LAVIS/blob/main/run_scripts/blip/eval/eval_ret_flickr.sh>`__
   * - IR
     - Flickr30k (`download <https://github.com/salesforce/LAVIS/blob/main/lavis/datasets/download_scripts/download_flickr.py>`__)
     - 87.5
     - 97.6
     - 98.9
     - `script <https://github.com/salesforce/LAVIS/blob/main/run_scripts/blip/train/train_retrieval_flickr.sh>`__
     - `script <https://github.com/salesforce/LAVIS/blob/main/run_scripts/blip/eval/eval_ret_flickr.sh>`__


.. list-table::
   :widths: 20 20 20 20 20
   :header-rows: 1

   * - **VQA**
     - **test-dev**
     - **test-std/test**
     - **Training**
     - **Evaluation**
   * - VQAv2 (`download <https://github.com/salesforce/LAVIS/blob/main/lavis/datasets/download_scripts/download_coco.py>`__)
     - 78.23
     - 78.29
     - `script <https://github.com/salesforce/LAVIS/blob/main/run_scripts/albef/train/train_vqa_albef.sh>`__
     - `script <https://github.com/salesforce/LAVIS/blob/main/run_scripts/albef/eval/test_albef_vqa.sh>`__
   * - OKVQA (`download <https://github.com/salesforce/LAVIS/blob/main/lavis/datasets/download_scripts/download_coco.py>`__)
     - NA
     - 55.4 
     - `script <https://github.com/salesforce/LAVIS/blob/main/run_scripts/blip/train/train_okvqa.sh>`__
     - `script <https://github.com/salesforce/LAVIS/blob/main/run_scripts/blip/eval/eval_okvqa.sh>`__
   * - AOKVQA (`download <https://github.com/salesforce/LAVIS/blob/main/lavis/datasets/download_scripts/download_coco.py>`__)
     - 56.2
     - 50.1 
     - `script <https://github.com/salesforce/LAVIS/blob/main/run_scripts/blip/train/train_aokvqa.sh>`__
     - `script <https://github.com/salesforce/LAVIS/blob/main/run_scripts/blip/eval/eval_aokvqa.sh>`__


.. list-table::
   :widths: 20 20 20 20 20 20
   :header-rows: 1

   * - **Image Captioning**
     - **BLEU@4**
     - **CIDEr**
     - **SPICE**
     - **Training**
     - **Evaluation**
   * - COCO (`download <https://github.com/salesforce/LAVIS/blob/main/lavis/datasets/download_scripts/download_coco.py>`__)
     - 39.9
     - 133.5
     - 23.7
     - `script <https://github.com/salesforce/LAVIS/blob/main/run_scripts/blip/train/train_caption_coco.sh>`__
     - `script <https://github.com/salesforce/LAVIS/blob/main/run_scripts/blip/eval/eval_coco_cap.sh>`__
   * - NoCaps (`download <https://github.com/salesforce/LAVIS/blob/main/lavis/datasets/download_scripts/download_nocaps.py>`__)
     - 31.9
     - 109.1
     - 14.7
     - NA
     - `script <https://github.com/salesforce/LAVIS/blob/main/run_scripts/blip/eval/eval_nocaps.sh>`__


.. list-table::
   :widths: 20 20 20 20 20
   :header-rows: 1

   * - **Multimodal Classification**
     - **val**
     - **test**
     - **Training**
     - **Evaluation**
   * - NLVR2 (`download <https://github.com/salesforce/LAVIS/blob/main/lavis/datasets/download_scripts/download_coco.py>`__)
     - 82.48
     - 83.25
     - `script <https://github.com/salesforce/LAVIS/blob/main/run_scripts/blip/train/train_nlvr.sh>`__
     - `script <https://github.com/salesforce/LAVIS/blob/main/run_scripts/blip/eval/eval_nlvr.sh>`__

CLIP
*******
.. list-table::
   :widths: 30 40 20 20 20 30
   :header-rows: 1

   * - **Tasks**
     - **Retrieval (Zero-shot)**
     - **R1**
     - **R5**
     - **R10**
     - **Evaluation**
   * - TR
     - COCO (`download <https://github.com/salesforce/LAVIS/blob/main/lavis/datasets/download_scripts/download_coco.py>`__)
     - 57.2
     - 80.5
     - 87.8
     - `script <https://github.com/salesforce/LAVIS/blob/main/run_scripts/clip/eval/eval_clip_ret_coco.sh>`__
   * - IR
     - COCO (`download <https://github.com/salesforce/LAVIS/blob/main/lavis/datasets/download_scripts/download_coco.py>`__)
     - 36.5
     - 60.8
     - 71.0
     - `script <https://github.com/salesforce/LAVIS/blob/main/run_scripts/clip/eval/eval_clip_ret_coco.sh>`__
   * - TR
     - Flickr30k (`download <https://github.com/salesforce/LAVIS/blob/main/lavis/datasets/download_scripts/download_flickr.py>`__)
     - 86.5
     - 98.0
     - 99.1
     - `script <https://github.com/salesforce/LAVIS/blob/main/run_scripts/clip/eval/eval_clip_ret_flickr.sh>`__
   * - IR
     - Flickr30k (`download <https://github.com/salesforce/LAVIS/blob/main/lavis/datasets/download_scripts/download_flickr.py>`__)
     - 67.0
     - 88.9
     - 93.3
     - `script <https://github.com/salesforce/LAVIS/blob/main/run_scripts/clip/eval/eval_clip_ret_flickr.sh>`__

.. list-table::
   :widths: 20 20 20
   :header-rows: 1

   * - **Multimodal Classification**
     - **val**
     - **Evaluation**
   * - ImageNet 
     - 76.5 
     - `script <https://github.com/salesforce/LAVIS/blob/main/run_scripts/clip/eval/eval_clip_zs_imnet.sh>`__


ALPRO
*******
.. list-table::
   :widths: 30 40 20 20 20 20 30
   :header-rows: 1

   * - **Tasks**
     - **Retrieval**
     - **R1**
     - **R5**
     - **R10**
     - **Training**
     - **Evaluation**
   * - TR
     - MSRVTT (`download <https://github.com/salesforce/LAVIS/blob/main/lavis/datasets/download_scripts/download_msrvtt.py>`__)
     - 33.2
     - 60.5 
     - 71.7 
     - `script <https://github.com/salesforce/LAVIS/blob/main/run_scripts/alpro/train/train_msrvtt_ret.sh>`__
     - `script <https://github.com/salesforce/LAVIS/blob/main/run_scripts/alpro/eval/eval_msrvtt_ret.sh>`__
   * - VR
     - MSRVTT (`download <https://github.com/salesforce/LAVIS/blob/main/lavis/datasets/download_scripts/download_msrvtt.py>`__)
     - 33.8
     - 61.4
     - 72.7
     - `script <https://github.com/salesforce/LAVIS/blob/main/run_scripts/alpro/train/train_msrvtt_ret.sh>`__
     - `script <https://github.com/salesforce/LAVIS/blob/main/run_scripts/alpro/eval/eval_msrvtt_ret.sh>`__
   * - TR
     - DiDeMo (`download <https://github.com/salesforce/LAVIS/blob/main/lavis/datasets/download_scripts/download_didemo.py>`__)
     - 38.8 
     - 66.4
     - 76.8
     - `script <https://github.com/salesforce/LAVIS/blob/main/run_scripts/alpro/train/train_didemo_ret.sh>`__
     - `script <https://github.com/salesforce/LAVIS/blob/main/run_scripts/alpro/eval/eval_didemo_ret.sh>`__
   * - VR
     - DiDeMo (`download <https://github.com/salesforce/LAVIS/blob/main/lavis/datasets/download_scripts/download_didemo.py>`__)
     - 36.6
     - 67.5
     - 77.9
     - `script <https://github.com/salesforce/LAVIS/blob/main/run_scripts/alpro/train/train_didemo_ret.sh>`__
     - `script <https://github.com/salesforce/LAVIS/blob/main/run_scripts/alpro/eval/eval_didemo_ret.sh>`__

.. list-table::
   :widths: 20 20 20 20
   :header-rows: 1

   * - **Video QA**
     - **test**
     - **Training**
     - **Evaluation**
   * - MSRVTT 
     - 42.1 
     - `script <https://github.com/salesforce/LAVIS/blob/main/run_scripts/alpro/train/train_msrvtt_qa.sh>`__
     - `script <https://github.com/salesforce/LAVIS/blob/main/run_scripts/alpro/eval/eval_msrvtt_qa.sh>`__
   * - MSVD 
     - 46.0 
     - `script <https://github.com/salesforce/LAVIS/blob/main/run_scripts/alpro/train/train_msvd_qa.sh>`__ 
     - `script <https://github.com/salesforce/LAVIS/blob/main/run_scripts/alpro/eval/eval_msvd_qa.sh>`__
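
The R1, R5 and R10 columns in the retrieval tables above report recall at K: the percentage of queries whose ground-truth match appears among the top K ranked candidates. The following is a minimal pure-Python sketch of that metric, not the evaluation code used by LAVIS. The function name ``recall_at_k`` and the toy score matrix are illustrative, and each query is assumed to have exactly one correct candidate (benchmarks with several reference captions per image usually count a hit if any of them is in the top K).

.. code-block:: python

    def recall_at_k(scores, gt_index, k):
        """Percentage of queries whose correct candidate is ranked in the top k.

        scores[i][j] is the similarity between query i and candidate j;
        gt_index[i] is the index of the correct candidate for query i.
        """
        hits = 0
        for query_scores, gt in zip(scores, gt_index):
            # Rank candidate indices by descending similarity.
            ranking = sorted(range(len(query_scores)), key=lambda j: -query_scores[j])
            if gt in ranking[:k]:
                hits += 1
        return 100.0 * hits / len(scores)

    # Toy example: 3 queries, 4 candidates (made-up similarities).
    scores = [
        [0.9, 0.1, 0.3, 0.2],  # correct candidate 0 is ranked 1st
        [0.2, 0.4, 0.8, 0.5],  # correct candidate 1 is ranked 3rd
        [0.1, 0.2, 0.3, 0.7],  # correct candidate 3 is ranked 1st
    ]
    gt_index = [0, 1, 3]

    r1 = recall_at_k(scores, gt_index, k=1)  # two of three queries hit
    r3 = recall_at_k(scores, gt_index, k=3)  # all three queries hit

Recall can only grow as K grows, which is why each row in the tables satisfies R1 <= R5 <= R10.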

================================================
FILE: docs/build_docs.sh
================================================
#!/bin/bash
set -euo pipefail

# Change to root directory of repo
DIRNAME=$(cd "$( dirname "${BASH_SOURCE[0]}" )" &> /dev/null && pwd)
cd "${DIRNAME}/.."

# # Set up virtual environment
pip3 install setuptools wheel virtualenv
if [ ! -d venv ]; then
  rm -f venv
  virtualenv venv
fi
source venv/bin/activate

# # Get current git branch & stash unsaved changes
GIT_BRANCH=$(git branch --show-current)
if [ -z "${GIT_BRANCH}" ]; then
    GIT_BRANCH="main"
fi
git stash

# Set up exit handler to restore git state & delete temp branches
# function exit_handler {
#     git reset --hard
#     git checkout "${GIT_BRANCH}" --
#     git stash pop || true
#     for version in $(git tag --list 'v[0-9]*'); do
#         branch="${version}_local_docs_only"
#         if git show-ref --verify --quiet "refs/heads/$branch"; then
#             git branch -D "$branch"
#         fi
#     done
# }
# trap exit_handler EXIT

# Clean up build directory and install Sphinx requirements
pip3 install -r "${DIRNAME}/requirements.txt"
sphinx-build -M clean "${DIRNAME}" "${DIRNAME}/_build"

# Build API docs for current head
export current_version="latest"
pip3 install "."
sphinx-build -b html "${DIRNAME}" "${DIRNAME}/_build/html/${current_version}" -W --keep-going
rm -rf "${DIRNAME}/_build/html/${current_version}/.doctrees"
#pip3 uninstall -y omnixai

# Install all previous released versions
# and use them to build the appropriate API docs.
# Uninstall after we're done with each one.
# versions=()
# checkout_files=("${DIRNAME}/*.rst" "lavis" "tutorials" "setup.py")
# for version in $(git tag --list 'v[0-9]*'); do
#     versions+=("$version")
#     git checkout -b "${version}_local_docs_only"
#     for f in $(git diff --name-only --diff-filter=A "tags/${version}" "${DIRNAME}/*.rst"); do
#         git rm "$f"
#     done
#     git checkout "tags/${version}" -- "${checkout_files[@]}"
#     export current_version=${version}
#     pip3 install ".[all]"
#     sphinx-build -b html "${DIRNAME}" "${DIRNAME}/_build/html/${current_version}" -W --keep-going
#     rm -rf "${DIRNAME}/_build/html/${current_version}/.doctrees"
#     #pip3 uninstall -y omnixai
#     git reset --hard
#     git checkout "${GIT_BRANCH}" --
# done

# Determine the latest stable version if there is one
# if (( ${#versions[@]} > 0 )); then
#   stable_hash=$(git rev-list --tags --max-count=1)
#   stable_version=$(git describe --tags "$stable_hash")
#   export stable_version
# else
export stable_version="latest"
# fi

# Create dummy HTML's for the stable version in the base directory
while read -r filename; do
    filename=$(echo "$filename" | sed "s/\.\///")
    n_sub=$(echo "$filename" | (grep -o "/" || true) | wc -l)
    prefix=""
    for (( i=0; i<n_sub; i++ )); do
        prefix+="../"
    done
    url="${prefix}${stable_version}/$filename"
    mkdir -p "${DIRNAME}/_build/html/$(dirname "$filename")"
    cat > "${DIRNAME}/_build/html/$filename" <<EOF
<!DOCTYPE html>
<html>
   <head>
      <title>LAVIS Documentation</title>
      <meta http-equiv = "refresh" content="0; url='$url'" />
   </head>
   <body>
      <p>Please wait while you're redirected to our <a href="$url">documentation</a>.</p>
   </body>
</html>
EOF
done < <(cd "${DIRNAME}/_build/html/$stable_version" && find . -name "*.html")
echo "Finished writing to _build/html."

================================================
FILE: docs/conf.py
================================================
# Configuration file for the Sphinx documentation builder.
#
# This file only contains a selection of the most common options. For a full
# list see the documentation:
# https://www.sphinx-doc.org/en/master/usage/configuration.html

# -- Path setup --------------------------------------------------------------

# If extensions (or modules to document with autodoc) are in another directory,
# add these directories to sys.path here. If the directory is relative to the
# documentation root, use os.path.abspath to make it absolute, like shown here.
#
# import os
# import sys
# sys.path.insert(0, os.path.abspath('.'))


# -- Project information -----------------------------------------------------

project = "LAVIS"
copyright = "2022, salesforce.com inc."
author = (
    "Dongxu Li, Junnan Li, Hung Le, Guangsen Wang, Silvio Savarese, Steven C.H. Hoi"
)


# -- General configuration ---------------------------------------------------

# Add any Sphinx extension module names here, as strings. They can be
# extensions coming with Sphinx (named 'sphinx.ext.*') or your custom
# ones.
extensions = ["nbsphinx"]

# Add any paths that contain templates here, relative to this directory.
templates_path = ["_templates"]

# List of patterns, relative to source directory, that match files and
# directories to ignore when looking for source files.
# This pattern also affects html_static_path and html_extra_path.
exclude_patterns = []


# -- Options for HTML output -------------------------------------------------

# The theme to use for HTML and HTML Help pages.  See the documentation for
# a list of builtin themes.
#
# html_theme = "alabaster"
html_theme = "sphinx_rtd_theme"

# Add any paths that contain custom static files (such as style sheets) here,
# relative to this directory. They are copied after the builtin static files,
# so a file named "default.css" will overwrite the builtin "default.css".
html_static_path = ["_static"]

# pygments_style = "sphinx"


================================================
FILE: docs/getting_started.rst
================================================
Dataset Zoo
##################
LAVIS supports a wide variety of common language-vision datasets. It provides scripts that automatically download and organize these datasets,
and implements PyTorch datasets for them. To view the supported datasets, use the following code:

.. code-block:: python

    from lavis.datasets.builders import dataset_zoo
    dataset_names = dataset_zoo.get_names()
    print(dataset_names)
    # ['aok_vqa', 'coco_caption', 'coco_retrieval', 'coco_vqa', 'conceptual_caption_12m',
    #  'conceptual_caption_3m', 'didemo_retrieval', 'flickr30k', 'imagenet', 'laion2B_multi',
    #  'msrvtt_caption', 'msrvtt_qa', 'msrvtt_retrieval', 'msvd_caption', 'msvd_qa', 'nlvr',
    #  'nocaps', 'ok_vqa', 'sbu_caption', 'snli_ve', 'vatex_caption', 'vg_caption', 'vg_vqa']
    print(len(dataset_names))
    # 23


Auto-Downloading and Loading Datasets
######################################
We now take the COCO caption dataset as an example to demonstrate how to download and prepare a dataset.

In ``lavis/datasets/download_scripts/``, we provide tools to download the most common public language-vision datasets supported by LAVIS.
The COCO caption dataset uses images from the COCO dataset. Therefore, we first download the COCO images via:

.. code-block:: bash
    
    cd lavis/datasets/download_scripts/ && python download_coco.py

This will automatically download and extract COCO images to the default LAVIS cache location.
The default cache location is ``~/.cache/lavis``, defined in ``lavis/configs/default.yaml``.

After downloading the images, we can use ``load_dataset()`` to obtain the dataset. On the first run, this will automatically download and cache annotation files.

.. code-block:: python

    from lavis.datasets.builders import load_dataset
    coco_dataset = load_dataset("coco_caption")

    print(coco_dataset.keys())
    # dict_keys(['train', 'val', 'test'])

    print(len(coco_dataset["train"]))
    # 566747

    print(coco_dataset["train"][0])
    # {'image': <PIL.Image.Image image mode=RGB size=640x480>,
    #  'text_input': 'A woman wearing a net on her head cutting a cake. ',
    #  'image_id': 0}
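
The length printed above exceeds the number of distinct COCO images, which is consistent with one sample per image-caption pair. The sketch below illustrates that layout with a map-style dataset written in plain Python. It is a simplified illustration under stated assumptions, not the LAVIS implementation: ``ToyCaptionDataset`` and the annotation fields are hypothetical, and the image is kept as a file name instead of being loaded as a PIL image.

.. code-block:: python

    class ToyCaptionDataset:
        """Map-style dataset with one sample per (image, caption) annotation."""

        def __init__(self, annotations):
            self.annotations = annotations
            # Assign a stable integer id to each distinct image file.
            self.img_ids = {}
            for ann in annotations:
                self.img_ids.setdefault(ann["image"], len(self.img_ids))

        def __len__(self):
            # The length counts annotations, not distinct images.
            return len(self.annotations)

        def __getitem__(self, index):
            ann = self.annotations[index]
            return {
                # A real dataset would load and transform the image here.
                "image": ann["image"],
                "text_input": ann["caption"],
                "image_id": self.img_ids[ann["image"]],
            }

    # Hypothetical annotations: two captions for one image, one for another.
    annotations = [
        {"image": "img_a.jpg", "caption": "A woman cutting a cake."},
        {"image": "img_a.jpg", "caption": "A person slices a cake at a party."},
        {"image": "img_b.jpg", "caption": "A dog runs on the beach."},
    ]
    dataset = ToyCaptionDataset(annotations)

Here ``len(dataset)`` is 3 even though there are only two images, and the first two samples share the same ``image_id``.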

If you already have a local copy of the dataset, pass the ``vis_path`` argument to load images from that location instead of the default one.

.. code-block:: python

    coco_dataset = load_dataset("coco_caption", vis_path=YOUR_LOCAL_PATH)


Model Zoo
####################################
LAVIS supports a growing list of pre-trained models of varying sizes for different tasks
and datasets. Let's get started by viewing the supported models.

.. code-block:: python

    from lavis.models import model_zoo
    print(model_zoo)
    # ==================================================
    # Architectures                  Types
    # ==================================================
    # albef_classification           base, ve
    # albef_nlvr                     base
    # albef_pretrain                 base
    # albef_retrieval                base, coco, flickr
    # albef_vqa                      base, vqav2
    # alpro_qa                       base, msrvtt, msvd
    # alpro_retrieval                base, msrvtt, didemo
    # blip_caption                   base, base_coco, large, large_coco
    # blip_classification            base
    # blip_feature_extractor         base
    # blip_nlvr                      base
    # blip_pretrain                  base
    # blip_retrieval                 base, coco, flickr
    # blip_vqa                       base, vqav2
    # clip                           ViT-B-32, ViT-B-16, ViT-L-14, ViT-L-14-336, RN50

    # show total number of supported model variants
    len(model_zoo)
    # 33


Inference with Pre-trained Models
####################################

Now let's see how to use models in LAVIS to perform inference on example data. We first
load a sample image from a local file.

.. code-block:: python

    import torch
    from PIL import Image

    # setup device to use
    device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

    # load sample image
    raw_image = Image.open("docs/_static/merlion.png").convert("RGB")

This example image shows `Merlion park <https://en.wikipedia.org/wiki/Merlion>`_ (`image credit <https://theculturetrip.com/asia/singapore/articles/what-exactly-is-singapores-merlion-anyway/>`_), a landmark in Singapore.

.. image:: _static/merlion.png

Image Captioning
*******************************
We now use the BLIP model to generate a caption for the image. To make inference even easier, each
pre-trained model is associated with its preprocessors (transforms). We load both with ``load_model_and_preprocess()``, which takes the following arguments:

- ``name``: The name of the model to load. This could be a pre-trained model, task model, or feature extractor. See ``model_zoo`` for a full list of model names.
- ``model_type``: Each architecture has variants trained on different datasets and at different scales. See the Types column in ``model_zoo`` for a full list of model types.
- ``is_eval``: if ``True``, set the model to evaluation mode. This is desired for inference or feature extraction.
- ``device``: device to load the model to.

.. code-block:: python

    from lavis.models import load_model_and_preprocess
    # loads BLIP caption base model, with finetuned checkpoints on MSCOCO captioning dataset.
    # this also loads the associated image processors
    model, vis_processors, _ = load_model_and_preprocess(name="blip_caption", model_type="base_coco", is_eval=True, device=device)

    # preprocess the image
    # vis_processors stores image transforms for "train" and "eval" (validation / testing / inference)
    image = vis_processors["eval"](raw_image).unsqueeze(0).to(device)

    # generate caption
    model.generate({"image": image})
    # ['a large fountain spewing water into the air']


You may also load models and their preprocessors separately via ``load_model()`` and ``load_processor()``.
In BLIP, you can also generate diverse captions by turning nucleus sampling on.

.. code-block:: python

    from lavis.processors import load_processor
    from lavis.models import load_model

    # load image preprocessor used for BLIP
    vis_processor = load_processor("blip_image_eval").build(image_size=384)
    model = load_model(name="blip_caption", model_type="base_coco", is_eval=True, device=device)

    # preprocess the raw PIL image, then pass the resulting tensor to the model
    image = vis_processor(raw_image).unsqueeze(0).to(device)
    model.generate({"image": image}, use_nucleus_sampling=True)
    # one generated random sample: ['some very pretty buildings and some water jets']
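
Nucleus (top-p) sampling draws each token from the smallest set of most probable tokens whose cumulative probability reaches a threshold, which is why repeated calls can return different captions. The following pure-Python sketch shows a single sampling step under that definition. It is not the generation code used by LAVIS or BLIP; ``nucleus_sample``, ``top_p`` and the toy distribution are illustrative names and values.

.. code-block:: python

    import random

    def nucleus_sample(probs, top_p=0.9, rng=random):
        """Sample a token index from the smallest set of most probable
        tokens whose cumulative probability reaches top_p."""
        # Sort token indices from most to least probable.
        order = sorted(range(len(probs)), key=lambda i: -probs[i])

        nucleus, cumulative = [], 0.0
        for i in order:
            nucleus.append(i)
            cumulative += probs[i]
            if cumulative >= top_p:
                break

        # random.choices treats the weights as relative, so the kept
        # probabilities are effectively renormalized within the nucleus.
        weights = [probs[i] for i in nucleus]
        return rng.choices(nucleus, weights=weights, k=1)[0]

    # Toy next-token distribution over a 5-token vocabulary.
    probs = [0.5, 0.3, 0.15, 0.04, 0.01]
    rng = random.Random(0)  # seeded for reproducibility
    samples = [nucleus_sample(probs, top_p=0.9, rng=rng) for _ in range(1000)]

With ``top_p=0.9`` the nucleus contains only the three most probable tokens, so the two low-probability tokens are never sampled.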


Visual question answering (VQA)
*******************************
The BLIP model can answer free-form questions about images in natural language.
To access the VQA model, simply replace the ``name`` and ``model_type`` arguments 
passed to ``load_model_and_preprocess()``.

.. code-block:: python

    from lavis.models import load_model_and_preprocess
    model, vis_processors, txt_processors = load_model_and_preprocess(name="blip_vqa", model_type="vqav2", is_eval=True, device=device)

    # ask a random question.
    question = "Which city is this photo taken?"
    
    image = vis_processors["eval"](raw_image).unsqueeze(0).to(device)
    question = txt_processors["eval"](question)

    model.predict_answers(samples={"image": image, "text_input": question}, inference_method="generate")
    # ['singapore']
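
Text processors normalize raw text before it reaches the model; typical steps include lowercasing and truncation. The exact rules are model-specific, so the following is only a hypothetical sketch of such a processor. ``simple_question_processor`` and ``max_words`` are illustrative names, not part of the LAVIS API.

.. code-block:: python

    import re

    def simple_question_processor(question, max_words=50):
        """Lowercase, strip punctuation and truncate a question."""
        question = question.lower()
        # Replace every character that is neither a word character nor whitespace.
        question = re.sub(r"[^\w\s]", " ", question)
        # Splitting and re-joining also collapses repeated whitespace.
        words = question.split()
        return " ".join(words[:max_words])

    processed = simple_question_processor("Which city is this photo taken?")
    # 'which city is this photo taken'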


Unified Feature Extraction Interface
####################################

LAVIS provides a unified interface to extract multimodal features from each architecture.
To extract features, we load the feature extractor variants of each model.
The multimodal feature can be used for multimodal classification. The low-dimensional unimodal features can be used to compute cross-modal similarity.

.. code-block:: python

    from lavis.models import load_model_and_preprocess 
    
    model, vis_processors, txt_processors = load_model_and_preprocess(name="blip_feature_extractor", model_type="base", is_eval=True, device=device)
    caption = "a large fountain spewing water into the air"

    image = vis_processors["eval"](raw_image).unsqueeze(0).to(device)
    text_input = txt_processors["eval"](caption)

    sample = {"image": image, "text_input": [text_input]}

    features_multimodal = model.extract_features(sample)
    print(features_multimodal.keys())
    # odict_keys(['image_embeds', 'multimodal_embeds'])
    print(features_multimodal.multimodal_embeds.shape)
    # torch.Size([1, 12, 768]), use features_multimodal.multimodal_embeds[:, 0, :] for multimodal classification tasks

    features_image = model.extract_features(sample, mode="image")
    print(features_image.keys())
    # odict_keys(['image_embeds', 'image_embeds_proj'])
    print(features_image.image_embeds.shape)
    # torch.Size([1, 197, 768])
    print(features_image.image_embeds_proj.shape)
    # torch.Size([1, 197, 256])

    features_text = model.extract_features(sample, mode="text")
    print(features_text.keys())
    # odict_keys(['text_embeds', 'text_embeds_proj'])
    print(features_text.text_embeds.shape)
    # torch.Size([1, 12, 768])
    print(features_text.text_embeds_proj.shape)
    # torch.Size([1, 12, 256])
    
    similarity = features_image.image_embeds_proj[:, 0, :] @ features_text.text_embeds_proj[:, 0, :].t()
    print(similarity)
    # tensor([[0.2622]])
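
The similarity above is a dot product between the projected embeddings of the first token of each modality. If those projected vectors are L2-normalized, which this sketch assumes, the dot product equals their cosine similarity and lies between -1 and 1. The example below reproduces the computation in plain Python with made-up 3-dimensional vectors in place of the 256-dimensional projections; the helper names are illustrative.

.. code-block:: python

    import math

    def l2_normalize(vec):
        """Scale a vector to unit Euclidean length."""
        norm = math.sqrt(sum(x * x for x in vec))
        return [x / norm for x in vec]

    def dot(a, b):
        """Dot product of two equal-length vectors."""
        return sum(x * y for x, y in zip(a, b))

    # Made-up stand-ins for the projected image and text embeddings.
    image_vec = l2_normalize([3.0, 4.0, 0.0])
    text_vec = l2_normalize([4.0, 3.0, 0.0])

    # For unit vectors the dot product is the cosine similarity.
    similarity = dot(image_vec, text_vec)  # approximately 0.96

A higher value indicates that the image and the text are closer in the shared embedding space.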

Since LAVIS supports a unified feature extraction interface, minimal changes are necessary to use a different model as feature extractor. For example,
to use ALBEF as the feature extractor, one only needs to change the following line:

.. code-block:: python

    model, vis_processors, txt_processors = load_model_and_preprocess(name="albef_feature_extractor", model_type="base", is_eval=True, device=device)

Similarly, to use CLIP as feature extractor: 

.. code-block:: python

    model, vis_processors, txt_processors = load_model_and_preprocess(name="clip_feature_extractor", model_type="base", is_eval=True, device=device)
    # model, vis_processors, txt_processors = load_model_and_preprocess(name="clip_feature_extractor", model_type="RN50", is_eval=True, device=device)
    # model, vis_processors, txt_processors = load_model_and_preprocess(name="clip_feature_extractor", model_type="ViT-L-14", is_eval=True, device=device)


================================================
FILE: docs/index.rst
================================================
.. LAVIS documentation master file, created by
   sphinx-quickstart on Sun Jul 31 10:32:27 2022.
   You can adapt this file completely to your liking, but it should at least
   contain the root `toctree` directive.

Welcome to LAVIS's documentation!
=================================

.. toctree::
   :maxdepth: 1
   :caption: Introduction

   intro


.. toctree::
   :maxdepth: 1
   :caption: Getting Started

   getting_started


..    :maxdepth: 1
..    :caption: Advanced Training

..    advanced_training


.. toctree::
   :maxdepth: 2
   :caption: Advanced Usage

   benchmark
   tutorial


.. Documentations
.. ===================


Indices and tables
==================

* :ref:`genindex`
* :ref:`modindex`
* :ref:`search`


================================================
FILE: docs/intro.rst
================================================
What is LAVIS?
####################################

LAVIS is a Python deep learning library for LAnguage-and-VISion research and applications.
It features a unified design to access state-of-the-art foundation language-vision models (`ALBEF <https://arxiv.org/pdf/2107.07651.pdf>`_,
`BLIP <https://arxiv.org/pdf/2201.12086.pdf>`_, `ALPRO <https://arxiv.org/pdf/2112.09583.pdf>`_, `CLIP <https://arxiv.org/pdf/2103.00020.pdf>`_), common tasks 
(retrieval, captioning, visual question answering, multimodal classification, etc.) and datasets (COCO, Flickr, NoCaps, Conceptual
Captions, SBU, etc.).

This library aims to provide engineers and researchers with a one-stop solution to rapidly develop models for their specific multimodal
scenarios, and benchmark them across standard and customized datasets. 

Key features of LAVIS include:

- **Modular and Extensible Library Design**: makes it easy to use and repurpose existing modules (datasets, models, preprocessors) and to add new ones.

- **Easy Off-the-shelf Inference and Feature Extraction**: readily available pre-trained models let you take advantage of state-of-the-art multimodal understanding and generation capabilities on your own data.

- **Reproducible Model Zoo**: training and pre-training recipes are provided to easily replicate and extend state-of-the-art models.

- **Dataset Zoo and Automatic Downloading Tools**: it can be a hassle to prepare the many language-vision datasets. LAVIS provides automatic downloading scripts to help prepare a large variety of datasets and their annotations.

Other features include:

- **Distributed Training** using multiple GPUs on one machine or across multiple machines.

- **Web Demo**: try supported models on your own pictures, questions etc.

- **Leaderboard**: comparing state-of-the-art models across standard datasets. 

- **Dataset Explorer**: help browse and understand language-vision datasets.

Supported Tasks, Models and Datasets
####################################

The following table shows the models and language-vision tasks supported by LAVIS. Adapting existing models to more tasks is possible and planned for future releases.

======================================== =========================== ============================================= ============ 
Tasks                                     Supported Models            Supported Datasets                            Modalities  
======================================== =========================== ============================================= ============ 
Image-text Pre-training                   ALBEF, BLIP                 COCO, VisualGenome, SBU, ConceptualCaptions  image, text  
Image-text Retrieval                      ALBEF, BLIP, CLIP           COCO, Flickr30k                              image, text  
Text-image Retrieval                      ALBEF, BLIP, CLIP           COCO, Flickr30k                              image, text  
Visual Question Answering                 ALBEF, BLIP                 VQAv2, OKVQA, A-OKVQA                        image, text  
Image Captioning                          BLIP                        COCO, NoCaps                                 image, text  
Image Classification                      CLIP                        ImageNet                                     image        
Natural Language Visual Reasoning (NLVR)  ALBEF, BLIP                 NLVR2                                        image, text  
Visual Entailment (VE)                    ALBEF                       SNLI-VE                                      image, text  
Visual Dialogue                           BLIP                        VisDial                                      image, text  
Video-text Retrieval                      BLIP, ALPRO                 MSRVTT, DiDeMo                               video, text  
Text-video Retrieval                      BLIP, ALPRO                 MSRVTT, DiDeMo                               video, text  
Video Question Answering (VideoQA)        BLIP, ALPRO                 MSRVTT, MSVD                                 video, text  
Video Dialogue                            VGD-GPT                     AVSD                                         video, text  
Multimodal Feature Extraction             ALBEF, CLIP, BLIP, ALPRO    customized                                   image, text  
======================================== =========================== ============================================= ============ 

Library Design
####################################

.. image:: _static/architecture.png
  :width: 550

LAVIS has six key modules.

- ``lavis.runners`` manages the overall training and evaluation lifecycle. It is also responsible for lazily creating required components on demand, such as optimizers, learning rate schedulers and dataloaders. Currently ``RunnerBase`` implements epoch-based training and ``RunnerIter`` implements iteration-based training.
- ``lavis.tasks`` implements concrete training and evaluation logic per task. A task could be, for example, retrieval, captioning, pre-training. The rationale to have an abstraction of task is to accommodate task-specific training and evaluation. For example, evaluating a retrieval model is different from a classification model.
- ``lavis.datasets`` is responsible for creating datasets, where ``lavis.datasets.builders`` loads dataset configurations, downloads annotations and returns a dataset object; ``lavis.datasets.datasets`` defines the supported datasets, each is a ``torch.utils.data.Dataset`` instance. We also provide `automatic dataset downloading tools` in ``datasets/download_scripts`` to help prepare common public datasets.
- ``lavis.models`` holds definition for the supported models and shared model layers.
- ``lavis.processors`` handles preprocessing of text and images/videos before feeding the model. For images and videos, a processor can be thought as transfroms in torchvision; for text input, this may include lowering case, truncation etc.
- ``lavis.common`` module contains shared classes and methods used by multiple other modules. For example,

   - ``lavis.common.config`` contains classes to store and manipulate configuration files used by LAVIS. In particular, we use a hierarchical configuration design, to allow highly customizable training and evaluation.
   - ``lavis.common.registry`` serves as a centralized place to manage modules that share the same functionalities. It allows building datasets, models, tasks, and learning rate schedulers at runtime by specifying their names as strings in the configuration file.
   - ``lavis.common.optims`` contains definitions of learning rate schedulers.
   - ``lavis.common.dist_utils`` contains utilities for distributed training and evaluation.
   - ``lavis.common.utils`` contains miscellaneous utilities, mostly IO-related helper functions.
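
The name-based lookup described for ``lavis.common.registry`` can be pictured with a minimal, standard-library-only sketch. ``MiniRegistry`` and ``ToyCaptionBuilder`` are hypothetical stand-ins written for illustration; they are not the LAVIS implementation, only the general pattern of registering a class under a string and retrieving it later from a configuration value.

.. code-block:: python

    class MiniRegistry:
        """Hypothetical, simplified registry mapping string names to classes."""

        def __init__(self):
            self._builders = {}

        def register_builder(self, name):
            # Return a decorator that stores the decorated class under ``name``.
            def wrap(cls):
                if name in self._builders:
                    raise KeyError(f"builder '{name}' is already registered")
                self._builders[name] = cls
                return cls

            return wrap

        def get_builder_class(self, name):
            # Return None when nothing is registered under ``name``.
            return self._builders.get(name)


    mini_registry = MiniRegistry()


    @mini_registry.register_builder("toy_caption")
    class ToyCaptionBuilder:
        pass


    # In practice the name would be read from a configuration file.
    builder_cls = mini_registry.get_builder_class("toy_caption")

Because the lookup key is a plain string, swapping one component for another only requires changing a name in the configuration, not the calling code.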


Installation
############
1. (Optional) Creating conda environment

.. code-block:: bash

   conda create -n lavis python=3.8
   conda activate lavis

2. Cloning and building from source

.. code-block:: bash

   git clone https://github.com/salesforce/LAVIS.git
   cd LAVIS
   pip install .

If you would like to develop on LAVIS, you may find it easier to install it in editable mode::

   pip install -e .



================================================
FILE: docs/make.bat
================================================
@ECHO OFF

pushd %~dp0

REM Command file for Sphinx documentation

if "%SPHINXBUILD%" == "" (
	set SPHINXBUILD=sphinx-build
)
set SOURCEDIR=source
set BUILDDIR=build

if "%1" == "" goto help

%SPHINXBUILD% >NUL 2>NUL
if errorlevel 9009 (
	echo.
	echo.The 'sphinx-build' command was not found. Make sure you have Sphinx
	echo.installed, then set the SPHINXBUILD environment variable to point
	echo.to the full path of the 'sphinx-build' executable. Alternatively you
	echo.may add the Sphinx directory to PATH.
	echo.
	echo.If you don't have Sphinx installed, grab it from
	echo.http://sphinx-doc.org/
	exit /b 1
)

%SPHINXBUILD% -M %1 %SOURCEDIR% %BUILDDIR% %SPHINXOPTS% %O%
goto end

:help
%SPHINXBUILD% -M help %SOURCEDIR% %BUILDDIR% %SPHINXOPTS% %O%

:end
popd


================================================
FILE: docs/requirements.txt
================================================
GitPython
ipykernel
nbsphinx==0.8.7
pandoc
sphinx
sphinx_autodoc_typehints
sphinx_rtd_theme

================================================
FILE: docs/tutorial.configs.rst
================================================
.. _config:

Training Models on Task Datasets (Commands and Configurations) 
#################################################################

LAVIS provides scripts to pre-train and finetune supported models on standard language-vision tasks, stored at ``lavis/run_scripts/``.
To replicate the experiments, just run these bash scripts. For example, to train the BLIP model on the image-text retrieval task with the MSCOCO dataset, we can run

.. code-block::

    bash run_scripts/blip/train/train_retrieval_coco.sh

Inside the scripts, we can see 

.. code-block:: bash

    python -m torch.distributed.run --nproc_per_node=8 train.py --cfg-path lavis/projects/blip/train/retrieval_coco_ft.yaml

where we start PyTorch distributed training on 8 GPUs (you may change this according to your own hardware setup). The ``--cfg-path`` argument specifies a `runtime configuration file`, which defines
the task, model, dataset and training recipes.
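
A runtime configuration file is layered on top of the default model and dataset configurations, with the more specific values taking priority. The sketch below illustrates that layering with plain dictionaries. ``merge_configs`` is a hypothetical helper and the configuration values are illustrative; the actual merging is handled by ``lavis.common.config`` and may differ in detail.

.. code-block:: python

    def merge_configs(defaults, overrides):
        """Recursively merge ``overrides`` into ``defaults`` and return a new dict."""
        merged = dict(defaults)
        for key, value in overrides.items():
            if isinstance(value, dict) and isinstance(merged.get(key), dict):
                # Both sides are nested sections: merge them key by key.
                merged[key] = merge_configs(merged[key], value)
            else:
                # Scalars, lists and new keys are taken from the override.
                merged[key] = value
        return merged


    # Illustrative defaults (assumed values, not copied from a LAVIS file).
    default_cfg = {
        "model": {"load_pretrained": True, "load_finetuned": False},
        "run": {"seed": 42, "weight_decay": 0.05, "amp": False},
    }

    # Illustrative runtime configuration that overrides a few defaults.
    runtime_cfg = {
        "model": {"load_finetuned": True},
        "run": {"evaluate": True, "amp": True},
    }

    cfg = merge_configs(default_cfg, runtime_cfg)

Keys that the runtime configuration does not mention (here ``seed`` and ``weight_decay``) keep their default values, while ``load_finetuned`` and ``amp`` are overridden.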

Available options and their descriptions are listed below.

.. LAVIS executes training and evaluation based on arguments specified in the configuration files. The default model and dataset configurations are defined in ``lavis/configs``. The task-specific configurations are defined in ``lavis/projects``. Task-specific configurations have higher priority over the default configurations.

.. The following tables provide explanations for the arguments in the configuration files.

.. list-table::
   :widths: 30 40
   :header-rows: 1

   * - Model Configurations
     - Functionalities
   * - arch
     - | name of the model from the model zoo
       | default: task-dependent
   * - model_type
     - | the type of the model (e.g., base)
       | default: task-dependent
   * - load_pretrained
     - | load pretrained weights
       | default: True (for finetuning task) | False (for pretraining task) 
   * - load_finetuned
     - | load task-specific finetuned weights
       | default: False (for finetuning task) | True (for evaluation) 
   * - pretrained 
     - | URL or local path which stores the pretrained model, defined in the default model configuration file
       | default: task-dependent 
   * - finetuned
     - | URL or local path which stores the finetuned model, defined in the default model configuration file
       | default: task-dependent

.. list-table::
   :widths: 30 50
   :header-rows: 1

   * - Dataset Configurations
     - Functionalities
   * - vis_processor
     - | pre-processing of visual input
       | default: task-dependent
   * - text_processor
     - | pre-processing of text input
       | default: task-dependent
   * - build_info
     - | dataset information including the storage location, defined in the default dataset configuration file
       | default: task-dependent

.. list-table::
   :widths: 30 50
   :header-rows: 1

   * - Runtime Configurations
     - Functionalities
   * - task
     - | name of the task
       | default: task-dependent
   * - lr_sched
     - | learning rate scheduler
       | default: linear_warmup_cosine_lr
   * - init_lr
     - | initial learning rate (after warmup)
       | default: task-dependent
   * - min_lr
     - | final learning rate after decay
       | default: task-dependent
   * - warmup_lr
     - | starting learning rate for warmup
       | default: init_lr (no warmup)
   * - lr_decay_rate
     - | learning rate decay per epoch for step_lr_schedule
       | default: 0.9
   * - warmup_steps
     - | number of steps for learning rate warmup
       | default: 0
   * - max_epoch
     - | total number of training epochs
       | default: task-dependent
   * - weight_decay
     - | weight decay coefficient for the optimizer
       | default: 0.05
   * - batch_size_train
     - | batch size during training
       | default: task-dependent
   * - batch_size_eval
     - | batch size during evaluation
       | default: task-dependent
   * - seed
     - | pseudo random number generator seed
       | default: 42
   * - output_dir
     - | directory to store logs, results and checkpoints
       | default: task-dependent
   * - resume_ckpt_path
     - | path of the checkpoint to resume training from
       | default: None
   * - evaluate
     - | only perform evaluation without training
       | default: False
   * - train_splits
     - | dataset splits used for training
       | default: ["train"]
   * - valid_splits
     - | dataset splits used for validation
       | default: ["val"]
   * - test_splits
     - | dataset splits used for test
       | default: ["test"]
   * - device
     - | use cpu or gpu (cuda)
       | default: cuda
   * - world_size
     - | number of processes participating in the job
       | default: 1
   * - dist_url
     - | URL specifying how to initialize the process group
       | default: "env://"
   * - distributed
     - | use distributed training
       | default: True
   * - amp
     - | use automatic mixed precision training
       | default: False
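
To make the roles of ``warmup_lr``, ``init_lr``, ``min_lr`` and ``warmup_steps`` concrete, here is a minimal sketch of a linear warmup followed by cosine decay. The two functions are hypothetical and written from the descriptions in the table above; in particular, applying the decay per epoch is an assumption, and the scheduler shipped in ``lavis.common.optims`` may differ in detail.

.. code-block:: python

    import math


    def warmup_lr_at(step, warmup_steps, warmup_lr, init_lr):
        """Linear warmup from ``warmup_lr`` to ``init_lr`` over ``warmup_steps`` steps."""
        if warmup_steps <= 0:
            # No warmup: start directly at the initial learning rate.
            return init_lr
        return min(init_lr, warmup_lr + (init_lr - warmup_lr) * step / warmup_steps)


    def cosine_lr_at(epoch, max_epoch, init_lr, min_lr):
        """Cosine decay from ``init_lr`` at epoch 0 to ``min_lr`` at ``max_epoch``."""
        cosine = 0.5 * (1.0 + math.cos(math.pi * epoch / max_epoch))
        return min_lr + (init_lr - min_lr) * cosine


    # Illustrative values only.
    lr_start = warmup_lr_at(step=0, warmup_steps=1000, warmup_lr=1e-6, init_lr=1e-5)
    lr_mid = cosine_lr_at(epoch=3, max_epoch=6, init_lr=1e-5, min_lr=0.0)

With these values the warmup starts at ``warmup_lr``, and halfway through training the cosine term has dropped the learning rate to the midpoint between ``init_lr`` and ``min_lr``.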

.. list-table::
   :widths: 40 50
   :header-rows: 1

   * - Text Generation Configurations
     - Functionalities
   * - max_len
     - | maximum number of text tokens to generate
       | default: 20 (for image captioning)
   * - min_len
     - | minimum number of text tokens to generate
       | default: 5 (for image captioning)
   * - num_beams
     - | number of beams to perform beam search
       | default: 3

.. list-table::
   :widths: 40 50
   :header-rows: 1

   * - Multimodal Retrieval Configurations
     - Functionalities
   * - negative_all_rank
     - | collect negatives from all processes for the image-text matching loss
       | default: True (for coco)
   * - k_test
     - | number of retrieval candidates ranked from contrastive similarity
       | default: 256 (for coco)
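
The ``k_test`` option limits how many candidates are kept after ranking by contrastive similarity. A minimal sketch of that selection step is shown below; ``top_k_candidates`` is a hypothetical helper, the similarity scores are made up, and what is done with the selected candidates afterwards is outside the scope of this sketch.

.. code-block:: python

    import heapq


    def top_k_candidates(similarities, k):
        """Return the indices of the ``k`` highest similarity scores, best first."""
        return heapq.nlargest(k, range(len(similarities)), key=similarities.__getitem__)


    # Made-up similarities between one image and five candidate texts.
    sim_row = [0.12, 0.87, 0.45, 0.91, 0.30]
    candidates = top_k_candidates(sim_row, k=3)

Here ``candidates`` holds the indices of the three most similar texts, ordered from most to least similar.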


================================================
FILE: docs/tutorial.datasets.rst
================================================
Adding Datasets
################################################

This is a tutorial on adding a new dataset using the ``lavis.datasets`` module.

The LAVIS library includes a standard dataset module, which allows customization to add new datasets. 
The ``lavis.datasets`` module is designed such that any new dataset class can be easily added and adapted from our code base, including creating dataset configuration, and defining and associating new dataset classes.

In this tutorial, we will replicate the steps to add a dataset class for the `Audio-Visual Scene-Aware Dialogue (AVSD) <https://arxiv.org/pdf/1901.09107.pdf>`_ benchmark for the video-grounded dialogue task.

Dataset Configuration ``lavis.configs.datasets``
**************************************************************

First, we define the basic configurations for this dataset, including a new dataset class ``avsd_dialogue``, dataset card, and data types. 
We can define any new dataset configuration in ``lavis.configs.datasets``. For instance, under this module, we can set up a configuration file ``avsd/defaults_dial.yaml`` as follows:  

.. code-block:: yaml

    datasets:
      avsd_dialogue: # name of the dataset builder
        dataset_card: dataset_card/avsd_dialogue.md # path to the dataset card 
        data_type: features # [images|videos|features] we use features in this case for extracted video features 

        build_info:
          # Do not put a minus sign (-) before a split name, otherwise YAML treats it as a list item
          annotations:
            train:
              url: /export/home/data/avsd/train_set4DSTC7-AVSD.json
              storage: avsd/annotations/train.json
            val:
              url: /export/home/data/avsd/valid_set4DSTC7-AVSD.json
              storage: avsd/annotations/val.json 
            test:
              url: /export/home/data/avsd/test_set4DSTC7-AVSD.json
              storage: avsd/annotations/test.json 
          features:
            storage: /export/home/data/avsd/features/ 


Dataset Card
===============
One optional step in setting up the dataset configuration is to define a dataset card, which contains more details about the dataset such as its description, tasks, and metrics.
For instance, we can define a dataset card for the AVSD benchmark in ``dataset_card/avsd_dialogue.md``.
Where a dataset can be downloaded automatically, its dataset card includes the command for auto-downloading the data (with Python code defined in ``lavis.datasets.download_scripts``) and storing it in a specific folder.
Otherwise, the dataset card should describe the download instructions from the original data source so that the dataset can be loaded properly.

One example of a dataset card for the AVSD benchmark is: 

.. code-block:: md

    ![Samples from the AVSD dataset (Image credit: "https://arxiv.org/pdf/1901.09107.pdf").](imgs/avsd_dialogue.png)(Samples from the AVSD dataset. Image credit: "https://arxiv.org/pdf/1901.09107.pdf")
    
    # Audio-Visual Scene-Aware Dialogues (AVSD) 
    
    ## Description
    [Audio-Visual Scene-Aware Dialogues (AVSD)](https://github.com/hudaAlamri/DSTC7-Audio-Visual-Scene-Aware-Dialog-AVSD-Challenge) contains more than 10,000 dialogues, each of which is grounded on a unique video. In the test split, for each test sample, 6 reference dialogue responses are provided. 
    
    
    ## Task
    
    (https://github.com/hudaAlamri/DSTC7-Audio-Visual-Scene-Aware-Dialog-AVSD-Challenge)
    
    In a **video-grounded dialogue task**, the system must generate responses to user input in the context of a given dialog.
    This context consists of a dialog history (previous utterances by both user and system) in addition to video and audio information that comprise the scene. The quality of a system’s automatically generated sentences is evaluated using objective measures to determine whether or not the generated responses are natural and informative
    
    ## Metrics
    Models are typically evaluated according to [BLEU](https://aclanthology.org/P02-1040/), [CIDER](https://www.cv-foundation.org/openaccess/content_cvpr_2015/papers/Vedantam_CIDEr_Consensus-Based_Image_2015_CVPR_paper.pdf), [METEOR](https://aclanthology.org/W05-0909/), and [ROUGE-L](https://aclanthology.org/W04-1013/) metrics. 
    
    ## Leaderboard
    
    ....
    
    
    ## Auto-Downloading
    
    Please refer to [benchmark website](https://github.com/hudaAlamri/DSTC7-Audio-Visual-Scene-Aware-Dialog-AVSD-Challenge) for instructions to download the dataset. 
    
    
    ## References
    "Audio Visual Scene-Aware Dialog", Huda Alamri, Vincent Cartillier, Abhishek Das, Jue Wang, Anoop Cherian, Irfan Essa, Dhruv Batra, Tim K. Marks, Chiori Hori, Peter Anderson, Stefan Lee, Devi Parikh

Visual Data Type
==============================
We currently limit the visual data types to one of three options: ``images``, ``videos``, and ``features``. 
"Images" and "videos" refer to the raw visual data, which is appropriate for models processing visual data in their original forms (e.g. ViT models). 
"Features" are visual representations extracted from pretrained models (e.g. CNN models). 
In this tutorial, the AVSD benchmark consists of video features extracted from 3D-CNN models. 

Build Info
==============================
Build info refers to the specific locations where data is stored and cached. 

For text annotations (e.g. captioning or dialogues), by default, we include three data splits, namely "train", "val", and "test", as commonly used in machine learning projects.
For each split, we specify two parameters: ``url`` and ``storage``.
``url`` can be either an online URL from which the dataset can be downloaded automatically (e.g. from *googleapis*), or a local path where the data has already been downloaded.
``storage`` is the location where the data is cached, which avoids downloading it repeatedly.

For visual data annotations, ensure the field name matches the data types defined earlier (i.e. one of ``images``, ``videos`` or ``features``).
As visual features are usually large and should be downloaded beforehand, we maintain only a ``storage`` parameter that points to where the visual data is cached.
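
The ``build_info`` section from the configuration above mixes relative ``storage`` entries (the annotations) with an absolute one (the features). The sketch below shows one plausible way to turn both kinds into concrete paths. ``resolve_storage`` and the ``cache_root`` value are assumptions made for illustration and are not taken from the LAVIS source.

.. code-block:: python

    import posixpath


    def resolve_storage(storage, cache_root):
        """Keep absolute paths unchanged; place relative paths under ``cache_root``."""
        if posixpath.isabs(storage):
            return storage
        return posixpath.join(cache_root, storage)


    # Same layout as the AVSD configuration shown earlier (train split only).
    build_info = {
        "annotations": {
            "train": {
                "url": "/export/home/data/avsd/train_set4DSTC7-AVSD.json",
                "storage": "avsd/annotations/train.json",
            },
        },
        "features": {"storage": "/export/home/data/avsd/features/"},
    }

    cache_root = "/path/to/cache"  # hypothetical cache location

    ann_path = resolve_storage(build_info["annotations"]["train"]["storage"], cache_root)
    feature_path = resolve_storage(build_info["features"]["storage"], cache_root)

Under these assumptions the annotation file ends up below the cache directory, while the pre-downloaded feature directory is used as given.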

Dataset ``lavis.datasets.datasets``
**************************************************************

Base Dataset ``lavis.datasets.datasets.base_dataset``
=======================================================
In this step, we want to define new dataset classes that inherit from our base dataset class in ``lavis.datasets.datasets.base_dataset``. This base dataset class already defines standard methods such as ``collater``, which uses the default collate function from PyTorch.

.. code-block:: python

    import json
    from typing import Iterable
    
    from torch.utils.data import Dataset, ConcatDataset
    from torch.utils.data.dataloader import default_collate
        
    class BaseDataset(Dataset):
        def __init__(
            self, vis_processor=None, text_processor=None, vis_root=None, ann_paths=[]
        ):
            """
            vis_root (string): Root directory of images (e.g. coco/images/)
            ann_paths (list of string): Paths to the annotation files
            """
            self.vis_root = vis_root
    
            self.annotation = []
            for ann_path in ann_paths:
                self.annotation.extend(json.load(open(ann_path, "r")))
    
            self.vis_processor = vis_processor
            self.text_processor = text_processor
    
            self._add_instance_ids()
    
        def __len__(self):
            return len(self.annotation)
    
        def collater(self, samples):
            return default_collate(samples)
    
        def set_processors(self, vis_processor, text_processor):
            self.vis_processor = vis_processor
            self.text_processor = text_processor
    
        def _add_instance_ids(self, key="instance_id"):
            for idx, ann in enumerate(self.annotation):
                ann[key] = str(idx)

Any dataset subclass inherits these methods and may optionally override them according to the specifications of the dataset.
We encourage users not to modify the base dataset class, as any modification will have cascading effects on every other dataset class that inherits from it.
Instead, users should create new dataset classes to cater to their specific requirements.

Dialogue Datasets ``lavis.datasets.datasets.dialogue_datasets``
======================================================================

For example, for the AVSD dataset, we want to define a new dataset subclass ``DialogueDataset`` for dialogue tasks. We can define this dataset class in ``lavis.datasets.datasets.dialogue_datasets`` as follows:

.. code-block:: python

    import os
    from collections import OrderedDict
        
    from lavis.datasets.datasets.base_dataset import BaseDataset
    
    import json 
    import copy 

    class DialogueDataset(BaseDataset):
        def __init__(self, vis_processor, text_processor, vis_root, ann_paths):
            """
            vis_processor: visual processor
            text_processor: textual processor
            vis_root (string): Root directory of the visual data (e.g. coco/images/)
            ann_paths (list of string): Paths to the annotation files
            """
                
            self.vis_root = vis_root
    
            self.annotation = []
            for ann_path in ann_paths:
                dialogs = json.load(open(ann_path, "r"))['dialogs']
                for dialog in dialogs: 
                    all_turns = dialog['dialog']
                    dialogue_context = [] 
                    for turn in all_turns: 
                        dialog_instance = copy.deepcopy(dialog)
                        question = turn['question']
                        answer = turn['answer'] 
                        
                        dialog_instance['dialog'] = copy.deepcopy(dialogue_context) 
                        dialog_instance['question'] = question
                        dialog_instance['answer'] = answer 
                        self.annotation.append(dialog_instance)
                        dialogue_context.append(turn)
                        
            self.vis_processor = vis_processor
            self.text_processor = text_processor
    
            self._add_instance_ids()
    
            self.img_ids = {}
            n = 0
            for ann in self.annotation:
                img_id = ann["image_id"]
                if img_id not in self.img_ids.keys():
                    self.img_ids[img_id] = n
                    n += 1

Class inheritance allows us to define multiple subclasses. For instance, we may want another dialogue dataset class that is used only for the test split. We can define a dataset class ``DialogueEvalDataset`` similar to the one above, but with the annotations processed differently.
Typically, in dialogue tasks, only a single test sample is constructed per dialogue at test time (rather than decomposing every dialogue turn into a sample, as is done at training time).
The dataset class can then be defined as:

.. code-block:: python

    class DialogueEvalDataset(BaseDataset):
        def __init__(self, vis_processor, text_processor, vis_root, ann_paths):
            # ...
            # defined similarly as DialogueDataset above 
            # except for the loading of dialogue annotation data            
    
            self.annotation = []
            for ann_path in ann_paths:
                dialogs = json.load(open(ann_path, "r"))['dialogs']
                for dialog in dialogs: 
                    all_turns = dialog['dialog']
                    dialogue_context = all_turns[:-1]
                    last_turn = all_turns[-1] 
                    
                    question = last_turn['question']
                    answer = last_turn['answer'] 
                        
                    dialog['dialog'] = dialogue_context
                    dialog['question'] = question
                    dialog['answer'] = answer
                                        
                    self.annotation.append(dialog)
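
The difference between the two annotation loops can be checked on a toy dialogue. The sketch below mirrors the logic of ``DialogueDataset`` and ``DialogueEvalDataset`` with plain functions; ``training_samples``, ``eval_sample`` and the toy data are hypothetical and exist only for illustration.

.. code-block:: python

    import copy

    # Toy dialogue with three turns (made-up content).
    toy_dialog = {
        "image_id": "video_0",
        "dialog": [
            {"question": "q1", "answer": "a1"},
            {"question": "q2", "answer": "a2"},
            {"question": "q3", "answer": "a3"},
        ],
    }


    def training_samples(dialog):
        """One sample per turn; the context holds all earlier turns."""
        samples, context = [], []
        for turn in dialog["dialog"]:
            sample = copy.deepcopy(dialog)
            sample["dialog"] = copy.deepcopy(context)
            sample["question"] = turn["question"]
            sample["answer"] = turn["answer"]
            samples.append(sample)
            context.append(turn)
        return samples


    def eval_sample(dialog):
        """A single sample; the last turn is the target, the rest is context."""
        sample = copy.deepcopy(dialog)
        turns = sample["dialog"]
        sample["dialog"] = turns[:-1]
        sample["question"] = turns[-1]["question"]
        sample["answer"] = turns[-1]["answer"]
        return sample


    train = training_samples(toy_dialog)
    test = eval_sample(toy_dialog)

For this three-turn dialogue, ``train`` contains three samples whose contexts hold zero, one and two earlier turns, while ``test`` is a single sample whose context holds the first two turns and whose target is the last turn.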


Using class inheritance to define datasets also allows us to develop more fine-grained class implementations, each of which is designed for a specific benchmark.
For instance, under the dialogue-based tasks, we can further define another dataset subclass that is specific to the AVSD dataset.
We can define a new class ``AVSDDialDataset`` that further specifies how to load individual samples and collate them according to specific requirements:

.. code-block:: python

    import os
    from lavis.datasets.datasets.base_dataset import BaseDataset
    from lavis.datasets.datasets.dialogue_datasets import DialogueDataset, DialogueEvalDataset
    
    import torch 
        
    class AVSDDialDataset(DialogueDataset):
        def __init__(self, vis_processor, text_processor, vis_root, ann_paths):

            super().__init__(vis_processor, text_processor, vis_root, ann_paths)
    
        def __getitem__(self, index):
    
            ann = self.annotation[index]
    
            vname = ann["image_id"]
    
            video = self.vis_processor(self.vis_root, vname)
            
            dialogue = self.text_processor(ann)
            
            return {
                "video_fts": video['video_fts'],
                "video_token_type_ids": video['token_type_ids'], 
                "input_ids": dialogue['input_ids'], 
                "token_type_ids": dialogue['token_type_ids'],
                "labels": dialogue['labels'], 
                "image_id": ann["image_id"],
                "instance_id": ann["instance_id"]
            }
        
        def collater(self, samples):
            
            input_ids, token_type_ids, labels, video_fts, video_token_type_ids = [], [], [], [], []
            
            for i in samples:
                input_ids.append(i['input_ids'])
                token_type_ids.append(i['token_type_ids'])
                labels.append(i['labels'])
                video_fts.append(i['video_fts'])
                video_token_type_ids.append(i['video_token_type_ids'])
    
            input_ids = self.text_processor.padding(input_ids)
            
            labels = self.text_processor.padding(labels, -1)
            video_fts = self.vis_processor.padding(video_fts)
            
            token_type_ids = self.text_processor.padding(token_type_ids)
            video_token_type_ids = self.text_processor.padding(video_token_type_ids)
            token_type_ids = torch.cat([video_token_type_ids, token_type_ids], dim=1)
            
            attn_mask = self.text_processor.get_attention_mask(input_ids)
            video_mask = self.vis_processor.get_attention_mask(video_fts)
            attn_mask = torch.cat([video_mask, attn_mask], dim=1)
            
            video_labels = torch.ones((video_fts.size(0), video_fts.size(1))).long() * -1 # -1 is the ignored label index by default

            labels = torch.cat([video_labels, labels], dim=1)
            
            samples = {}
            samples['input_ids'] = input_ids
            samples['token_type_ids'] = token_type_ids
            samples['labels'] = labels
            samples['video_fts'] = video_fts
            samples['attn_mask'] = attn_mask
            
            return samples  

Note that in a dataset subclass, if methods such as ``__getitem__`` and ``collater`` are not defined, the same functions from the corresponding superclass will be used. 
For instance, by default, we always use the collater from the ``BaseDataset`` class to collate data samples. 
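
The custom ``collater`` above relies on ``padding`` and ``get_attention_mask`` methods of the processors, which are not shown in this tutorial. The sketch below illustrates the underlying idea on plain Python lists. ``pad_sequences`` and ``attention_mask`` are hypothetical stand-ins, and computing the mask from the unpadded lengths is an assumption made to keep the example short.

.. code-block:: python

    def pad_sequences(sequences, pad_value=0):
        """Right-pad every sequence to the length of the longest one."""
        max_len = max(len(seq) for seq in sequences)
        return [list(seq) + [pad_value] * (max_len - len(seq)) for seq in sequences]


    def attention_mask(sequences):
        """1 marks a real token, 0 marks a position that padding would fill."""
        max_len = max(len(seq) for seq in sequences)
        return [[1] * len(seq) + [0] * (max_len - len(seq)) for seq in sequences]


    # Two made-up token sequences of different lengths.
    input_ids = [[5, 6, 7], [8, 9]]
    labels = [[6, 7, 2], [9, 2]]

    padded_ids = pad_sequences(input_ids)
    padded_labels = pad_sequences(labels, pad_value=-1)  # -1 marks ignored label positions
    mask = attention_mask(input_ids)

Padding labels with ``-1`` follows the same convention as the ``collater`` above, where ``-1`` is used for positions that should not contribute to the loss.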

Dataset Builder ``lavis.datasets.builders``
**************************************************************
Dataset Builder is the data processing module that controls the dataset classes (by training or evaluation split) and associates the specific dataset configurations to these dataset classes. 

Base Dataset Builder ``lavis.datasets.builders.base_dataset_builder``
======================================================================

Note that any new builder class definition should inherit the base dataset builder class ``lavis.datasets.builders.base_dataset_builder``:

.. code-block:: python

    class BaseDatasetBuilder:
        train_dataset_cls, eval_dataset_cls = None, None
        ...

This allows us to standardize the operations of dataset builders across all builder classes. We advise users to carefully review the standard methods defined in the base builder class, including ``_download_data`` and ``build_datasets``, which download the data and create instances of the dataset classes:

.. code-block:: python

    class BaseDatasetBuilder:
        ...

        def build_datasets(self):
            # download, split, etc...
            # only called on 1 GPU/TPU in distributed
    
            if is_main_process():
                self._download_data()
    
            if is_dist_avail_and_initialized():
                dist.barrier()
    
            # at this point, all the annotations and image/videos should be all downloaded to the specified locations.
            logging.info("Building datasets...")
            datasets = self.build()  # dataset['train'/'val'/'test']
            
            return datasets
    
        def _download_data(self):
            self._download_ann()
            self._download_vis()
    
We encourage users not to modify the implementation of the base dataset builder class as this will affect all existing dataset builder subclasses.

Dialogue Dataset Builder ``lavis.datasets.builders.dialogue_builder``
======================================================================
We can define any new builder subclass and associate this builder with the corresponding dataset classes and dataset configurations. 
For instance, for the AVSD dataset, we can define a builder ``lavis.datasets.builders.dialogue_builder`` for dialogue-based datasets as follows: 

.. code-block:: python

    from lavis.datasets.builders.base_dataset_builder import BaseDatasetBuilder
    from lavis.datasets.datasets.avsd_dialogue_datasets import (
        AVSDDialDataset, 
        AVSDDialEvalDataset 
    )
    
    from lavis.common.registry import registry
    
    
    @registry.register_builder("avsd_dialogue")
    class AVSDDialBuilder(BaseDatasetBuilder):
        train_dataset_cls = AVSDDialDataset 
        eval_dataset_cls = AVSDDialEvalDataset 
    
        DATASET_CONFIG_DICT = {
            "default": "configs/datasets/avsd/defaults_dial.yaml"
        }

Note that we chose to define the parameters ``train_dataset_cls`` and ``eval_dataset_cls`` separately to handle cases where data is processed differently at training and test time.
For instance, in captioning tasks, each data sample at test time often includes multiple ground-truth captions, rather than the single ground-truth caption used at training time.
If the data processing is the same at training and test time, the two parameters can point to the same dataset class.

Finally, define ``DATASET_CONFIG_DICT`` to associate the dataset configurations to the assigned dataset classes. 
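
How a builder might use ``train_dataset_cls`` and ``eval_dataset_cls`` can be sketched without any LAVIS dependency. All classes below are hypothetical toys, and selecting the class by comparing the split name with ``"train"`` is an assumption about the general pattern rather than a copy of ``BaseDatasetBuilder``.

.. code-block:: python

    class ToyTrainDataset:
        def __init__(self, ann_paths):
            self.ann_paths = ann_paths


    class ToyEvalDataset(ToyTrainDataset):
        pass


    class ToyBuilder:
        """Hypothetical builder that picks a dataset class per split."""

        train_dataset_cls = ToyTrainDataset
        eval_dataset_cls = ToyEvalDataset

        def __init__(self, annotations):
            # ``annotations`` maps a split name to an annotation path.
            self.annotations = annotations

        def build(self):
            datasets = {}
            for split, path in self.annotations.items():
                is_train = split == "train"
                dataset_cls = self.train_dataset_cls if is_train else self.eval_dataset_cls
                datasets[split] = dataset_cls(ann_paths=[path])
            return datasets


    builder = ToyBuilder(
        {
            "train": "avsd/annotations/train.json",
            "val": "avsd/annotations/val.json",
            "test": "avsd/annotations/test.json",
        }
    )
    datasets = builder.build()

In this sketch only the ``train`` split is built with the training class; ``val`` and ``test`` both use the evaluation class.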

Registering Builder ``lavis.datasets.builders.__init__``
======================================================================

To add a new builder class, first make sure to include the class in ``__init__.py``. For instance, to add a new builder for the AVSD dataset:

.. code-block:: python

    from lavis.datasets.builders.dialogue_builder import (
        AVSDDialBuilder
    )
    
    __all__ = [
        ...,
        "AVSDDialBuilder"
    ]

Assigning Builder 
======================================================================
Note that during data loading and processing, the assigned builder must be registered under the correct name so that it can be loaded properly.
For instance, the following should be specified in a configuration file, e.g. ``dialogue_avsd_ft.yaml``:

.. code-block:: yaml

    datasets:
      avsd_dialogue: # name of the dataset builder
        ...
        # processor configuration 
        ...

Subsequently, any process (e.g. training) should load this configuration file to assign the correct builder, which then associates the correct dataset classes to construct data samples.

.. code-block:: sh

    python train.py --cfg-path dialogue_avsd_ft.yaml


================================================
FILE: docs/tutorial.evaluation.rst
================================================
Evaluating Pre-trained Models on Task Datasets
###############################################
LAVIS provides pre-trained and finetuned models for off-the-shelf evaluation on task datasets.
Let's now see an example of evaluating the BLIP model on the captioning task, using the MSCOCO dataset.

.. _prep coco:

Preparing Datasets
******************
First, let's download the dataset. LAVIS provides `automatic downloading scripts` to help prepare
most of the public datasets. To download the MSCOCO dataset, simply run

.. code-block:: bash

    cd lavis/datasets/download_scripts && python download_coco.py

This will put the downloaded dataset at a default cache location ``cache`` used by LAVIS.

If you want to use a different cache location, you can specify it by updating ``cache_root`` in ``lavis/configs/default.yaml``.

If you have a local copy of the dataset, it is recommended to create a symlink from the cache location to the local copy, e.g.

.. code-block:: bash

    ln -s /path/to/local/coco cache/coco

Evaluating pre-trained models
******************************

To evaluate a pre-trained model, simply run

.. code-block:: bash

    bash run_scripts/blip/eval/eval_coco_cap.sh

Or to evaluate a large model:

.. code-block:: bash

    bash run_scripts/blip/eval/eval_coco_cap_large.sh


================================================
FILE: docs/tutorial.models.rst
================================================
Adding Models
####################################

This is a tutorial on adding new models using the ``lavis.models`` module.

The LAVIS library includes a standard model module that builds the foundation for many major language-vision models such as `ALBEF <https://arxiv.org/pdf/2107.07651.pdf>`_,
`BLIP <https://arxiv.org/pdf/2201.12086.pdf>`_, `ALPRO <https://arxiv.org/pdf/2112.09583.pdf>`_, and `CLIP <https://arxiv.org/pdf/2103.00020.pdf>`_. 
The ``lavis.models`` module is designed such that any new models can be added and integrated into the LAVIS library, with minimal steps to develop training and testing procedures. 
In this tutorial, we will replicate the steps to add a GPT-style model specifically for `video-grounded dialogue tasks <https://arxiv.org/pdf/1901.09107.pdf>`_. 

Base Model ``lavis.models.base_model``
**************************************************************

Note that any new model definition should inherit the base model class ``BaseModel``:

.. code-block:: python

    from omegaconf import OmegaConf
    
    import numpy as np
    
    import torch
    import torch.nn as nn
    
    from lavis.common.utils import get_abs_path
    
    class BaseModel(nn.Module):
        """Base class for models."""
    
        def __init__(self):
            super().__init__()
    
        def forward_features(self, *args, **kwargs):
            """Similar to *forward* but only return features."""
            raise NotImplementedError
    
        def load_from_pretrained(self, url_or_filename):
            raise NotImplementedError
    
        @classmethod
        def _from_config(cls, cfg=None, model_type="base"):
            if not cfg:
                # useful when building model without a provided configuration file
                cfg = OmegaConf.load(cls.default_config_path(model_type)).model
    
            return cls.from_config(cfg)
    
        @classmethod
        def from_pretrained(cls, model_type="base"):
            """
            Build a pretrained model from the default configuration file, specified by model_type.
            """
            return cls._from_config(cfg=None, model_type=model_type)
    
        @property
        def device(self):
            return list(self.parameters())[0].device
    
        @classmethod
        def default_config_path(cls, model_type="base"):
            assert (
                model_type in cls.PRETRAINED_MODEL_CONFIG_DICT
            ), "Unknown model type {}".format(model_type)
            return get_abs_path(cls.PRETRAINED_MODEL_CONFIG_DICT[model_type])
    
        def before_evaluation(self, **kwargs):
            pass
    
        def show_n_params(self, return_str=True):
            tot = 0
            for p in self.parameters():
                w = 1
                for x in p.shape:
                    w *= x
                tot += w
            if return_str:
                if tot >= 1e6:
                    return "{:.1f}M".format(tot / 1e6)
                else:
                    return "{:.1f}K".format(tot / 1e3)
            else:
                return tot


In this base model, we already declare and standardize many common methods such as ``_from_config`` and ``from_pretrained``.
Inheriting this base model class allows us to standardize operations of models across all model classes while still allowing customizations. 
We advise users not to change the implementation of the base model class as this will affect all existing model subclasses.
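
To make the configuration flow concrete, here is a minimal sketch of how ``_from_config`` falls back to a default configuration and then hands over to a subclass-defined ``from_config``. It is a toy stand-in, not the LAVIS implementation: ``ToyBaseModel``, ``ToyCaptioner``, ``default_config`` and the ``hidden_size`` setting are hypothetical, and plain dictionaries replace the YAML files that ``default_config_path`` would point to.

.. code-block:: python

    class ToyBaseModel:
        """Stand-in for BaseModel that keeps only the config-related classmethods."""

        # Subclasses map a model type to a default configuration.
        PRETRAINED_MODEL_CONFIG_DICT = {}

        @classmethod
        def default_config(cls, model_type="base"):
            assert (
                model_type in cls.PRETRAINED_MODEL_CONFIG_DICT
            ), "Unknown model type {}".format(model_type)
            return cls.PRETRAINED_MODEL_CONFIG_DICT[model_type]

        @classmethod
        def _from_config(cls, cfg=None, model_type="base"):
            if not cfg:
                # No configuration given: fall back to the default for this type.
                cfg = cls.default_config(model_type)
            return cls.from_config(cfg)


    class ToyCaptioner(ToyBaseModel):
        # Hypothetical defaults; real LAVIS models point to YAML files instead.
        PRETRAINED_MODEL_CONFIG_DICT = {
            "base": {"hidden_size": 768},
            "large": {"hidden_size": 1024},
        }

        def __init__(self, hidden_size):
            self.hidden_size = hidden_size

        @classmethod
        def from_config(cls, cfg):
            # Each subclass decides how a configuration maps to constructor arguments.
            return cls(hidden_size=cfg["hidden_size"])


    default_model = ToyCaptioner._from_config(model_type="large")
    custom_model = ToyCaptioner._from_config(cfg={"hidden_size": 512})
    print(default_model.hidden_size, custom_model.hidden_size)

Keeping the lookup in the base class and the construction in the subclass is what lets every model share the same entry point while still reading its own settings.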

GPT-style Video-grounded Dialogue Model ``lavis.models.gpt_models.gpt_dialogue``
********************************************************************************

In this step, we can define a new model class, e.g. under ``lavis.models.gpt_models.gpt_dialogue``, for GPT-based dialogue models designed specifically for video-grounded dialogues. 
Note that we assume the model class inherits from the standard model super class ``GPT2LMHeadModel`` from the ``transformers`` `library <https://huggingface.co/docs/transformers/index>`_.
We also enforce model integration to the LAVIS framework through the inheritance of the ``BaseModel`` from the LAVIS library, as the secondary super class.

.. code-block:: python

    import math

    import torch
    import torch.nn as nn
    from torch.nn import CrossEntropyLoss, MSELoss
    from transformers import GPT2Model, GPT2LMHeadModel
    from transformers.modeling_outputs import CausalLMOutputWithCrossAttentions

    from lavis.common.registry import registry
    from lavis.models.base_model import BaseModel
        
    @registry.register_model("gpt_dialogue")
    class GPTDialogue(GPT2LMHeadModel, BaseModel):
        ...
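
Because both ``GPT2LMHeadModel`` and ``BaseModel`` define a ``from_pretrained`` classmethod, the order of the base classes matters: Python's method resolution order picks the first base that provides a given name. The sketch below illustrates this with toy classes only; ``HFStyleModel``, ``LavisStyleBase`` and ``ToyDialogue`` are hypothetical stand-ins, not the real classes.

.. code-block:: python

    class HFStyleModel:
        """Stand-in for a ``transformers`` model class such as GPT2LMHeadModel."""

        @classmethod
        def from_pretrained(cls, name, **kwargs):
            return "HFStyleModel.from_pretrained({!r})".format(name)


    class LavisStyleBase:
        """Stand-in for the LAVIS BaseModel."""

        @classmethod
        def from_pretrained(cls, model_type="base"):
            return "LavisStyleBase.from_pretrained({!r})".format(model_type)

        def before_evaluation(self, **kwargs):
            return "before_evaluation from LavisStyleBase"


    class ToyDialogue(HFStyleModel, LavisStyleBase):
        """The first listed base wins when both bases define the same name."""


    mro_names = [klass.__name__ for klass in ToyDialogue.__mro__]
    print(mro_names)
    # Resolved on the primary base, since it comes first in the MRO.
    print(ToyDialogue.from_pretrained("gpt2"))
    # Only the secondary base defines this method, so it is found there.
    print(ToyDialogue().before_evaluation())

Under this ordering, a call to ``from_pretrained`` on the combined class follows the primary (``transformers``-style) signature, while methods that exist only on the secondary base remain available.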
 
Next, we can modify the architecture of the model during model initialization to fit the tasks of interest, i.e. video-grounded dialogues. 
In this case, we want to add additional model parameters for a linear network to transform the video feature representations to the model dimension. 

.. code-block:: python

    class GPTDialogue(GPT2LMHeadModel, BaseModel):

        def __init__(self, config, len_video_ft=4224):
            
            super().__init__(config)
            
            self.video_ff = nn.Linear(len_video_ft, config.n_embd)
       
            # Model parallel
            self.model_parallel = False
            self.device_map = None
    
            # Initialize weights and apply final processing
            self.post_init()
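
As a rough check of what this extra layer costs, the number of parameters it adds can be worked out from the layer sizes alone. The sketch below assumes the default ``len_video_ft=4224`` and ``n_embd=768`` (the hidden size of the base GPT-2 model); ``linear_param_count`` and ``format_param_count`` are hypothetical helpers, with the formatting copied from ``BaseModel.show_n_params``.

.. code-block:: python

    def linear_param_count(in_features, out_features, bias=True):
        """Number of parameters in a fully connected layer of the given size."""
        return in_features * out_features + (out_features if bias else 0)


    def format_param_count(total):
        """Format a count the same way as ``BaseModel.show_n_params``."""
        if total >= 1e6:
            return "{:.1f}M".format(total / 1e6)
        return "{:.1f}K".format(total / 1e3)


    # Weight matrix of shape (768, 4224) plus a bias vector of length 768.
    added = linear_param_count(4224, 768)
    print(added, format_param_count(added))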
    
Note that for each new model class, we advise defining a ``from_config`` method; the ``_from_config`` method inherited from the ``BaseModel`` class relies on it to create the model.
As each model usually has its own unique configurations, defining this method per class ensures that model instances are created properly.
For instance, ``GPTDialogue`` requires an additional parameter of video feature length (``len_video_ft``) which should be part of the model initialization procedure. 
Another additional parameter is the number of tokens/words (as we include additional special tokens in the vocabulary for dialogue tasks). 

.. code-block:: python

    class GPTDialogue(GPT2LMHeadModel, BaseModel):
        ...
        @classmethod
        def from_config(cls, cfg):
            model = cls.from_pretrained('gpt2', len_video_ft=cfg['len_video_ft']) 
            model.resize_token_embeddings(cfg['len_tokenizer'])
            return model
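
The two values read from ``cfg`` can be assembled as in the following sketch. It assumes the pretrained ``gpt2`` vocabulary of 50257 tokens and that none of the added tokens already exists in that vocabulary; the special token names and ``build_model_cfg`` are hypothetical and only serve as an illustration.

.. code-block:: python

    GPT2_VOCAB_SIZE = 50257  # vocabulary size of the pretrained "gpt2" tokenizer

    # Hypothetical special tokens for dialogue; the actual set depends on the dataset setup.
    SPECIAL_TOKENS = ["<bos>", "<eos>", "<speaker1>", "<speaker2>", "<video>", "<pad>"]


    def build_model_cfg(len_video_ft, base_vocab_size, special_tokens):
        """Collect the two values that ``from_config`` reads from ``cfg``."""
        # Count each distinct new token once.
        n_new_tokens = len(set(special_tokens))
        return {
            "len_video_ft": len_video_ft,
            "len_tokenizer": base_vocab_size + n_new_tokens,
        }


    cfg = build_model_cfg(4224, GPT2_VOCAB_SIZE, SPECIAL_TOKENS)
    print(cfg)

The resulting ``len_tokenizer`` is the size passed to ``resize_token_embeddings``, so that the embedding matrix has one row per token, including the new special tokens.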

Other basic methods should also be defined explicitly in the new model class, including the ``forward`` function. 
For instance, in GPT models for video-grounded dialogue tasks, we want the forward operation to also include the transformation and integration of video features before passing the representations to the Transformer layers.

.. code-block:: python

    class GPTDialogue(GPT2LMHeadModel, BaseModel):
        ...

        def forward(
            self,
            samples,
            past_key_values=None,
            position_ids=None,
            head_mask=None,
            encoder_hidden_states=None,
            encoder_attention_mask=None,
            use_cache=None,
            output_attentions=None,
            output_hidden_states=None,
            return_dict=None,
        ):
            # Embed the text tokens and project the video features to the model dimension.
            input_embs = self.transformer.wte(samples['input_ids'])
            video_embs = self.video_ff(samples['video_fts'])
            # Prepend the video embeddings to the text embeddings along the sequence axis.
            input_embs = torch.cat([video_embs, input_embs], dim=1)

            transformer_outputs = self.transformer(
                attention_mask=samples['attn_mask'],
                token_type_ids=samples['token_type_ids'],
                inputs_embeds=input_embs,
                position_ids=position_ids,
                head_mask=head_mask,
                encoder_hidden_states=encoder_hidden_states,
                encoder_attention_mask=encoder_attention_mask,
                use_cache=use_cache,
                output_attentions=output_attentions,
                output_hidden_states=output_hidden_states,
                return_dict=return_dict,
            )
            hidden_states = transformer_outputs[0]

            lm_logits = self.lm_head(hidden_states)
            ...
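
The shapes involved in the first three lines of ``forward`` can be checked in isolation. The following is a small sketch under stated assumptions: a toy vocabulary of 100 tokens, 3 video frames and 5 text tokens per sample, and standalone ``nn.Embedding`` / ``nn.Linear`` layers standing in for ``self.transformer.wte`` and ``self.video_ff``.

.. code-block:: python

    import torch
    import torch.nn as nn

    batch_size, n_frames, n_tokens = 2, 3, 5
    len_video_ft, n_embd, vocab_size = 4224, 768, 100  # toy vocabulary size

    wte = nn.Embedding(vocab_size, n_embd)      # stand-in for self.transformer.wte
    video_ff = nn.Linear(len_video_ft, n_embd)  # same role as self.video_ff

    samples = {
        "input_ids": torch.randint(0, vocab_size, (batch_size, n_tokens)),
        "video_fts": torch.randn(batch_size, n_frames, len_video_ft),
    }

    input_embs = wte(samples["input_ids"])       # (batch, n_tokens, n_embd)
    video_embs = video_ff(samples["video_fts"])  # (batch, n_frames, n_embd)
    input_embs = torch.cat([video_embs, input_embs], dim=1)

    # A mask matching the combined sequence covers video and text positions.
    attn_mask = torch.ones(batch_size, n_frames + n_tokens, dtype=torch.long)
    print(input_embs.shape, attn_mask.shape)

Since the video embeddings are prepended along the sequence axis, ``attn_mask`` and ``token_type_ids`` in ``samples`` need to cover the video positions as well as the text tokens.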

Registering New Model ``lavis.models.__init__``
********************************************************************************

Any new model must be officially registered as part of the ``lavis.models`` module. 
For instance, to add a model class for GPT-based dialogue models, we can modify the ``__init__.py`` as follows:

.. code-block:: python

    from lavis.models.gpt_models.gpt_dialogue import GPTDialogue
    
    __all__ = [
        ...
        "GPTDialogue"
    ]

Assigning Model
********************************************************************************

From the above example of a model class, note that we define a ``from_config`` method for the new model class.
This method processes a configuration file and passes the relevant parameters to initialize the model class properly.
To do this, we specify the registry name of the model class in a configuration file.
For instance, the following should be specified in a configuration file e.g. ``dialogue_avsd_ft.yaml``:

.. code-block:: yaml

    model:
      arch: gpt_dialogue # name of the model 
      model_type: base


Subsequently, any processes (e.g. training) should load this configuration file to assign the correct model.

.. code-block:: sh

    python train.py --cfg-path dialogue_avsd_ft.yaml
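
How a registry name in the configuration can lead to a model instance is sketched below. This is a toy illustration, not the LAVIS registry: ``MODEL_REGISTRY``, ``register_model``, ``ToyDialogueModel``, ``DEFAULTS`` and ``build_model`` are all hypothetical, and a plain dictionary replaces the parsed YAML file.

.. code-block:: python

    MODEL_REGISTRY = {}  # hypothetical stand-in for the LAVIS registry


    def register_model(name):
        """Class decorator that records a model class under ``name``."""

        def wrap(model_cls):
            if name in MODEL_REGISTRY:
                raise KeyError("Model '{}' is already registered".format(name))
            MODEL_REGISTRY[name] = model_cls
            return model_cls

        return wrap


    @register_model("toy_dialogue")
    class ToyDialogueModel:
        # Hypothetical per-type defaults; LAVIS keeps these in separate YAML files.
        DEFAULTS = {"base": {"len_video_ft": 4224}}

        def __init__(self, len_video_ft):
            self.len_video_ft = len_video_ft

        @classmethod
        def from_config(cls, cfg):
            return cls(len_video_ft=cfg["len_video_ft"])


    def build_model(config):
        """Resolve ``arch`` to a class and ``model_type`` to its settings."""
        model_cfg = config["model"]
        model_cls = MODEL_REGISTRY[model_cfg["arch"]]
        settings = model_cls.DEFAULTS[model_cfg["model_type"]]
        return model_cls.from_config(settings)


    model = build_model({"model": {"arch": "toy_dialogue", "model_type": "base"}})
    print(type(model).__name__, model.len_video_ft)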

Note that to simplify the model configuration, we only enable two main parameters here: ``arch`` and ``model_type``. ``arch`` refers to the model class registry, and ``model_type`` is the corresponding model type under this model family.
For instance, with ``gpt_dialogue``, we have a model ``base`` which has its own configuration in a separate configuration file e.g. ``gp

gitextract__ei5npya/

├── .github/
│   └── workflows/
│       └── docs.yaml
├── .gitignore
├── .pre-commit-config.yaml
├── CODEOWNERS
├── CODE_OF_CONDUCT.md
├── LICENSE.txt
├── MANIFEST.in
├── README.md
├── SECURITY.md
├── app/
│   ├── __init__.py
│   ├── calculate_coco_features.py
│   ├── caption.py
│   ├── classification.py
│   ├── dataset_browser.py
│   ├── image_text_match.py
│   ├── main.py
│   ├── multimodal_search.py
│   ├── multipage.py
│   ├── text_localization.py
│   ├── utils.py
│   └── vqa.py
├── dataset_card/
│   ├── avsd_dialogue.md
│   ├── coco_caption.md
│   ├── coco_retrieval.md
│   ├── conceptual_captions.md
│   ├── didemo_retrieval.md
│   ├── flickr_retrieval.md
│   ├── gqa.md
│   ├── msrvtt_qa.md
│   ├── msrvtt_retrieval.md
│   ├── msvd_qa.md
│   ├── nlvr2.md
│   ├── nocaps.md
│   ├── sbu_caption.md
│   ├── snli_visual_entailment.md
│   └── vqav2.md
├── docs/
│   ├── Makefile
│   ├── benchmark.rst
│   ├── build_docs.sh
│   ├── conf.py
│   ├── getting_started.rst
│   ├── index.rst
│   ├── intro.rst
│   ├── make.bat
│   ├── requirements.txt
│   ├── tutorial.configs.rst
│   ├── tutorial.datasets.rst
│   ├── tutorial.evaluation.rst
│   ├── tutorial.models.rst
│   ├── tutorial.processors.rst
│   ├── tutorial.rst
│   ├── tutorial.tasks.rst
│   └── tutorial.training-example.rst
├── evaluate.py
├── examples/
│   ├── albef_feature_extraction.ipynb
│   ├── albef_vqa.ipynb
│   ├── albef_zero_shot_classification.ipynb
│   ├── blip2_feature_extraction.ipynb
│   ├── blip2_image_text_matching.ipynb
│   ├── blip2_instructed_generation.ipynb
│   ├── blip_feature_extraction.ipynb
│   ├── blip_image_captioning.ipynb
│   ├── blip_image_text_matching.ipynb
│   ├── blip_text_localization.ipynb
│   ├── blip_vqa.ipynb
│   ├── blip_zero_shot_classification.ipynb
│   ├── clip_feature_extraction.ipynb
│   └── clip_zero_shot_classification.ipynb
├── lavis/
│   ├── __init__.py
│   ├── common/
│   │   ├── annotator/
│   │   │   ├── canny/
│   │   │   │   └── __init__.py
│   │   │   ├── ckpts/
│   │   │   │   └── download.sh
│   │   │   ├── hed/
│   │   │   │   └── __init__.py
│   │   │   ├── midas/
│   │   │   │   ├── __init__.py
│   │   │   │   ├── api.py
│   │   │   │   ├── midas/
│   │   │   │   │   ├── __init__.py
│   │   │   │   │   ├── base_model.py
│   │   │   │   │   ├── blocks.py
│   │   │   │   │   ├── dpt_depth.py
│   │   │   │   │   ├── midas_net.py
│   │   │   │   │   ├── midas_net_custom.py
│   │   │   │   │   ├── transforms.py
│   │   │   │   │   └── vit.py
│   │   │   │   └── utils.py
│   │   │   ├── mlsd/
│   │   │   │   ├── __init__.py
│   │   │   │   ├── models/
│   │   │   │   │   ├── mbv2_mlsd_large.py
│   │   │   │   │   └── mbv2_mlsd_tiny.py
│   │   │   │   └── utils.py
│   │   │   ├── openpose/
│   │   │   │   ├── __init__.py
│   │   │   │   ├── body.py
│   │   │   │   ├── hand.py
│   │   │   │   ├── model.py
│   │   │   │   └── util.py
│   │   │   ├── uniformer/
│   │   │   │   ├── __init__.py
│   │   │   │   ├── configs/
│   │   │   │   │   └── _base_/
│   │   │   │   │       ├── datasets/
│   │   │   │   │       │   ├── ade20k.py
│   │   │   │   │       │   ├── chase_db1.py
│   │   │   │   │       │   ├── cityscapes.py
│   │   │   │   │       │   ├── cityscapes_769x769.py
│   │   │   │   │       │   ├── drive.py
│   │   │   │   │       │   ├── hrf.py
│   │   │   │   │       │   ├── pascal_context.py
│   │   │   │   │       │   ├── pascal_context_59.py
│   │   │   │   │       │   ├── pascal_voc12.py
│   │   │   │   │       │   ├── pascal_voc12_aug.py
│   │   │   │   │       │   └── stare.py
│   │   │   │   │       ├── default_runtime.py
│   │   │   │   │       ├── models/
│   │   │   │   │       │   ├── ann_r50-d8.py
│   │   │   │   │       │   ├── apcnet_r50-d8.py
│   │   │   │   │       │   ├── ccnet_r50-d8.py
│   │   │   │   │       │   ├── cgnet.py
│   │   │   │   │       │   ├── danet_r50-d8.py
│   │   │   │   │       │   ├── deeplabv3_r50-d8.py
│   │   │   │   │       │   ├── deeplabv3_unet_s5-d16.py
│   │   │   │   │       │   ├── deeplabv3plus_r50-d8.py
│   │   │   │   │       │   ├── dmnet_r50-d8.py
│   │   │   │   │       │   ├── dnl_r50-d8.py
│   │   │   │   │       │   ├── emanet_r50-d8.py
│   │   │   │   │       │   ├── encnet_r50-d8.py
│   │   │   │   │       │   ├── fast_scnn.py
│   │   │   │   │       │   ├── fcn_hr18.py
│   │   │   │   │       │   ├── fcn_r50-d8.py
│   │   │   │   │       │   ├── fcn_unet_s5-d16.py
│   │   │   │   │       │   ├── fpn_r50.py
│   │   │   │   │       │   ├── fpn_uniformer.py
│   │   │   │   │       │   ├── gcnet_r50-d8.py
│   │   │   │   │       │   ├── lraspp_m-v3-d8.py
│   │   │   │   │       │   ├── nonlocal_r50-d8.py
│   │   │   │   │       │   ├── ocrnet_hr18.py
│   │   │   │   │       │   ├── ocrnet_r50-d8.py
│   │   │   │   │       │   ├── pointrend_r50.py
│   │   │   │   │       │   ├── psanet_r50-d8.py
│   │   │   │   │       │   ├── pspnet_r50-d8.py
│   │   │   │   │       │   ├── pspnet_unet_s5-d16.py
│   │   │   │   │       │   ├── upernet_r50.py
│   │   │   │   │       │   └── upernet_uniformer.py
│   │   │   │   │       └── schedules/
│   │   │   │   │           ├── schedule_160k.py
│   │   │   │   │           ├── schedule_20k.py
│   │   │   │   │           ├── schedule_40k.py
│   │   │   │   │           └── schedule_80k.py
│   │   │   │   ├── exp/
│   │   │   │   │   └── upernet_global_small/
│   │   │   │   │       ├── config.py
│   │   │   │   │       ├── run.sh
│   │   │   │   │       ├── test.sh
│   │   │   │   │       ├── test_config_g.py
│   │   │   │   │       ├── test_config_h32.py
│   │   │   │   │       └── test_config_w32.py
│   │   │   │   ├── mmcv/
│   │   │   │   │   ├── __init__.py
│   │   │   │   │   ├── arraymisc/
│   │   │   │   │   │   ├── __init__.py
│   │   │   │   │   │   └── quantization.py
│   │   │   │   │   ├── cnn/
│   │   │   │   │   │   ├── __init__.py
│   │   │   │   │   │   ├── alexnet.py
│   │   │   │   │   │   ├── bricks/
│   │   │   │   │   │   │   ├── __init__.py
│   │   │   │   │   │   │   ├── activation.py
│   │   │   │   │   │   │   ├── context_block.py
│   │   │   │   │   │   │   ├── conv.py
│   │   │   │   │   │   │   ├── conv2d_adaptive_padding.py
│   │   │   │   │   │   │   ├── conv_module.py
│   │   │   │   │   │   │   ├── conv_ws.py
│   │   │   │   │   │   │   ├── depthwise_separable_conv_module.py
│   │   │   │   │   │   │   ├── drop.py
│   │   │   │   │   │   │   ├── generalized_attention.py
│   │   │   │   │   │   │   ├── hsigmoid.py
│   │   │   │   │   │   │   ├── hswish.py
│   │   │   │   │   │   │   ├── non_local.py
│   │   │   │   │   │   │   ├── norm.py
│   │   │   │   │   │   │   ├── padding.py
│   │   │   │   │   │   │   ├── plugin.py
│   │   │   │   │   │   │   ├── registry.py
│   │   │   │   │   │   │   ├── scale.py
│   │   │   │   │   │   │   ├── swish.py
│   │   │   │   │   │   │   ├── transformer.py
│   │   │   │   │   │   │   ├── upsample.py
│   │   │   │   │   │   │   └── wrappers.py
│   │   │   │   │   │   ├── builder.py
│   │   │   │   │   │   ├── resnet.py
│   │   │   │   │   │   ├── utils/
│   │   │   │   │   │   │   ├── __init__.py
│   │   │   │   │   │   │   ├── flops_counter.py
│   │   │   │   │   │   │   ├── fuse_conv_bn.py
│   │   │   │   │   │   │   ├── sync_bn.py
│   │   │   │   │   │   │   └── weight_init.py
│   │   │   │   │   │   └── vgg.py
│   │   │   │   │   ├── engine/
│   │   │   │   │   │   ├── __init__.py
│   │   │   │   │   │   └── test.py
│   │   │   │   │   ├── fileio/
│   │   │   │   │   │   ├── __init__.py
│   │   │   │   │   │   ├── file_client.py
│   │   │   │   │   │   ├── handlers/
│   │   │   │   │   │   │   ├── __init__.py
│   │   │   │   │   │   │   ├── base.py
│   │   │   │   │   │   │   ├── json_handler.py
│   │   │   │   │   │   │   ├── pickle_handler.py
│   │   │   │   │   │   │   └── yaml_handler.py
│   │   │   │   │   │   ├── io.py
│   │   │   │   │   │   └── parse.py
│   │   │   │   │   ├── image/
│   │   │   │   │   │   ├── __init__.py
│   │   │   │   │   │   ├── colorspace.py
│   │   │   │   │   │   ├── geometric.py
│   │   │   │   │   │   ├── io.py
│   │   │   │   │   │   ├── misc.py
│   │   │   │   │   │   └── photometric.py
│   │   │   │   │   ├── model_zoo/
│   │   │   │   │   │   ├── deprecated.json
│   │   │   │   │   │   ├── mmcls.json
│   │   │   │   │   │   └── open_mmlab.json
│   │   │   │   │   ├── ops/
│   │   │   │   │   │   ├── __init__.py
│   │   │   │   │   │   ├── assign_score_withk.py
│   │   │   │   │   │   ├── ball_query.py
│   │   │   │   │   │   ├── bbox.py
│   │   │   │   │   │   ├── border_align.py
│   │   │   │   │   │   ├── box_iou_rotated.py
│   │   │   │   │   │   ├── carafe.py
│   │   │   │   │   │   ├── cc_attention.py
│   │   │   │   │   │   ├── contour_expand.py
│   │   │   │   │   │   ├── corner_pool.py
│   │   │   │   │   │   ├── correlation.py
│   │   │   │   │   │   ├── deform_conv.py
│   │   │   │   │   │   ├── deform_roi_pool.py
│   │   │   │   │   │   ├── deprecated_wrappers.py
│   │   │   │   │   │   ├── focal_loss.py
│   │   │   │   │   │   ├── furthest_point_sample.py
│   │   │   │   │   │   ├── fused_bias_leakyrelu.py
│   │   │   │   │   │   ├── gather_points.py
│   │   │   │   │   │   ├── group_points.py
│   │   │   │   │   │   ├── info.py
│   │   │   │   │   │   ├── iou3d.py
│   │   │   │   │   │   ├── knn.py
│   │   │   │   │   │   ├── masked_conv.py
│   │   │   │   │   │   ├── merge_cells.py
│   │   │   │   │   │   ├── modulated_deform_conv.py
│   │   │   │   │   │   ├── multi_scale_deform_attn.py
│   │   │   │   │   │   ├── nms.py
│   │   │   │   │   │   ├── pixel_group.py
│   │   │   │   │   │   ├── point_sample.py
│   │   │   │   │   │   ├── points_in_boxes.py
│   │   │   │   │   │   ├── points_sampler.py
│   │   │   │   │   │   ├── psa_mask.py
│   │   │   │   │   │   ├── roi_align.py
│   │   │   │   │   │   ├── roi_align_rotated.py
│   │   │   │   │   │   ├── roi_pool.py
│   │   │   │   │   │   ├── roiaware_pool3d.py
│   │   │   │   │   │   ├── roipoint_pool3d.py
│   │   │   │   │   │   ├── saconv.py
│   │   │   │   │   │   ├── scatter_points.py
│   │   │   │   │   │   ├── sync_bn.py
│   │   │   │   │   │   ├── three_interpolate.py
│   │   │   │   │   │   ├── three_nn.py
│   │   │   │   │   │   ├── tin_shift.py
│   │   │   │   │   │   ├── upfirdn2d.py
│   │   │   │   │   │   └── voxelize.py
│   │   │   │   │   ├── parallel/
│   │   │   │   │   │   ├── __init__.py
│   │   │   │   │   │   ├── _functions.py
│   │   │   │   │   │   ├── collate.py
│   │   │   │   │   │   ├── data_container.py
│   │   │   │   │   │   ├── data_parallel.py
│   │   │   │   │   │   ├── distributed.py
│   │   │   │   │   │   ├── distributed_deprecated.py
│   │   │   │   │   │   ├── registry.py
│   │   │   │   │   │   ├── scatter_gather.py
│   │   │   │   │   │   └── utils.py
│   │   │   │   │   ├── runner/
│   │   │   │   │   │   ├── __init__.py
│   │   │   │   │   │   ├── base_module.py
│   │   │   │   │   │   ├── base_runner.py
│   │   │   │   │   │   ├── builder.py
│   │   │   │   │   │   ├── checkpoint.py
│   │   │   │   │   │   ├── default_constructor.py
│   │   │   │   │   │   ├── dist_utils.py
│   │   │   │   │   │   ├── epoch_based_runner.py
│   │   │   │   │   │   ├── fp16_utils.py
│   │   │   │   │   │   ├── hooks/
│   │   │   │   │   │   │   ├── __init__.py
│   │   │   │   │   │   │   ├── checkpoint.py
│   │   │   │   │   │   │   ├── closure.py
│   │   │   │   │   │   │   ├── ema.py
│   │   │   │   │   │   │   ├── evaluation.py
│   │   │   │   │   │   │   ├── hook.py
│   │   │   │   │   │   │   ├── iter_timer.py
│   │   │   │   │   │   │   ├── logger/
│   │   │   │   │   │   │   │   ├── __init__.py
│   │   │   │   │   │   │   │   ├── base.py
│   │   │   │   │   │   │   │   ├── dvclive.py
│   │   │   │   │   │   │   │   ├── mlflow.py
│   │   │   │   │   │   │   │   ├── neptune.py
│   │   │   │   │   │   │   │   ├── pavi.py
│   │   │   │   │   │   │   │   ├── tensorboard.py
│   │   │   │   │   │   │   │   ├── text.py
│   │   │   │   │   │   │   │   └── wandb.py
│   │   │   │   │   │   │   ├── lr_updater.py
│   │   │   │   │   │   │   ├── memory.py
│   │   │   │   │   │   │   ├── momentum_updater.py
│   │   │   │   │   │   │   ├── optimizer.py
│   │   │   │   │   │   │   ├── profiler.py
│   │   │   │   │   │   │   ├── sampler_seed.py
│   │   │   │   │   │   │   └── sync_buffer.py
│   │   │   │   │   │   ├── iter_based_runner.py
│   │   │   │   │   │   ├── log_buffer.py
│   │   │   │   │   │   ├── optimizer/
│   │   │   │   │   │   │   ├── __init__.py
│   │   │   │   │   │   │   ├── builder.py
│   │   │   │   │   │   │   └── default_constructor.py
│   │   │   │   │   │   ├── priority.py
│   │   │   │   │   │   └── utils.py
│   │   │   │   │   ├── utils/
│   │   │   │   │   │   ├── __init__.py
│   │   │   │   │   │   ├── config.py
│   │   │   │   │   │   ├── env.py
│   │   │   │   │   │   ├── ext_loader.py
│   │   │   │   │   │   ├── logging.py
│   │   │   │   │   │   ├── misc.py
│   │   │   │   │   │   ├── parrots_jit.py
│   │   │   │   │   │   ├── parrots_wrapper.py
│   │   │   │   │   │   ├── path.py
│   │   │   │   │   │   ├── progressbar.py
│   │   │   │   │   │   ├── registry.py
│   │   │   │   │   │   ├── testing.py
│   │   │   │   │   │   ├── timer.py
│   │   │   │   │   │   ├── trace.py
│   │   │   │   │   │   └── version_utils.py
│   │   │   │   │   ├── version.py
│   │   │   │   │   ├── video/
│   │   │   │   │   │   ├── __init__.py
│   │   │   │   │   │   ├── io.py
│   │   │   │   │   │   ├── optflow.py
│   │   │   │   │   │   └── processing.py
│   │   │   │   │   └── visualization/
│   │   │   │   │       ├── __init__.py
│   │   │   │   │       ├── color.py
│   │   │   │   │       ├── image.py
│   │   │   │   │       └── optflow.py
│   │   │   │   ├── mmcv_custom/
│   │   │   │   │   ├── __init__.py
│   │   │   │   │   └── checkpoint.py
│   │   │   │   └── mmseg/
│   │   │   │       ├── apis/
│   │   │   │       │   ├── __init__.py
│   │   │   │       │   ├── inference.py
│   │   │   │       │   ├── test.py
│   │   │   │       │   └── train.py
│   │   │   │       ├── core/
│   │   │   │       │   ├── __init__.py
│   │   │   │       │   ├── evaluation/
│   │   │   │       │   │   ├── __init__.py
│   │   │   │       │   │   ├── class_names.py
│   │   │   │       │   │   ├── eval_hooks.py
│   │   │   │       │   │   └── metrics.py
│   │   │   │       │   ├── seg/
│   │   │   │       │   │   ├── __init__.py
│   │   │   │       │   │   ├── builder.py
│   │   │   │       │   │   └── sampler/
│   │   │   │       │   │       ├── __init__.py
│   │   │   │       │   │       ├── base_pixel_sampler.py
│   │   │   │       │   │       └── ohem_pixel_sampler.py
│   │   │   │       │   └── utils/
│   │   │   │       │       ├── __init__.py
│   │   │   │       │       └── misc.py
│   │   │   │       ├── datasets/
│   │   │   │       │   ├── __init__.py
│   │   │   │       │   ├── ade.py
│   │   │   │       │   ├── builder.py
│   │   │   │       │   ├── chase_db1.py
│   │   │   │       │   ├── cityscapes.py
│   │   │   │       │   ├── custom.py
│   │   │   │       │   ├── dataset_wrappers.py
│   │   │   │       │   ├── drive.py
│   │   │   │       │   ├── hrf.py
│   │   │   │       │   ├── pascal_context.py
│   │   │   │       │   ├── pipelines/
│   │   │   │       │   │   ├── __init__.py
│   │   │   │       │   │   ├── compose.py
│   │   │   │       │   │   ├── formating.py
│   │   │   │       │   │   ├── loading.py
│   │   │   │       │   │   ├── test_time_aug.py
│   │   │   │       │   │   └── transforms.py
│   │   │   │       │   ├── stare.py
│   │   │   │       │   └── voc.py
│   │   │   │       ├── models/
│   │   │   │       │   ├── __init__.py
│   │   │   │       │   ├── backbones/
│   │   │   │       │   │   ├── __init__.py
│   │   │   │       │   │   ├── cgnet.py
│   │   │   │       │   │   ├── fast_scnn.py
│   │   │   │       │   │   ├── hrnet.py
│   │   │   │       │   │   ├── mobilenet_v2.py
│   │   │   │       │   │   ├── mobilenet_v3.py
│   │   │   │       │   │   ├── resnest.py
│   │   │   │       │   │   ├── resnet.py
│   │   │   │       │   │   ├── resnext.py
│   │   │   │       │   │   ├── unet.py
│   │   │   │       │   │   ├── uniformer.py
│   │   │   │       │   │   └── vit.py
│   │   │   │       │   ├── builder.py
│   │   │   │       │   ├── decode_heads/
│   │   │   │       │   │   ├── __init__.py
│   │   │   │       │   │   ├── ann_head.py
│   │   │   │       │   │   ├── apc_head.py
│   │   │   │       │   │   ├── aspp_head.py
│   │   │   │       │   │   ├── cascade_decode_head.py
│   │   │   │       │   │   ├── cc_head.py
│   │   │   │       │   │   ├── da_head.py
│   │   │   │       │   │   ├── decode_head.py
│   │   │   │       │   │   ├── dm_head.py
│   │   │   │       │   │   ├── dnl_head.py
│   │   │   │       │   │   ├── ema_head.py
│   │   │   │       │   │   ├── enc_head.py
│   │   │   │       │   │   ├── fcn_head.py
│   │   │   │       │   │   ├── fpn_head.py
│   │   │   │       │   │   ├── gc_head.py
│   │   │   │       │   │   ├── lraspp_head.py
│   │   │   │       │   │   ├── nl_head.py
│   │   │   │       │   │   ├── ocr_head.py
│   │   │   │       │   │   ├── point_head.py
│   │   │   │       │   │   ├── psa_head.py
│   │   │   │       │   │   ├── psp_head.py
│   │   │   │       │   │   ├── sep_aspp_head.py
│   │   │   │       │   │   ├── sep_fcn_head.py
│   │   │   │       │   │   └── uper_head.py
│   │   │   │       │   ├── losses/
│   │   │   │       │   │   ├── __init__.py
│   │   │   │       │   │   ├── accuracy.py
│   │   │   │       │   │   ├── cross_entropy_loss.py
│   │   │   │       │   │   ├── dice_loss.py
│   │   │   │       │   │   ├── lovasz_loss.py
│   │   │   │       │   │   └── utils.py
│   │   │   │       │   ├── necks/
│   │   │   │       │   │   ├── __init__.py
│   │   │   │       │   │   ├── fpn.py
│   │   │   │       │   │   └── multilevel_neck.py
│   │   │   │       │   ├── segmentors/
│   │   │   │       │   │   ├── __init__.py
│   │   │   │       │   │   ├── base.py
│   │   │   │       │   │   ├── cascade_encoder_decoder.py
│   │   │   │       │   │   └── encoder_decoder.py
│   │   │   │       │   └── utils/
│   │   │   │       │       ├── __init__.py
│   │   │   │       │       ├── drop.py
│   │   │   │       │       ├── inverted_residual.py
│   │   │   │       │       ├── make_divisible.py
│   │   │   │       │       ├── res_layer.py
│   │   │   │       │       ├── se_layer.py
│   │   │   │       │       ├── self_attention_block.py
│   │   │   │       │       ├── up_conv_block.py
│   │   │   │       │       └── weight_init.py
│   │   │   │       ├── ops/
│   │   │   │       │   ├── __init__.py
│   │   │   │       │   ├── encoding.py
│   │   │   │       │   └── wrappers.py
│   │   │   │       └── utils/
│   │   │   │           ├── __init__.py
│   │   │   │           ├── collect_env.py
│   │   │   │           └── logger.py
│   │   │   └── util.py
│   │   ├── config.py
│   │   ├── dist_utils.py
│   │   ├── gradcam.py
│   │   ├── logger.py
│   │   ├── optims.py
│   │   ├── registry.py
│   │   ├── utils.py
│   │   └── vqa_tools/
│   │       ├── __init__.py
│   │       ├── vqa.py
│   │       └── vqa_eval.py
│   ├── configs/
│   │   ├── datasets/
│   │   │   ├── aokvqa/
│   │   │   │   ├── defaults.yaml
│   │   │   │   └── defaults_instruct.yaml
│   │   │   ├── audiocaps/
│   │   │   │   ├── defaults_mm_cap.yaml
│   │   │   │   ├── defaults_mm_cap_instruct.yaml
│   │   │   │   └── defaults_mm_qa.yaml
│   │   │   ├── audioset/
│   │   │   │   ├── defaults_mm_cap.yaml
│   │   │   │   └── defaults_mm_cap_instruct.yaml
│   │   │   ├── avsd/
│   │   │   │   ├── defaults_dial.yaml
│   │   │   │   └── defaults_mm_dial_instruct.yaml
│   │   │   ├── blip_diffusion_datasets/
│   │   │   │   └── defaults.yaml
│   │   │   ├── capfilt14m/
│   │   │   │   ├── defaults_cap.yaml
│   │   │   │   └── defaults_cap_instruct.yaml
│   │   │   ├── charade/
│   │   │   │   ├── defaults_cap.yaml
│   │   │   │   └── defaults_cap_instruct.yaml
│   │   │   ├── clotho/
│   │   │   │   ├── defaults_mm_cap.yaml
│   │   │   │   ├── defaults_mm_cap_instruct.yaml
│   │   │   │   └── defaults_mm_qa.yaml
│   │   │   ├── coco/
│   │   │   │   ├── defaults_cap.yaml
│   │   │   │   ├── defaults_cap_instruct.yaml
│   │   │   │   ├── defaults_ret.yaml
│   │   │   │   ├── defaults_vqa.yaml
│   │   │   │   ├── defaults_vqa_instruct.yaml
│   │   │   │   └── eval_vqa.yaml
│   │   │   ├── coin/
│   │   │   │   ├── defaults_cap.yaml
│   │   │   │   └── defaults_cap_instruct.yaml
│   │   │   ├── conceptual_caption/
│   │   │   │   ├── defaults_12m.yaml
│   │   │   │   ├── defaults_12m_instruct.yaml
│   │   │   │   ├── defaults_3m.yaml
│   │   │   │   └── defaults_3m_instruct.yaml
│   │   │   ├── didemo/
│   │   │   │   └── defaults_ret.yaml
│   │   │   ├── discriminatory_reasoning/
│   │   │   │   ├── defaults_mm_audio_video.yaml
│   │   │   │   ├── defaults_mm_image_pc.yaml
│   │   │   │   └── discriminatory_dataset/
│   │   │   │       ├── audiocaps_discrn.json
│   │   │   │       └── objaverse_discrn.json
│   │   │   ├── esc50/
│   │   │   │   └── defaults_mm_cls.yaml
│   │   │   ├── flickr30k/
│   │   │   │   ├── defaults.yaml
│   │   │   │   ├── defaults_cap.yaml
│   │   │   │   └── defaults_cap_instruct.yaml
│   │   │   ├── gqa/
│   │   │   │   ├── balanced_testdev.yaml
│   │   │   │   ├── balanced_testdev_instruct.yaml
│   │   │   │   ├── balanced_val.yaml
│   │   │   │   ├── balanced_val_instruct.yaml
│   │   │   │   ├── defaults.yaml
│   │   │   │   └── defaults_instruct.yaml
│   │   │   ├── iconqa/
│   │   │   │   ├── defaults.yaml
│   │   │   │   └── defaults_instruct.yaml
│   │   │   ├── imagenet/
│   │   │   │   └── defaults.yaml
│   │   │   ├── laion/
│   │   │   │   ├── defaults_2B_multi.yaml
│   │   │   │   ├── defaults_400M.yaml
│   │   │   │   └── defaults_400M_instruct.yaml
│   │   │   ├── llava150k/
│   │   │   │   └── defaults_dial.yaml
│   │   │   ├── modelnet40/
│   │   │   │   └── defaults_cls.yaml
│   │   │   ├── msrvtt/
│   │   │   │   ├── defaults_cap.yaml
│   │   │   │   ├── defaults_cap_instruct.yaml
│   │   │   │   ├── defaults_qa.yaml
│   │   │   │   ├── defaults_qa_instruct.yaml
│   │   │   │   └── defaults_ret.yaml
│   │   │   ├── msvd/
│   │   │   │   ├── defaults_cap.yaml
│   │   │   │   ├── defaults_cap_instruct.yaml
│   │   │   │   ├── defaults_qa.yaml
│   │   │   │   └── defaults_qa_instruct.yaml
│   │   │   ├── music_avqa/
│   │   │   │   ├── defaults_mm_qa.yaml
│   │   │   │   └── defaults_mm_qa_instruct.yaml
│   │   │   ├── nlvr/
│   │   │   │   └── defaults.yaml
│   │   │   ├── nocaps/
│   │   │   │   └── defaults.yaml
│   │   │   ├── objaverse/
│   │   │   │   ├── defaults_mm_cap.yaml
│   │   │   │   ├── defaults_mm_cap_instruct.yaml
│   │   │   │   └── defaults_mm_qa.yaml
│   │   │   ├── ocrvqa/
│   │   │   │   ├── defaults.yaml
│   │   │   │   └── defaults_instruct.yaml
│   │   │   ├── okvqa/
│   │   │   │   ├── defaults.yaml
│   │   │   │   └── defaults_instruct.yaml
│   │   │   ├── sbu_caption/
│   │   │   │   ├── defaults.yaml
│   │   │   │   └── defaults_instruct.yaml
│   │   │   ├── scienceqa/
│   │   │   │   ├── defaults.yaml
│   │   │   │   └── defaults_instruct.yaml
│   │   │   ├── shapenet/
│   │   │   │   ├── defaults_mm_cap.yaml
│   │   │   │   └── defaults_mm_cap_instruct.yaml
│   │   │   ├── snli_ve/
│   │   │   │   ├── defaults.yaml
│   │   │   │   └── defaults_instruct.yaml
│   │   │   ├── textcaps/
│   │   │   │   ├── defaults.yaml
│   │   │   │   └── defaults_instruct.yaml
│   │   │   ├── valor/
│   │   │   │   ├── defaults_mm_cap.yaml
│   │   │   │   └── defaults_mm_cap_instruct.yaml
│   │   │   ├── vatex/
│   │   │   │   ├── defaults_cap.yaml
│   │   │   │   └── defaults_cap_instruct.yaml
│   │   │   ├── vg/
│   │   │   │   ├── defaults_caption.yaml
│   │   │   │   ├── defaults_caption_instruct.yaml
│   │   │   │   ├── defaults_vqa.yaml
│   │   │   │   └── defaults_vqa_instruct.yaml
│   │   │   ├── violin/
│   │   │   │   ├── defaults_cap.yaml
│   │   │   │   ├── defaults_cap_instruct.yaml
│   │   │   │   ├── defaults_entail.yaml
│   │   │   │   └── defaults_entail_instruct.yaml
│   │   │   ├── visdial/
│   │   │   │   ├── defaults_dial.yaml
│   │   │   │   └── defaults_dial_instruct.yaml
│   │   │   ├── vizwiz/
│   │   │   │   └── defaults.yaml
│   │   │   ├── vlep/
│   │   │   │   ├── defaults_cap.yaml
│   │   │   │   └── defaults_cap_instruct.yaml
│   │   │   ├── vsr/
│   │   │   │   ├── defaults.yaml
│   │   │   │   ├── defaults_classification.yaml
│   │   │   │   ├── defaults_classification_instruct.yaml
│   │   │   │   └── defaults_instruct.yaml
│   │   │   ├── wavcaps/
│   │   │   │   ├── defaults_mm_cap.yaml
│   │   │   │   └── defaults_mm_cap_instruct.yaml
│   │   │   ├── webvid/
│   │   │   │   ├── defaults_cap.yaml
│   │   │   │   └── defaults_cap_instruct.yaml
│   │   │   ├── youcook/
│   │   │   │   ├── defaults_cap.yaml
│   │   │   │   └── defaults_cap_instruct.yaml
│   │   │   └── yt8m/
│   │   │       └── defaults_mm_dial.yaml
│   │   ├── default.yaml
│   │   └── models/
│   │       ├── albef_classification_ve.yaml
│   │       ├── albef_feature_extractor.yaml
│   │       ├── albef_nlvr.yaml
│   │       ├── albef_pretrain_base.yaml
│   │       ├── albef_retrieval_coco.yaml
│   │       ├── albef_retrieval_flickr.yaml
│   │       ├── albef_vqav2.yaml
│   │       ├── alpro_qa_msrvtt.yaml
│   │       ├── alpro_qa_msvd.yaml
│   │       ├── alpro_retrieval_didemo.yaml
│   │       ├── alpro_retrieval_msrvtt.yaml
│   │       ├── bert_config.json
│   │       ├── bert_config_alpro.json
│   │       ├── blip-diffusion/
│   │       │   ├── blip_diffusion_base.yaml
│   │       │   ├── blip_diffusion_controlnet_canny.yaml
│   │       │   ├── blip_diffusion_controlnet_depth.yaml
│   │       │   └── blip_diffusion_controlnet_hed.yaml
│   │       ├── blip2/
│   │       │   ├── blip2_caption_flant5xl.yaml
│   │       │   ├── blip2_caption_opt2.7b.yaml
│   │       │   ├── blip2_caption_opt6.7b.yaml
│   │       │   ├── blip2_coco.yaml
│   │       │   ├── blip2_instruct_flant5xl.yaml
│   │       │   ├── blip2_instruct_flant5xxl.yaml
│   │       │   ├── blip2_instruct_vicuna13b.yaml
│   │       │   ├── blip2_instruct_vicuna7b.yaml
│   │       │   ├── blip2_pretrain.yaml
│   │       │   ├── blip2_pretrain_flant5xl.yaml
│   │       │   ├── blip2_pretrain_flant5xl_vitL.yaml
│   │       │   ├── blip2_pretrain_flant5xxl.yaml
│   │       │   ├── blip2_pretrain_llama7b.yaml
│   │       │   ├── blip2_pretrain_opt2.7b.yaml
│   │       │   ├── blip2_pretrain_opt6.7b.yaml
│   │       │   ├── blip2_pretrain_vitL.yaml
│   │       │   ├── blip2_xinstruct_vicuna13b.yaml
│   │       │   └── blip2_xinstruct_vicuna7b.yaml
│   │       ├── blip_caption_base_coco.yaml
│   │       ├── blip_caption_large_coco.yaml
│   │       ├── blip_classification_base.yaml
│   │       ├── blip_feature_extractor_base.yaml
│   │       ├── blip_itm_base.yaml
│   │       ├── blip_itm_large.yaml
│   │       ├── blip_nlvr.yaml
│   │       ├── blip_pretrain_base.yaml
│   │       ├── blip_pretrain_large.yaml
│   │       ├── blip_retrieval_coco.yaml
│   │       ├── blip_retrieval_flickr.yaml
│   │       ├── blip_vqa_aokvqa.yaml
│   │       ├── blip_vqa_okvqa.yaml
│   │       ├── blip_vqav2.yaml
│   │       ├── clip/
│   │       │   ├── RN101-quickgelu.json
│   │       │   ├── RN101.json
│   │       │   ├── RN50-quickgelu.json
│   │       │   ├── RN50.json
│   │       │   ├── RN50x16.json
│   │       │   ├── RN50x4.json
│   │       │   ├── ViT-B-16-plus-240.json
│   │       │   ├── ViT-B-16-plus.json
│   │       │   ├── ViT-B-16.json
│   │       │   ├── ViT-B-32-plus-256.json
│   │       │   ├── ViT-B-32-quickgelu.json
│   │       │   ├── ViT-B-32.json
│   │       │   ├── ViT-H-14.json
│   │       │   ├── ViT-H-16.json
│   │       │   ├── ViT-L-14-280.json
│   │       │   ├── ViT-L-14-336.json
│   │       │   ├── ViT-L-14.json
│   │       │   ├── ViT-L-16-320.json
│   │       │   ├── ViT-L-16.json
│   │       │   ├── ViT-g-14.json
│   │       │   ├── timm-efficientnetv2_rw_s.json
│   │       │   ├── timm-resnet50d.json
│   │       │   ├── timm-resnetaa50d.json
│   │       │   ├── timm-resnetblur50.json
│   │       │   ├── timm-swin_base_patch4_window7_224.json
│   │       │   ├── timm-vit_base_patch16_224.json
│   │       │   ├── timm-vit_base_patch32_224.json
│   │       │   └── timm-vit_small_patch16_224.json
│   │       ├── clip_resnet50.yaml
│   │       ├── clip_vit_base16.yaml
│   │       ├── clip_vit_base32.yaml
│   │       ├── clip_vit_large14.yaml
│   │       ├── clip_vit_large14_336.yaml
│   │       ├── gpt_dialogue_base.yaml
│   │       ├── img2prompt-vqa/
│   │       │   └── img2prompt_vqa_base.yaml
│   │       ├── med_config.json
│   │       ├── med_config_albef.json
│   │       ├── med_large_config.json
│   │       └── pnp-vqa/
│   │           ├── pnp_vqa_3b.yaml
│   │           ├── pnp_vqa_base.yaml
│   │           ├── pnp_vqa_large.yaml
│   │           ├── unifiedqav2_3b_config.json
│   │           ├── unifiedqav2_base_config.json
│   │           └── unifiedqav2_large_config.json
│   ├── datasets/
│   │   ├── builders/
│   │   │   ├── __init__.py
│   │   │   ├── audio_caption_builder.py
│   │   │   ├── audio_qa_builder.py
│   │   │   ├── base_dataset_builder.py
│   │   │   ├── caption_builder.py
│   │   │   ├── classification_builder.py
│   │   │   ├── dialogue_builder.py
│   │   │   ├── discrn_builders.py
│   │   │   ├── image_text_pair_builder.py
│   │   │   ├── imagefolder_builder.py
│   │   │   ├── object3d_caption_builder.py
│   │   │   ├── object3d_classification_builder.py
│   │   │   ├── object3d_qa_builder.py
│   │   │   ├── retrieval_builder.py
│   │   │   ├── text_to_image_generation_builder.py
│   │   │   ├── video_qa_builder.py
│   │   │   └── vqa_builder.py
│   │   ├── data_utils.py
│   │   ├── datasets/
│   │   │   ├── aok_vqa_datasets.py
│   │   │   ├── audio_captioning_datasets.py
│   │   │   ├── audio_classification_datasets.py
│   │   │   ├── audio_qa_datasets.py
│   │   │   ├── avsd_dialogue_datasets.py
│   │   │   ├── base_dataset.py
│   │   │   ├── capfilt_dataset.py
│   │   │   ├── caption_datasets.py
│   │   │   ├── coco_caption_datasets.py
│   │   │   ├── coco_vqa_datasets.py
│   │   │   ├── dataloader_utils.py
│   │   │   ├── dialogue_datasets.py
│   │   │   ├── discriminatory_reasoning_datasets.py
│   │   │   ├── gqa_datasets.py
│   │   │   ├── iconqa_datasets.py
│   │   │   ├── image_text_pair_datasets.py
│   │   │   ├── imagefolder_dataset.py
│   │   │   ├── laion_dataset.py
│   │   │   ├── llava150k_dataset.py
│   │   │   ├── multimodal_classification_datasets.py
│   │   │   ├── music_avqa.py
│   │   │   ├── nlvr_datasets.py
│   │   │   ├── object3d_captioning_datasets.py
│   │   │   ├── object3d_classification_datasets.py
│   │   │   ├── object3d_qa_datasets.py
│   │   │   ├── ocr_datasets.py
│   │   │   ├── retrieval_datasets.py
│   │   │   ├── snli_ve_datasets.py
│   │   │   ├── subject_driven_t2i_dataset.py
│   │   │   ├── textcaps_datasets.py
│   │   │   ├── valor_caption.py
│   │   │   ├── vatex_captioning_datasets.py
│   │   │   ├── vg_vqa_datasets.py
│   │   │   ├── video_caption_datasets.py
│   │   │   ├── video_vqa_datasets.py
│   │   │   ├── violin_dataset.py
│   │   │   ├── visdial_dialogue_datasets.py
│   │   │   ├── vizwiz_vqa_datasets.py
│   │   │   ├── vlep_dataset.py
│   │   │   ├── vqa_datasets.py
│   │   │   ├── vsr_datasets.py
│   │   │   └── yt8m_video_dialogue_datasets.py
│   │   └── download_scripts/
│   │       ├── DownloadConceptualCaptions/
│   │       │   ├── LICENSE
│   │       │   ├── README.md
│   │       │   ├── create_annotation_12m.ipynb
│   │       │   ├── create_annotation_3m.ipynb
│   │       │   ├── download_data_cc12m.py
│   │       │   └── download_data_cc3m.py
│   │       ├── download_charade.py
│   │       ├── download_coco.py
│   │       ├── download_coin.py
│   │       ├── download_didemo.py
│   │       ├── download_flickr.py
│   │       ├── download_gqa.py
│   │       ├── download_iconqa.py
│   │       ├── download_msrvtt.py
│   │       ├── download_msvd.py
│   │       ├── download_nocaps.py
│   │       ├── download_sbu.py
│   │       ├── download_vg.py
│   │       └── download_violin.py
│   ├── models/
│   │   ├── __init__.py
│   │   ├── albef_models/
│   │   │   ├── __init__.py
│   │   │   ├── albef_classification.py
│   │   │   ├── albef_feature_extractor.py
│   │   │   ├── albef_nlvr.py
│   │   │   ├── albef_outputs.py
│   │   │   ├── albef_pretrain.py
│   │   │   ├── albef_retrieval.py
│   │   │   └── albef_vqa.py
│   │   ├── alpro_models/
│   │   │   ├── __init__.py
│   │   │   ├── alpro_outputs.py
│   │   │   ├── alpro_qa.py
│   │   │   └── alpro_retrieval.py
│   │   ├── base_model.py
│   │   ├── beats/
│   │   │   ├── BEATs.py
│   │   │   ├── LICENSE_BEATs.txt
│   │   │   ├── README.md
│   │   │   ├── Tokenizers.py
│   │   │   ├── backbone.py
│   │   │   ├── modules.py
│   │   │   └── quantizer.py
│   │   ├── beats_encoder.py
│   │   ├── blip2_models/
│   │   │   ├── Qformer.py
│   │   │   ├── __init__.py
│   │   │   ├── blip2.py
│   │   │   ├── blip2_image_text_matching.py
│   │   │   ├── blip2_opt.py
│   │   │   ├── blip2_qformer.py
│   │   │   ├── blip2_t5.py
│   │   │   ├── blip2_t5_instruct.py
│   │   │   ├── blip2_vicuna_instruct.py
│   │   │   ├── blip2_vicuna_xinstruct.py
│   │   │   ├── modeling_llama.py
│   │   │   ├── modeling_opt.py
│   │   │   └── modeling_t5.py
│   │   ├── blip_diffusion_models/
│   │   │   ├── __init__.py
│   │   │   ├── blip_diffusion.py
│   │   │   ├── modeling_ctx_clip.py
│   │   │   ├── ptp_utils.py
│   │   │   └── utils.py
│   │   ├── blip_models/
│   │   │   ├── __init__.py
│   │   │   ├── blip.py
│   │   │   ├── blip_caption.py
│   │   │   ├── blip_classification.py
│   │   │   ├── blip_feature_extractor.py
│   │   │   ├── blip_image_text_matching.py
│   │   │   ├── blip_nlvr.py
│   │   │   ├── blip_outputs.py
│   │   │   ├── blip_pretrain.py
│   │   │   ├── blip_retrieval.py
│   │   │   ├── blip_vqa.py
│   │   │   └── nlvr_encoder.py
│   │   ├── clip_models/
│   │   │   ├── __init__.py
│   │   │   ├── clip_outputs.py
│   │   │   ├── loss.py
│   │   │   ├── model.py
│   │   │   ├── pretrained.py
│   │   │   ├── timm_model.py
│   │   │   ├── tokenizer.py
│   │   │   ├── transform.py
│   │   │   └── utils.py
│   │   ├── clip_vit.py
│   │   ├── eva_vit.py
│   │   ├── gpt_models/
│   │   │   └── gpt_dialogue.py
│   │   ├── img2prompt_models/
│   │   │   ├── __init__.py
│   │   │   └── img2prompt_vqa.py
│   │   ├── med.py
│   │   ├── pnp_vqa_models/
│   │   │   ├── __init__.py
│   │   │   ├── pnp_unifiedqav2_fid.py
│   │   │   └── pnp_vqa.py
│   │   ├── timesformer/
│   │   │   ├── __init__.py
│   │   │   ├── conv2d_same.py
│   │   │   ├── features.py
│   │   │   ├── helpers.py
│   │   │   ├── linear.py
│   │   │   ├── vit.py
│   │   │   └── vit_utils.py
│   │   ├── ulip_models/
│   │   │   ├── ULIP_models.py
│   │   │   ├── losses.py
│   │   │   ├── pointbert/
│   │   │   │   ├── PointTransformer_8192point.yaml
│   │   │   │   ├── checkpoint.py
│   │   │   │   ├── dvae.py
│   │   │   │   ├── logger.py
│   │   │   │   ├── misc.py
│   │   │   │   └── point_encoder.py
│   │   │   ├── ulip_scaled_up_config.yaml
│   │   │   └── utils/
│   │   │       ├── __init__.py
│   │   │       ├── build.py
│   │   │       ├── config.py
│   │   │       ├── io.py
│   │   │       ├── logger.py
│   │   │       ├── registry.py
│   │   │       ├── tokenizer.py
│   │   │       └── utils.py
│   │   └── vit.py
│   ├── processors/
│   │   ├── __init__.py
│   │   ├── alpro_processors.py
│   │   ├── audio_processors.py
│   │   ├── base_processor.py
│   │   ├── blip_diffusion_processors.py
│   │   ├── blip_processors.py
│   │   ├── clip_processors.py
│   │   ├── functional_video.py
│   │   ├── gpt_processors.py
│   │   ├── instruction_text_processors.py
│   │   ├── randaugment.py
│   │   ├── transforms_video.py
│   │   └── ulip_processors.py
│   ├── projects/
│   │   ├── albef/
│   │   │   ├── eval/
│   │   │   │   ├── nlvr_eval.yaml
│   │   │   │   ├── ret_coco_eval.yaml
│   │   │   │   ├── ret_flickr30k_eval.yaml
│   │   │   │   ├── snli_ve_eval.yaml
│   │   │   │   ├── vqa_test.yaml
│   │   │   │   └── vqa_val.yaml
│   │   │   └── train/
│   │   │       ├── aokvqa_ft.yaml
│   │   │       ├── nlvr_ft.yaml
│   │   │       ├── okvqa_ft.yaml
│   │   │       ├── pretrain.yaml
│   │   │       ├── ret_coco_ft.yaml
│   │   │       ├── ret_flickr30k_ft.yaml
│   │   │       ├── snli_ve_ft.yaml
│   │   │       └── vqa_ft.yaml
│   │   ├── alpro/
│   │   │   ├── eval/
│   │   │   │   ├── didemo_ret_eval.yaml
│   │   │   │   ├── msrvtt_qa_eval.yaml
│   │   │   │   ├── msrvtt_ret_eval.yaml
│   │   │   │   └── msvd_qa_eval.yaml
│   │   │   └── train/
│   │   │       ├── didemo_ret_ft.yaml
│   │   │       ├── msrvtt_qa_ft.yaml
│   │   │       ├── msrvtt_retrieval_ft.yaml
│   │   │       └── msvd_qa_ft.yaml
│   │   ├── blip/
│   │   │   ├── coco_cap_ft_iter.yaml
│   │   │   ├── eval/
│   │   │   │   ├── aokvqa_eval.yaml
│   │   │   │   ├── caption_coco_eval.yaml
│   │   │   │   ├── caption_coco_eval_large.yaml
│   │   │   │   ├── nlvr_eval.yaml
│   │   │   │   ├── nocaps_eval.yaml
│   │   │   │   ├── okvqa_eval.yaml
│   │   │   │   ├── ret_coco_eval.yaml
│   │   │   │   ├── ret_flickr_eval.yaml
│   │   │   │   └── vqav2_eval.yaml
│   │   │   └── train/
│   │   │       ├── aokvqa_ft.yaml
│   │   │       ├── caption_coco_ft.yaml
│   │   │       ├── caption_coco_large_ft.yaml
│   │   │       ├── nlvr_ft.yaml
│   │   │       ├── okvqa_ft.yaml
│   │   │       ├── pretrain_14m.yaml
│   │   │       ├── retrieval_coco_ft.yaml
│   │   │       ├── retrieval_flickr_ft.yaml
│   │   │       └── vqav2_ft.yaml
│   │   ├── blip2/
│   │   │   ├── eval/
│   │   │   │   ├── caption_coco_flant5xl_eval.yaml
│   │   │   │   ├── caption_coco_opt2.7b_eval.yaml
│   │   │   │   ├── caption_coco_opt6.7b_eval.yaml
│   │   │   │   ├── caption_nocaps_out_domain_flant5xl_eval.yaml
│   │   │   │   ├── caption_nocaps_out_domain_flant5xxl_eval.yaml
│   │   │   │   ├── gqa_zeroshot_flant5xl_eval.yaml
│   │   │   │   ├── okvqa_zeroshot_flant5xl_eval.yaml
│   │   │   │   ├── ret_coco_eval.yaml
│   │   │   │   ├── ret_flickr_eval.yaml
│   │   │   │   ├── vqav2_zeroshot_flant5xl_eval.yaml
│   │   │   │   └── vqav2_zeroshot_opt_eval.yaml
│   │   │   └── train/
│   │   │       ├── caption_coco_ft.yaml
│   │   │       ├── pretrain_stage1.yaml
│   │   │       ├── pretrain_stage2.yaml
│   │   │       └── retrieval_coco_ft.yaml
│   │   ├── blip_diffusion/
│   │   │   ├── finetune-db-dog.yaml
│   │   │   ├── finetune-db-pink-dress.yaml
│   │   │   ├── finetune-db-shein-jacket.yaml
│   │   │   └── finetune-db-template.yaml
│   │   ├── clip/
│   │   │   ├── exp_coco_ret_eval.yaml
│   │   │   ├── exp_flickr_ret_eval.yaml
│   │   │   └── exp_imnet_zs_eval.yaml
│   │   ├── gpt/
│   │   │   ├── eval/
│   │   │   │   └── dialogue_avsd_eval.yaml
│   │   │   └── train/
│   │   │       └── dialogue_avsd_ft.yaml
│   │   ├── instructblip/
│   │   │   ├── caption_coco_flant5xl_eval_test.yaml
│   │   │   ├── caption_coco_flant5xl_eval_val.yaml
│   │   │   ├── caption_coco_flant5xxl_eval_test.yaml
│   │   │   ├── caption_coco_flant5xxl_eval_val.yaml
│   │   │   ├── caption_coco_vicuna13b_eval_test.yaml
│   │   │   ├── caption_coco_vicuna13b_eval_val.yaml
│   │   │   ├── caption_coco_vicuna7b_eval_test.yaml
│   │   │   ├── caption_coco_vicuna7b_eval_val.yaml
│   │   │   ├── caption_msrvtt_flant5xl_eval_test.yaml
│   │   │   ├── caption_msrvtt_flant5xl_eval_val.yaml
│   │   │   ├── caption_msrvtt_flant5xxl_eval_test.yaml
│   │   │   ├── caption_msrvtt_flant5xxl_eval_val.yaml
│   │   │   ├── caption_msrvtt_vicuna13b_eval_test.yaml
│   │   │   ├── caption_msrvtt_vicuna13b_eval_val.yaml
│   │   │   ├── caption_msrvtt_vicuna7b_eval_test.yaml
│   │   │   ├── caption_msrvtt_vicuna7b_eval_val.yaml
│   │   │   ├── caption_msvd_flant5xl_eval.yaml
│   │   │   ├── caption_msvd_flant5xxl_eval.yaml
│   │   │   ├── caption_msvd_vicuna13b_eval.yaml
│   │   │   ├── caption_msvd_vicuna7b_eval.yaml
│   │   │   ├── caption_nocaps_out_domain_flant5xl_eval.yaml
│   │   │   ├── caption_nocaps_out_domain_flant5xxl_eval.yaml
│   │   │   ├── caption_nocaps_out_domain_vicuna13b_eval.yaml
│   │   │   ├── caption_nocaps_out_domain_vicuna7b_eval.yaml
│   │   │   ├── caption_vatex_flant5xl_eval.yaml
│   │   │   ├── caption_vatex_flant5xxl_eval.yaml
│   │   │   ├── caption_vatex_vicuna13b_eval.yaml
│   │   │   ├── caption_vatex_vicuna7b_eval.yaml
│   │   │   ├── classification_modelnet40_vicuna13b.yaml
│   │   │   ├── classification_modelnet40_vicuna7b.yaml
│   │   │   ├── classification_snlive_flant5xl.yaml
│   │   │   ├── classification_snlive_flant5xxl.yaml
│   │   │   ├── classification_snlive_vicuna13b.yaml
│   │   │   ├── classification_snlive_vicuna13b_test.yaml
│   │   │   ├── classification_snlive_vicuna7b_test.yaml
│   │   │   ├── classification_snlive_vicuna7b_val.yaml
│   │   │   ├── completion_modelnet40_vicuna13b.yaml
│   │   │   ├── completion_modelnet40_vicuna7b.yaml
│   │   │   ├── qa_msrvtt_flant5xl_eval_test.yaml
│   │   │   ├── qa_msrvtt_flant5xxl_eval_test.yaml
│   │   │   ├── qa_msrvtt_vicuna13b_eval_test.yaml
│   │   │   ├── qa_msrvtt_vicuna7b_eval_test.yaml
│   │   │   ├── qa_msvd_flant5xl_eval.yaml
│   │   │   ├── qa_msvd_flant5xxl_eval.yaml
│   │   │   ├── qa_msvd_vicuna13b_eval.yaml
│   │   │   ├── qa_msvd_vicuna7b_eval.yaml
│   │   │   ├── qa_okvqa_flant5xl_eval.yaml
│   │   │   ├── qa_okvqa_flant5xxl_eval.yaml
│   │   │   ├── qa_okvqa_vicuna13b_eval.yaml
│   │   │   └── qa_okvqa_vicuna7b_eval.yaml
│   │   ├── pnp-vqa/
│   │   │   └── eval/
│   │   │       ├── gqa_eval.yaml
│   │   │       ├── gqa_eval_3b.yaml
│   │   │       ├── gqa_eval_large.yaml
│   │   │       ├── okvqa_eval.yaml
│   │   │       ├── okvqa_eval_3b.yaml
│   │   │       ├── okvqa_eval_large.yaml
│   │   │       ├── vqav2_eval.yaml
│   │   │       ├── vqav2_eval_3b.yaml
│   │   │       ├── vqav2_eval_large.yaml
│   │   │       ├── vqav2_test_eval.yaml
│   │   │       ├── vqav2_test_eval_3b.yaml
│   │   │       └── vqav2_test_eval_large.yaml
│   │   └── xinstruct_blip/
│   │       ├── eval/
│   │       │   ├── discrn/
│   │       │   │   ├── audio_video_caption.yaml
│   │       │   │   ├── audio_video_caption_13b.yaml
│   │       │   │   ├── audio_video_describe.yaml
│   │       │   │   ├── audio_video_describe_13b.yaml
│   │       │   │   ├── audio_video_describe_nocue.yaml
│   │       │   │   ├── audio_video_describe_proj copy.yaml
│   │       │   │   ├── audio_video_describe_proj.yaml
│   │       │   │   ├── audio_video_describe_rand_init.yaml
│   │       │   │   ├── image_3d_caption.yaml
│   │       │   │   ├── image_3d_caption_13b.yaml
│   │       │   │   ├── image_3d_describe.yaml
│   │       │   │   ├── image_3d_describe_13b.yaml
│   │       │   │   ├── image_3d_describe_no_init.yaml
│   │       │   │   ├── image_3d_describe_nocue.yaml
│   │       │   │   └── image_3d_describe_proj.yaml
│   │       │   ├── vicuna13b/
│   │       │   │   ├── audio/
│   │       │   │   │   ├── audiocaps_captioning_qa.yaml
│   │       │   │   │   ├── audiocaps_captioning_test.yaml
│   │       │   │   │   ├── audiocaps_captioning_val.yaml
│   │       │   │   │   ├── clothoQA_captioning.yaml
│   │       │   │   │   ├── clothov1_captioning.yaml
│   │       │   │   │   ├── clothov2_captioning.yaml
│   │       │   │   │   ├── esc50_classification.yaml
│   │       │   │   │   └── esc50_classification_completion.yaml
│   │       │   │   ├── crossmodal/
│   │       │   │   │   ├── musicavqa/
│   │       │   │   │   │   ├── musicavqa_audio_eval.yaml
│   │       │   │   │   │   ├── musicavqa_joint_eval.yaml
│   │       │   │   │   │   └── musicavqa_video_eval.yaml
│   │       │   │   │   └── vatex/
│   │       │   │   │       ├── vatex_audio_captioning.yaml
│   │       │   │   │       ├── vatex_captioning.yaml
│   │       │   │   │       ├── vatex_joint_captioning.yaml
│   │       │   │   │       └── vatex_joint_captioning_interleave.yaml
│   │       │   │   ├── image/
│   │       │   │   │   ├── coco_captioning_test.yaml
│   │       │   │   │   ├── coco_captioning_val.yaml
│   │       │   │   │   ├── flickr30k_captioning.yaml
│   │       │   │   │   ├── gqa_qa.yaml
│   │       │   │   │   ├── nocaps_captioning.yaml
│   │       │   │   │   ├── nocaps_out_domain_captioning.yaml
│   │       │   │   │   ├── okvqa_qa.yaml
│   │       │   │   │   ├── snlive_classification_test.yaml
│   │       │   │   │   ├── snlive_classification_val.yaml
│   │       │   │   │   └── vizwiz_qa.yaml
│   │       │   │   ├── image_with_coco/
│   │       │   │   │   ├── coco_captioning_test.yaml
│   │       │   │   │   ├── coco_captioning_val.yaml
│   │       │   │   │   ├── flickr30k_captioning.yaml
│   │       │   │   │   ├── gqa_qa.yaml
│   │       │   │   │   ├── nocaps_captioning.yaml
│   │       │   │   │   ├── nocaps_out_domain_captioning.yaml
│   │       │   │   │   ├── okvqa_qa.yaml
│   │       │   │   │   ├── snlive_classification_test.yaml
│   │       │   │   │   ├── snlive_classification_val.yaml
│   │       │   │   │   └── vizwiz_qa.yaml
│   │       │   │   ├── pc/
│   │       │   │   │   ├── modelnet40_classification.yaml
│   │       │   │   │   ├── modelnet40_completion.yaml
│   │       │   │   │   ├── objaverse_captioning.yaml
│   │       │   │   │   └── objaverse_qa.yaml
│   │       │   │   ├── video/
│   │       │   │   │   ├── msrvtt_captioning.yaml
│   │       │   │   │   ├── msrvtt_captioning_test.yaml
│   │       │   │   │   ├── msrvtt_captioning_val.yaml
│   │       │   │   │   ├── msrvtt_qa_test.yaml
│   │       │   │   │   ├── msrvtt_qa_val.yaml
│   │       │   │   │   ├── msvd_captioning.yaml
│   │       │   │   │   ├── msvd_qa.yaml
│   │       │   │   │   ├── vatex_audio_captioning.yaml
│   │       │   │   │   ├── vatex_captioning.yaml
│   │       │   │   │   ├── vatex_joint_captioning.yaml
│   │       │   │   │   └── vatex_joint_captioning_interleave.yaml
│   │       │   │   └── video_image/
│   │       │   │       ├── msvd_captioning.yaml
│   │       │   │       ├── msvd_qa.yaml
│   │       │   │       └── vatex_captioning.yaml
│   │       │   ├── vicuna7b/
│   │       │   │   ├── audio/
│   │       │   │   │   ├── audiocaps_captioning_qa.yaml
│   │       │   │   │   ├── audiocaps_captioning_test.yaml
│   │       │   │   │   ├── audiocaps_captioning_val.yaml
│   │       │   │   │   ├── clothoQA_captioning.yaml
│   │       │   │   │   ├── clothov1_captioning.yaml
│   │       │   │   │   ├── clothov2_captioning.yaml
│   │       │   │   │   ├── esc50_classification.yaml
│   │       │   │   │   └── esc50_classification_completion.yaml
│   │       │   │   ├── audio_no_init/
│   │       │   │   │   ├── audiocaps_captioning_qa.yaml
│   │       │   │   │   ├── audiocaps_captioning_test.yaml
│   │       │   │   │   ├── audiocaps_captioning_val.yaml
│   │       │   │   │   ├── clothoQA_captioning.yaml
│   │       │   │   │   ├── clothov1_captioning.yaml
│   │       │   │   │   ├── clothov2_captioning.yaml
│   │       │   │   │   ├── esc50_classification.yaml
│   │       │   │   │   └── esc50_classification_completion.yaml
│   │       │   │   ├── audio_projection_only/
│   │       │   │   │   ├── audiocaps_captioning_qa.yaml
│   │       │   │   │   ├── audiocaps_captioning_test.yaml
│   │       │   │   │   ├── audiocaps_captioning_val.yaml
│   │       │   │   │   ├── clothoQA_captioning.yaml
│   │       │   │   │   ├── clothov1_captioning.yaml
│   │       │   │   │   ├── clothov2_captioning.yaml
│   │       │   │   │   ├── esc50_classification.yaml
│   │       │   │   │   └── esc50_classification_completion.yaml
│   │       │   │   ├── audio_projection_only_nocue/
│   │       │   │   │   ├── audiocaps_captioning_qa.yaml
│   │       │   │   │   ├── audiocaps_captioning_test.yaml
│   │       │   │   │   ├── audiocaps_captioning_val.yaml
│   │       │   │   │   ├── clothoQA_captioning.yaml
│   │       │   │   │   ├── clothov1_captioning.yaml
│   │       │   │   │   ├── clothov2_captioning.yaml
│   │       │   │   │   ├── esc50_classification.yaml
│   │       │   │   │   └── esc50_classification_completion.yaml
│   │       │   │   ├── crossmodal/
│   │       │   │   │   ├── musicavqa/
│   │       │   │   │   │   ├── musicavqa_audio_eval.yaml
│   │       │   │   │   │   ├── musicavqa_joint_eval.yaml
│   │       │   │   │   │   └── musicavqa_video_eval.yaml
│   │       │   │   │   └── vatex/
│   │       │   │   │       ├── vatex_audio_captioning.yaml
│   │       │   │   │       ├── vatex_captioning.yaml
│   │       │   │   │       ├── vatex_joint_captioning.yaml
│   │       │   │   │       └── vatex_joint_captioning_interleave.yaml
│   │       │   │   ├── image/
│   │       │   │   │   ├── coco_captioning_test.yaml
│   │       │   │   │   ├── coco_captioning_val.yaml
│   │       │   │   │   ├── flickr30k_captioning.yaml
│   │       │   │   │   ├── gqa_qa.yaml
│   │       │   │   │   ├── gqa_qa_val.yaml
│   │       │   │   │   ├── nocaps_captioning.yaml
│   │       │   │   │   ├── nocaps_out_domain_captioning.yaml
│   │       │   │   │   ├── okvqa_qa.yaml
│   │       │   │   │   ├── snlive_classification_test.yaml
│   │       │   │   │   ├── snlive_classification_val.yaml
│   │       │   │   │   └── vizwiz_qa.yaml
│   │       │   │   ├── image_full_init/
│   │       │   │   │   ├── coco_captioning_test.yaml
│   │       │   │   │   ├── coco_captioning_val.yaml
│   │       │   │   │   ├── flickr30k_captioning.yaml
│   │       │   │   │   ├── gqa_qa.yaml
│   │       │   │   │   ├── gqa_qa_val.yaml
│   │       │   │   │   ├── nocaps_captioning.yaml
│   │       │   │   │   ├── nocaps_out_domain_captioning.yaml
│   │       │   │   │   ├── okvqa_qa.yaml
│   │       │   │   │   ├── snlive_classification_test.yaml
│   │       │   │   │   ├── snlive_classification_val.yaml
│   │       │   │   │   └── vizwiz_qa.yaml
│   │       │   │   ├── image_no_init/
│   │       │   │   │   ├── coco_captioning_test.yaml
│   │       │   │   │   ├── coco_captioning_val.yaml
│   │       │   │   │   ├── flickr30k_captioning.yaml
│   │       │   │   │   ├── gqa_qa.yaml
│   │       │   │   │   ├── gqa_qa_val.yaml
│   │       │   │   │   ├── nocaps_captioning.yaml
│   │       │   │   │   ├── nocaps_out_domain_captioning.yaml
│   │       │   │   │   ├── okvqa_qa.yaml
│   │       │   │   │   ├── snlive_classification_test.yaml
│   │       │   │   │   ├── snlive_classification_val.yaml
│   │       │   │   │   └── vizwiz_qa.yaml
│   │       │   │   ├── image_pre_coco/
│   │       │   │   │   ├── coco_captioning_test.yaml
│   │       │   │   │   ├── coco_captioning_val.yaml
│   │       │   │   │   ├── flickr30k_captioning.yaml
│   │       │   │   │   ├── gqa_qa.yaml
│   │       │   │   │   ├── nocaps_captioning.yaml
│   │       │   │   │   ├── nocaps_out_domain_captioning.yaml
│   │       │   │   │   ├── okvqa_qa.yaml
│   │       │   │   │   ├── snlive_classification_test.yaml
│   │       │   │   │   ├── snlive_classification_val.yaml
│   │       │   │   │   └── vizwiz_qa.yaml
│   │       │   │   ├── image_projection_only/
│   │       │   │   │   ├── coco_captioning_test.yaml
│   │       │   │   │   ├── coco_captioning_val.yaml
│   │       │   │   │   ├── flickr30k_captioning.yaml
│   │       │   │   │   ├── gqa_qa.yaml
│   │       │   │   │   ├── gqa_qa_val.yaml
│   │       │   │   │   ├── nocaps_captioning.yaml
│   │       │   │   │   ├── nocaps_out_domain_captioning.yaml
│   │       │   │   │   ├── okvqa_qa.yaml
│   │       │   │   │   ├── snlive_classification_test.yaml
│   │       │   │   │   ├── snlive_classification_val.yaml
│   │       │   │   │   └── vizwiz_qa.yaml
│   │       │   │   ├── pc/
│   │       │   │   │   ├── modelnet40_classification.yaml
│   │       │   │   │   ├── modelnet40_completion.yaml
│   │       │   │   │   ├── objaverse_captioning.yaml
│   │       │   │   │   └── objaverse_qa.yaml
│   │       │   │   ├── pc_no_init/
│   │       │   │   │   ├── modelnet40_classification.yaml
│   │       │   │   │   ├── modelnet40_completion.yaml
│   │       │   │   │   ├── objaverse_captioning.yaml
│   │       │   │   │   └── objaverse_qa.yaml
│   │       │   │   ├── pc_projection_only/
│   │       │   │   │   ├── modelnet40_classification.yaml
│   │       │   │   │   ├── modelnet40_completion.yaml
│   │       │   │   │   ├── objaverse_captioning.yaml
│   │       │   │   │   └── objaverse_qa.yaml
│   │       │   │   ├── pc_ulip1/
│   │       │   │   │   ├── modelnet40_classification.yaml
│   │       │   │   │   ├── modelnet40_completion.yaml
│   │       │   │   │   ├── objaverse_captioning.yaml
│   │       │   │   │   └── objaverse_qa.yaml
│   │       │   │   ├── pc_ulip2_scaled_up/
│   │       │   │   │   ├── modelnet40_classification.yaml
│   │       │   │   │   ├── modelnet40_completion.yaml
│   │       │   │   │   ├── objaverse_captioning.yaml
│   │       │   │   │   └── objaverse_qa.yaml
│   │       │   │   ├── pc_ulip_objaverse/
│   │       │   │   │   ├── modelnet40_classification.yaml
│   │       │   │   │   ├── modelnet40_completion.yaml
│   │       │   │   │   ├── objaverse_captioning.yaml
│   │       │   │   │   └── objaverse_qa.yaml
│   │       │   │   ├── pc_ulip_objaverse_shapenet/
│   │       │   │   │   ├── modelnet40_classification.yaml
│   │       │   │   │   ├── modelnet40_completion.yaml
│   │       │   │   │   ├── objaverse_captioning.yaml
│   │       │   │   │   └── objaverse_qa.yaml
│   │       │   │   ├── pc_ulip_shapenet/
│   │       │   │   │   ├── modelnet40_classification.yaml
│   │       │   │   │   ├── modelnet40_completion.yaml
│   │       │   │   │   ├── objaverse_captioning.yaml
│   │       │   │   │   └── objaverse_qa.yaml
│   │       │   │   ├── video/
│   │       │   │   │   ├── msrvtt_captioning_test.yaml
│   │       │   │   │   ├── msrvtt_captioning_val.yaml
│   │       │   │   │   ├── msrvtt_qa_test.yaml
│   │       │   │   │   ├── msrvtt_qa_val.yaml
│   │       │   │   │   ├── msvd_captioning.yaml
│   │       │   │   │   ├── msvd_qa.yaml
│   │       │   │   │   └── vatex_captioning.yaml
│   │       │   │   ├── video_image/
│   │       │   │   │   ├── msvd_captioning.yaml
│   │       │   │   │   ├── msvd_qa.yaml
│   │       │   │   │   └── vatex_captioning.yaml
│   │       │   │   ├── video_image_pre_coco/
│   │       │   │   │   ├── msvd_captioning.yaml
│   │       │   │   │   ├── msvd_qa.yaml
│   │       │   │   │   └── vatex_captioning.yaml
│   │       │   │   └── video_no_upsample/
│   │       │   │       ├── msrvtt_captioning_test.yaml
│   │       │   │       ├── msrvtt_captioning_val.yaml
│   │       │   │       ├── msrvtt_qa_test.yaml
│   │       │   │       ├── msrvtt_qa_val.yaml
│   │       │   │       ├── msvd_captioning.yaml
│   │       │   │       ├── msvd_captioning_up.yaml
│   │       │   │       ├── msvd_qa.yaml
│   │       │   │       ├── msvd_qa_up.yaml
│   │       │   │       ├── vatex_captioning.yaml
│   │       │   │       └── vatex_captioning_up.yaml
│   │       │   └── vicuna7b_nocue/
│   │       │       ├── audio/
│   │       │       │   ├── audiocaps_captioning_qa.yaml
│   │       │       │   ├── audiocaps_captioning_test.yaml
│   │       │       │   ├── audiocaps_captioning_val.yaml
│   │       │       │   ├── clothoQA_captioning.yaml
│   │       │       │   ├── clothov1_captioning.yaml
│   │       │       │   ├── clothov2_captioning.yaml
│   │       │       │   ├── esc50_classification.yaml
│   │       │       │   └── esc50_classification_completion.yaml
│   │       │       ├── crossmodal/
│   │       │       │   ├── musicavqa/
│   │       │       │   │   ├── musicavqa_audio_eval.yaml
│   │       │       │   │   ├── musicavqa_joint_eval.yaml
│   │       │       │   │   └── musicavqa_video_eval.yaml
│   │       │       │   └── vatex/
│   │       │       │       ├── vatex_audio_captioning.yaml
│   │       │       │       ├── vatex_captioning.yaml
│   │       │       │       └── vatex_joint_captioning.yaml
│   │       │       ├── image/
│   │       │       │   ├── coco_captioning_test.yaml
│   │       │       │   ├── coco_captioning_val.yaml
│   │       │       │   ├── flickr30k_captioning.yaml
│   │       │       │   ├── gqa_qa.yaml
│   │       │       │   ├── nocaps_captioning.yaml
│   │       │       │   ├── nocaps_out_domain_captioning.yaml
│   │       │       │   ├── okvqa_qa.yaml
│   │       │       │   ├── snlive_classification_test.yaml
│   │       │       │   ├── snlive_classification_val.yaml
│   │       │       │   └── vizwiz_qa.yaml
│   │       │       ├── pc/
│   │       │       │   ├── modelnet40_classification.yaml
│   │       │       │   ├── modelnet40_completion.yaml
│   │       │       │   ├── objaverse_captioning.yaml
│   │       │       │   └── objaverse_qa.yaml
│   │       │       ├── video/
│   │       │       │   ├── msrvtt_captioning_test.yaml
│   │       │       │   ├── msrvtt_captioning_val.yaml
│   │       │       │   ├── msrvtt_qa_test.yaml
│   │       │       │   ├── msrvtt_qa_val.yaml
│   │       │       │   ├── msvd_captioning.yaml
│   │       │       │   ├── msvd_qa.yaml
│   │       │       │   └── vatex_captioning.yaml
│   │       │       └── video_image/
│   │       │           ├── msvd_captioning.yaml
│   │       │           ├── msvd_qa.yaml
│   │       │           └── vatex_captioning.yaml
│   │       ├── prompt_variation/
│   │       │   └── nocaps/
│   │       │       ├── instructblip/
│   │       │       │   ├── original.yaml
│   │       │       │   ├── template_1.yaml
│   │       │       │   ├── template_2.yaml
│   │       │       │   ├── template_3.yaml
│   │       │       │   ├── template_4.yaml
│   │       │       │   └── template_5.yaml
│   │       │       └── xinstructblip/
│   │       │           ├── template_1.yaml
│   │       │           ├── template_2.yaml
│   │       │           ├── template_3.yaml
│   │       │           ├── template_4.yaml
│   │       │           └── template_5.yaml
│   │       └── train/
│   │           ├── vicuna13b/
│   │           │   ├── audio_training.yaml
│   │           │   ├── audio_training_continue.yaml
│   │           │   ├── image_train.yaml
│   │           │   ├── image_train_continue.yaml
│   │           │   ├── pc_training.yaml
│   │           │   └── video_training.yaml
│   │           ├── vicuna7b/
│   │           │   ├── audio_training.yaml
│   │           │   ├── audio_training_improved.yaml
│   │           │   ├── audio_training_no_init.yaml
│   │           │   ├── audio_training_projection_only.yaml
│   │           │   ├── audio_training_projection_only_nocue.yaml
│   │           │   ├── image_train.yaml
│   │           │   ├── image_train_improved.yaml
│   │           │   ├── image_train_no_init.yaml
│   │           │   ├── image_train_projection_only.yaml
│   │           │   ├── lora_training.yaml
│   │           │   ├── pc_training.yaml
│   │           │   ├── pc_training_improved.yaml
│   │           │   ├── pc_training_no_init.yaml
│   │           │   ├── pc_training_projection_only.yaml
│   │           │   ├── pc_training_projection_only_nocue.yaml
│   │           │   ├── pc_training_scaled_up.yaml
│   │           │   ├── pc_training_ulip1.yaml
│   │           │   ├── pc_training_ulip2_objaverse_shapenet_k_1.yaml
│   │           │   ├── pc_training_ulip_objaverse.yaml
│   │           │   ├── pc_training_ulip_shapenet.yaml
│   │           │   ├── video_training.yaml
│   │           │   └── video_training_no_msrvtt_upsample.yaml
│   │           └── vicuna7b_nocue/
│   │               ├── audio_training.yaml
│   │               ├── image_train.yaml
│   │               ├── pc_training.yaml
│   │               └── video_training.yaml
│   ├── runners/
│   │   ├── __init__.py
│   │   ├── runner_base.py
│   │   └── runner_iter.py
│   └── tasks/
│       ├── __init__.py
│       ├── base_task.py
│       ├── captioning.py
│       ├── dialogue.py
│       ├── image_text_pretrain.py
│       ├── multimodal_classification.py
│       ├── retrieval.py
│       ├── text_to_image_generation.py
│       ├── vqa.py
│       └── vqa_reading_comprehension.py
├── projects/
│   ├── blip-diffusion/
│   │   ├── README.md
│   │   └── notebooks/
│   │       ├── editing_real_finetuned.ipynb
│   │       ├── editing_real_zeroshot.ipynb
│   │       ├── editing_synthetic_zeroshot.ipynb
│   │       ├── editing_tryon_zeroshot.ipynb
│   │       ├── generation_finetuned_dog.ipynb
│   │       ├── generation_zeroshot.ipynb
│   │       └── stylization.ipynb
│   ├── blip2/
│   │   └── README.md
│   ├── img2llm-vqa/
│   │   ├── README.md
│   │   ├── img2llm_vqa.ipynb
│   │   └── img2llm_vqa.py
│   ├── img2prompt-vqa/
│   │   └── README.md
│   ├── instructblip/
│   │   ├── README.md
│   │   └── run_demo.py
│   ├── pnp-vqa/
│   │   ├── README.md
│   │   └── pnp_vqa.ipynb
│   └── xinstructblip/
│       ├── README.md
│       ├── data_aug/
│       │   ├── 3d_qa_data_generation.py
│       │   └── audio_qa_data_generation.py
│       ├── demo/
│       │   ├── configs/
│       │   │   ├── vicuna13b.yaml
│       │   │   ├── vicuna7b.yaml
│       │   │   ├── vicuna7b_blip_init.yaml
│       │   │   ├── vicuna7b_no_init.yaml
│       │   │   ├── vicuna7b_nocue.yaml
│       │   │   ├── vicuna7b_projection.yaml
│       │   │   ├── vicuna7b_rand.yaml
│       │   │   └── vicuna7b_v2.yaml
│       │   ├── demo.ipynb
│       │   ├── examples/
│       │   │   └── point_cloud/
│       │   │       └── banana.glb
│       │   └── run_demo.py
│       ├── discrn/
│       │   ├── caption_baseline/
│       │   │   ├── predict_audio.py
│       │   │   ├── predict_image.py
│       │   │   ├── predict_pc.py
│       │   │   ├── predict_video.py
│       │   │   └── render_images.py
│       │   └── data_generation/
│       │       ├── audiocaps_video_audio.py
│       │       └── objaverse_img_3d.py
│       └── modelnet_baseline/
│           └── render_images.py
├── pyproject.toml
├── requirements.txt
├── run_scripts/
│   ├── albef/
│   │   ├── eval/
│   │   │   ├── eval_albef_nlvr.sh
│   │   │   ├── eval_albef_ve.sh
│   │   │   ├── eval_coco_retrieval.sh
│   │   │   ├── eval_flickr30k_retrieval.sh
│   │   │   ├── test_albef_vqa.sh
│   │   │   └── val_albef_vqa.sh
│   │   └── train/
│   │       ├── pretrain.sh
│   │       ├── train_aokvqa_albef.sh
│   │       ├── train_coco_retrieval_albef.sh
│   │       ├── train_flickr30k_retrieval_albef.sh
│   │       ├── train_nlvr_albef.sh
│   │       ├── train_okvqa_albef.sh
│   │       ├── train_ve_albef.sh
│   │       └── train_vqa_albef.sh
│   ├── alpro/
│   │   ├── eval/
│   │   │   ├── eval_didemo_ret.sh
│   │   │   ├── eval_msrvtt_qa.sh
│   │   │   ├── eval_msrvtt_ret.sh
│   │   │   └── eval_msvd_qa.sh
│   │   └── train/
│   │       ├── train_didemo_ret.sh
│   │       ├── train_msrvtt_qa.sh
│   │       ├── train_msrvtt_ret.sh
│   │       └── train_msvd_qa.sh
│   ├── blip/
│   │   ├── eval/
│   │   │   ├── eval_aokvqa.sh
│   │   │   ├── eval_coco_cap.sh
│   │   │   ├── eval_coco_cap_large.sh
│   │   │   ├── eval_nlvr.sh
│   │   │   ├── eval_nocaps.sh
│   │   │   ├── eval_okvqa.sh
│   │   │   ├── eval_ret_coco.sh
│   │   │   ├── eval_ret_flickr.sh
│   │   │   └── validate_vqa.sh
│   │   └── train/
│   │       ├── pretrain.sh
│   │       ├── train_aokvqa.sh
│   │       ├── train_caption_coco.sh
│   │       ├── train_caption_coco_large.sh
│   │       ├── train_caption_coco_large_iters.sh
│   │       ├── train_nlvr.sh
│   │       ├── train_okvqa.sh
│   │       ├── train_retrieval_coco.sh
│   │       ├── train_retrieval_flickr.sh
│   │       └── train_vqa.sh
│   ├── blip-diffusion/
│   │   ├── train_db.sh
│   │   ├── train_db_dog.sh
│   │   ├── train_db_jacket_s.sh
│   │   ├── train_db_pink_dress.sh
│   │   └── train_db_shein_jacket.sh
│   ├── blip2/
│   │   ├── eval/
│   │   │   ├── eval_cap_coco_flant5xl.sh
│   │   │   ├── eval_cap_coco_opt2.7b.sh
│   │   │   ├── eval_cap_coco_opt6.7b.sh
│   │   │   ├── eval_gqa_zeroshot_flant5xl.sh
│   │   │   ├── eval_okvqa_zeroshot_flant5xl.sh
│   │   │   ├── eval_ret_coco.sh
│   │   │   ├── eval_ret_flickr.sh
│   │   │   ├── validate_vqa_zeroshot_flant5xl.sh
│   │   │   └── validate_vqa_zeroshot_opt.sh
│   │   └── train/
│   │       ├── pretrain_stage1.sh
│   │       ├── pretrain_stage2.sh
│   │       ├── train_caption_coco.sh
│   │       └── train_retrieval_coco.sh
│   ├── clip/
│   │   └── eval/
│   │       ├── eval_clip_ret_coco.sh
│   │       ├── eval_clip_ret_flickr.sh
│   │       └── eval_clip_zs_imnet.sh
│   ├── gpt/
│   │   ├── eval/
│   │   │   └── eval_video_dialogue_avsd.sh
│   │   └── train/
│   │       └── train_video_dialogue_avsd.sh
│   ├── pnp-vqa/
│   │   └── eval/
│   │       ├── eval_gqa.sh
│   │       ├── eval_gqa_3b.sh
│   │       ├── eval_gqa_large.sh
│   │       ├── eval_okvqa.sh
│   │       ├── eval_okvqa_3b.sh
│   │       ├── eval_okvqa_large.sh
│   │       ├── eval_vqav2.sh
│   │       ├── eval_vqav2_3b.sh
│   │       ├── eval_vqav2_large.sh
│   │       ├── eval_vqav2_test.sh
│   │       ├── eval_vqav2_test_3b.sh
│   │       └── eval_vqav2_test_large.sh
│   ├── run_browser.sh
│   └── run_demo.sh
├── setup.py
├── tests/
│   └── models/
│       ├── test_albef.py
│       ├── test_blip.py
│       ├── test_blip2.py
│       └── test_pnp_vqa.py
└── train.py

SYMBOL INDEX (4547 symbols across 475 files)

FILE: app/__init__.py
  function load_demo_image (line 16) | def load_demo_image():

FILE: app/calculate_coco_features.py
  function load_demo_image (line 22) | def load_demo_image():
  function read_img (line 31) | def read_img(filepath):

FILE: app/caption.py
  function app (line 15) | def app():
  function generate_caption (line 72) | def generate_caption(

FILE: app/classification.py
  function load_demo_image (line 23) | def load_demo_image(img_url=None):
  function load_model_cache (line 38) | def load_model_cache(model_type, device):
  function app (line 63) | def app():

FILE: app/dataset_browser.py
  function sample_dataset (line 26) | def sample_dataset(dataset, indices):
  function get_concat_v (line 32) | def get_concat_v(im1, im2):
  function resize_img_w (line 43) | def resize_img_w(raw_img, new_w=224):
  function get_visual_key (line 58) | def get_visual_key(dataset):
  function gather_items (line 69) | def gather_items(samples, exclude=[]):
  function load_dataset_cache (line 84) | def load_dataset_cache(name):
  function format_text (line 88) | def format_text(text):
  function show_samples (line 94) | def show_samples(dataset, offset=0, is_next=False):

FILE: app/image_text_match.py
  function app (line 19) | def app():

FILE: app/multimodal_search.py
  function load_feat (line 34) | def load_feat():
  function load_feature_extractor_model (line 61) | def load_feature_extractor_model(device):
  function app (line 72) | def app():
  function read_and_process_images (line 183) | def read_and_process_images(image_paths, vis_processor):
  function compute_gradcam_batch (line 191) | def compute_gradcam_batch(model, visual_input, text_input, tokenized_tex...

FILE: app/multipage.py
  class MultiPage (line 17) | class MultiPage:
    method __init__ (line 20) | def __init__(self) -> None:
    method add_page (line 24) | def add_page(self, title, func) -> None:
    method run (line 34) | def run(self):

FILE: app/text_localization.py
  function app (line 20) | def app():

FILE: app/utils.py
  function resize_img (line 18) | def resize_img(raw_img):
  function read_img (line 25) | def read_img(filepath):
  function load_model_cache (line 39) | def load_model_cache(name, model_type, is_eval, device):
  function init_bert_tokenizer (line 44) | def init_bert_tokenizer():
  function getAttMap (line 49) | def getAttMap(img, attMap, blur=True, overlap=True):
  function load_blip_itm_model (line 77) | def load_blip_itm_model(device, model_type="base"):

FILE: app/vqa.py
  function app (line 15) | def app():

FILE: evaluate.py
  function parse_args (line 33) | def parse_args():
  function setup_seeds (line 52) | def setup_seeds(config):
  function main (line 63) | def main():

FILE: lavis/common/annotator/canny/__init__.py
  class CannyDetector (line 4) | class CannyDetector:
    method __call__ (line 5) | def __call__(self, img, low_threshold, high_threshold):

FILE: lavis/common/annotator/hed/__init__.py
  class Network (line 9) | class Network(torch.nn.Module):
    method __init__ (line 10) | def __init__(self, model_path):
    method forward (line 71) | def forward(self, tenInput):
  class HEDdetector (line 96) | class HEDdetector:
    method __init__ (line 97) | def __init__(self):
    method __call__ (line 105) | def __call__(self, input_image):
  function nms (line 117) | def nms(x, t, s):

FILE: lavis/common/annotator/midas/__init__.py
  class MidasDetector (line 9) | class MidasDetector:
    method __init__ (line 10) | def __init__(self):
    method __call__ (line 13) | def __call__(self, input_image, a=np.pi * 2.0, bg_th=0.1):

FILE: lavis/common/annotator/midas/api.py
  function disabled_train (line 26) | def disabled_train(self, mode=True):
  function load_midas_transform (line 32) | def load_midas_transform(model_type):
  function load_model (line 77) | def load_model(model_type):
  class MiDaSInference (line 145) | class MiDaSInference(nn.Module):
    method __init__ (line 158) | def __init__(self, model_type):
    method forward (line 165) | def forward(self, x):

FILE: lavis/common/annotator/midas/midas/base_model.py
  class BaseModel (line 4) | class BaseModel(torch.nn.Module):
    method load (line 5) | def load(self, path):

FILE: lavis/common/annotator/midas/midas/blocks.py
  function _make_encoder (line 11) | def _make_encoder(backbone, features, use_pretrained, groups=1, expand=F...
  function _make_scratch (line 49) | def _make_scratch(in_shape, out_shape, groups=1, expand=False):
  function _make_pretrained_efficientnet_lite3 (line 78) | def _make_pretrained_efficientnet_lite3(use_pretrained, exportable=False):
  function _make_efficientnet_backbone (line 88) | def _make_efficientnet_backbone(effnet):
  function _make_resnet_backbone (line 101) | def _make_resnet_backbone(resnet):
  function _make_pretrained_resnext101_wsl (line 114) | def _make_pretrained_resnext101_wsl(use_pretrained):
  class Interpolate (line 120) | class Interpolate(nn.Module):
    method __init__ (line 124) | def __init__(self, scale_factor, mode, align_corners=False):
    method forward (line 138) | def forward(self, x):
  class ResidualConvUnit (line 155) | class ResidualConvUnit(nn.Module):
    method __init__ (line 159) | def __init__(self, features):
    method forward (line 177) | def forward(self, x):
  class FeatureFusionBlock (line 194) | class FeatureFusionBlock(nn.Module):
    method __init__ (line 198) | def __init__(self, features):
    method forward (line 209) | def forward(self, *xs):
  class ResidualConvUnit_custom (line 231) | class ResidualConvUnit_custom(nn.Module):
    method __init__ (line 235) | def __init__(self, features, activation, bn):
    method forward (line 263) | def forward(self, x):
  class FeatureFusionBlock_custom (line 291) | class FeatureFusionBlock_custom(nn.Module):
    method __init__ (line 295) | def __init__(self, features, activation, deconv=False, bn=False, expan...
    method forward (line 320) | def forward(self, *xs):

FILE: lavis/common/annotator/midas/midas/dpt_depth.py
  function _make_fusion_block (line 15) | def _make_fusion_block(features, use_bn):
  class DPT (line 26) | class DPT(BaseModel):
    method __init__ (line 27) | def __init__(
    method forward (line 67) | def forward(self, x):
  class DPTDepthModel (line 88) | class DPTDepthModel(DPT):
    method __init__ (line 89) | def __init__(self, path=None, non_negative=True, **kwargs):
    method forward (line 107) | def forward(self, x):

FILE: lavis/common/annotator/midas/midas/midas_net.py
  class MidasNet (line 12) | class MidasNet(BaseModel):
    method __init__ (line 16) | def __init__(self, path=None, features=256, non_negative=True):
    method forward (line 49) | def forward(self, x):

FILE: lavis/common/annotator/midas/midas/midas_net_custom.py
  class MidasNet_small (line 12) | class MidasNet_small(BaseModel):
    method __init__ (line 16) | def __init__(self, path=None, features=64, backbone="efficientnet_lite...
    method forward (line 73) | def forward(self, x):
  function fuse_model (line 109) | def fuse_model(m):

FILE: lavis/common/annotator/midas/midas/transforms.py
  function apply_min_size (line 6) | def apply_min_size(sample, size, image_interpolation_method=cv2.INTER_AR...
  class Resize (line 48) | class Resize(object):
    method __init__ (line 52) | def __init__(
    method constrain_to_multiple_of (line 94) | def constrain_to_multiple_of(self, x, min_val=0, max_val=None):
    method get_size (line 105) | def get_size(self, width, height):
    method __call__ (line 162) | def __call__(self, sample):
  class NormalizeImage (line 197) | class NormalizeImage(object):
    method __init__ (line 201) | def __init__(self, mean, std):
    method __call__ (line 205) | def __call__(self, sample):
  class PrepareForNet (line 211) | class PrepareForNet(object):
    method __init__ (line 215) | def __init__(self):
    method __call__ (line 218) | def __call__(self, sample):

FILE: lavis/common/annotator/midas/midas/vit.py
  class Slice (line 9) | class Slice(nn.Module):
    method __init__ (line 10) | def __init__(self, start_index=1):
    method forward (line 14) | def forward(self, x):
  class AddReadout (line 18) | class AddReadout(nn.Module):
    method __init__ (line 19) | def __init__(self, start_index=1):
    method forward (line 23) | def forward(self, x):
  class ProjectReadout (line 31) | class ProjectReadout(nn.Module):
    method __init__ (line 32) | def __init__(self, in_features, start_index=1):
    method forward (line 38) | def forward(self, x):
  class Transpose (line 45) | class Transpose(nn.Module):
    method __init__ (line 46) | def __init__(self, dim0, dim1):
    method forward (line 51) | def forward(self, x):
  function forward_vit (line 56) | def forward_vit(pretrained, x):
  function _resize_pos_embed (line 100) | def _resize_pos_embed(self, posemb, gs_h, gs_w):
  function forward_flex (line 117) | def forward_flex(self, x):
  function get_activation (line 159) | def get_activation(name):
  function get_readout_oper (line 166) | def get_readout_oper(vit_features, features, use_readout, start_index=1):
  function _make_vit_b16_backbone (line 183) | def _make_vit_b16_backbone(
  function _make_pretrained_vitl16_384 (line 297) | def _make_pretrained_vitl16_384(pretrained, use_readout="ignore", hooks=...
  function _make_pretrained_vitb16_384 (line 310) | def _make_pretrained_vitb16_384(pretrained, use_readout="ignore", hooks=...
  function _make_pretrained_deitb16_384 (line 319) | def _make_pretrained_deitb16_384(pretrained, use_readout="ignore", hooks...
  function _make_pretrained_deitb16_distil_384 (line 328) | def _make_pretrained_deitb16_distil_384(pretrained, use_readout="ignore"...
  function _make_vit_b_rn50_backbone (line 343) | def _make_vit_b_rn50_backbone(
  function _make_pretrained_vitb_rn50_384 (line 478) | def _make_pretrained_vitb_rn50_384(

FILE: lavis/common/annotator/midas/utils.py
  function read_pfm (line 9) | def read_pfm(path):
  function write_pfm (line 58) | def write_pfm(path, image, scale=1):
  function read_image (line 97) | def read_image(path):
  function resize_image (line 116) | def resize_image(img):
  function resize_depth (line 146) | def resize_depth(depth, width, height):
  function write_depth (line 165) | def write_depth(path, depth, bits=1):

FILE: lavis/common/annotator/mlsd/__init__.py
  class MLSDdetector (line 17) | class MLSDdetector:
    method __init__ (line 18) | def __init__(self):
    method __call__ (line 27) | def __call__(self, input_image, thr_v, thr_d):

FILE: lavis/common/annotator/mlsd/models/mbv2_mlsd_large.py
  class BlockTypeA (line 9) | class BlockTypeA(nn.Module):
    method __init__ (line 10) | def __init__(self, in_c1, in_c2, out_c1, out_c2, upscale = True):
    method forward (line 24) | def forward(self, a, b):
  class BlockTypeB (line 32) | class BlockTypeB(nn.Module):
    method __init__ (line 33) | def __init__(self, in_c, out_c):
    method forward (line 46) | def forward(self, x):
  class BlockTypeC (line 51) | class BlockTypeC(nn.Module):
    method __init__ (line 52) | def __init__(self, in_c, out_c):
    method forward (line 66) | def forward(self, x):
  function _make_divisible (line 72) | def _make_divisible(v, divisor, min_value=None):
  class ConvBNReLU (line 92) | class ConvBNReLU(nn.Sequential):
    method __init__ (line 93) | def __init__(self, in_planes, out_planes, kernel_size=3, stride=1, gro...
    method forward (line 112) | def forward(self, x):
  class InvertedResidual (line 124) | class InvertedResidual(nn.Module):
    method __init__ (line 125) | def __init__(self, inp, oup, stride, expand_ratio):
    method forward (line 146) | def forward(self, x):
  class MobileNetV2 (line 153) | class MobileNetV2(nn.Module):
    method __init__ (line 154) | def __init__(self, pretrained=True):
    method _forward_impl (line 218) | def _forward_impl(self, x):
    method forward (line 233) | def forward(self, x):
    method _load_pretrained_model (line 236) | def _load_pretrained_model(self):
  class MobileV2_MLSD_Large (line 247) | class MobileV2_MLSD_Large(nn.Module):
    method __init__ (line 248) | def __init__(self):
    method forward (line 275) | def forward(self, x):

FILE: lavis/common/annotator/mlsd/models/mbv2_mlsd_tiny.py
  class BlockTypeA (line 9) | class BlockTypeA(nn.Module):
    method __init__ (line 10) | def __init__(self, in_c1, in_c2, out_c1, out_c2, upscale = True):
    method forward (line 24) | def forward(self, a, b):
  class BlockTypeB (line 31) | class BlockTypeB(nn.Module):
    method __init__ (line 32) | def __init__(self, in_c, out_c):
    method forward (line 45) | def forward(self, x):
  class BlockTypeC (line 50) | class BlockTypeC(nn.Module):
    method __init__ (line 51) | def __init__(self, in_c, out_c):
    method forward (line 65) | def forward(self, x):
  function _make_divisible (line 71) | def _make_divisible(v, divisor, min_value=None):
  class ConvBNReLU (line 91) | class ConvBNReLU(nn.Sequential):
    method __init__ (line 92) | def __init__(self, in_planes, out_planes, kernel_size=3, stride=1, gro...
    method forward (line 111) | def forward(self, x):
  class InvertedResidual (line 123) | class InvertedResidual(nn.Module):
    method __init__ (line 124) | def __init__(self, inp, oup, stride, expand_ratio):
    method forward (line 145) | def forward(self, x):
  class MobileNetV2 (line 152) | class MobileNetV2(nn.Module):
    method __init__ (line 153) | def __init__(self, pretrained=True):
    method _forward_impl (line 218) | def _forward_impl(self, x):
    method forward (line 233) | def forward(self, x):
    method _load_pretrained_model (line 236) | def _load_pretrained_model(self):
  class MobileV2_MLSD_Tiny (line 247) | class MobileV2_MLSD_Tiny(nn.Module):
    method __init__ (line 248) | def __init__(self):
    method forward (line 263) | def forward(self, x):

FILE: lavis/common/annotator/mlsd/utils.py
  function deccode_output_score_and_ptss (line 19) | def deccode_output_score_and_ptss(tpMap, topk_n = 200, ksize = 5):
  function pred_lines (line 47) | def pred_lines(image, model,
  function pred_squares (line 89) | def pred_squares(image,

FILE: lavis/common/annotator/openpose/__init__.py
  class OpenposeDetector (line 16) | class OpenposeDetector:
    method __init__ (line 17) | def __init__(self):
    method __call__ (line 29) | def __call__(self, oriImg, hand=False):

FILE: lavis/common/annotator/openpose/body.py
  class Body (line 14) | class Body(object):
    method __init__ (line 15) | def __init__(self, model_path):
    method __call__ (line 24) | def __call__(self, oriImg):

FILE: lavis/common/annotator/openpose/hand.py
  class Hand (line 15) | class Hand(object):
    method __init__ (line 16) | def __init__(self, model_path):
    method __call__ (line 25) | def __call__(self, oriImg):

FILE: lavis/common/annotator/openpose/model.py
  function make_layers (line 7) | def make_layers(block, no_relu_layers):
  class bodypose_model (line 24) | class bodypose_model(nn.Module):
    method __init__ (line 25) | def __init__(self):
    method forward (line 114) | def forward(self, x):
  class handpose_model (line 143) | class handpose_model(nn.Module):
    method __init__ (line 144) | def __init__(self):
    method forward (line 204) | def forward(self, x):

FILE: lavis/common/annotator/openpose/util.py
  function padRightDownCorner (line 7) | def padRightDownCorner(img, stride, padValue):
  function transfer (line 30) | def transfer(model, model_weights):
  function draw_bodypose (line 37) | def draw_bodypose(canvas, candidate, subset):
  function draw_handpose (line 74) | def draw_handpose(canvas, all_hand_peaks, show_number=False):
  function handDetect (line 94) | def handDetect(candidate, subset, oriImg):
  function npmax (line 159) | def npmax(array):

FILE: lavis/common/annotator/uniformer/__init__.py
  class UniformerDetector (line 11) | class UniformerDetector:
    method __init__ (line 12) | def __init__(self):
    method __call__ (line 20) | def __call__(self, img):

FILE: lavis/common/annotator/uniformer/mmcv/arraymisc/quantization.py
  function quantize (line 5) | def quantize(arr, min_val, max_val, levels, dtype=np.int64):
  function dequantize (line 32) | def dequantize(arr, min_val, max_val, levels, dtype=np.float64):

FILE: lavis/common/annotator/uniformer/mmcv/cnn/alexnet.py
  class AlexNet (line 7) | class AlexNet(nn.Module):
    method __init__ (line 14) | def __init__(self, num_classes=-1):
    method init_weights (line 43) | def init_weights(self, pretrained=None):
    method forward (line 54) | def forward(self, x):

FILE: lavis/common/annotator/uniformer/mmcv/cnn/bricks/activation.py
  class Clamp (line 18) | class Clamp(nn.Module):
    method __init__ (line 31) | def __init__(self, min=-1., max=1.):
    method forward (line 36) | def forward(self, x):
  class GELU (line 48) | class GELU(nn.Module):
    method forward (line 70) | def forward(self, input):
  function build_activation_layer (line 81) | def build_activation_layer(cfg):

FILE: lavis/common/annotator/uniformer/mmcv/cnn/bricks/context_block.py
  function last_zero_init (line 9) | def last_zero_init(m):
  class ContextBlock (line 17) | class ContextBlock(nn.Module):
    method __init__ (line 36) | def __init__(self,
    method reset_parameters (line 75) | def reset_parameters(self):
    method spatial_pool (line 85) | def spatial_pool(self, x):
    method forward (line 111) | def forward(self, x):

FILE: lavis/common/annotator/uniformer/mmcv/cnn/bricks/conv.py
  function build_conv_layer (line 12) | def build_conv_layer(cfg, *args, **kwargs):

FILE: lavis/common/annotator/uniformer/mmcv/cnn/bricks/conv2d_adaptive_padding.py
  class Conv2dAdaptivePadding (line 11) | class Conv2dAdaptivePadding(nn.Conv2d):
    method __init__ (line 33) | def __init__(self,
    method forward (line 45) | def forward(self, x):

FILE: lavis/common/annotator/uniformer/mmcv/cnn/bricks/conv_module.py
  class ConvModule (line 16) | class ConvModule(nn.Module):
    method __init__ (line 70) | def __init__(self,
    method norm (line 169) | def norm(self):
    method init_weights (line 175) | def init_weights(self):
    method forward (line 196) | def forward(self, x, activate=True, norm=True):

FILE: lavis/common/annotator/uniformer/mmcv/cnn/bricks/conv_ws.py
  function conv_ws_2d (line 9) | def conv_ws_2d(input,
  class ConvWS2d (line 26) | class ConvWS2d(nn.Conv2d):
    method __init__ (line 28) | def __init__(self,
    method forward (line 49) | def forward(self, x):
  class ConvAWS2d (line 55) | class ConvAWS2d(nn.Conv2d):
    method __init__ (line 78) | def __init__(self,
    method _get_weight (line 101) | def _get_weight(self, weight):
    method forward (line 109) | def forward(self, x):
    method _load_from_state_dict (line 114) | def _load_from_state_dict(self, state_dict, prefix, local_metadata, st...

FILE: lavis/common/annotator/uniformer/mmcv/cnn/bricks/depthwise_separable_conv_module.py
  class DepthwiseSeparableConvModule (line 7) | class DepthwiseSeparableConvModule(nn.Module):
    method __init__ (line 48) | def __init__(self,
    method forward (line 93) | def forward(self, x):

FILE: lavis/common/annotator/uniformer/mmcv/cnn/bricks/drop.py
  function drop_path (line 9) | def drop_path(x, drop_prob=0., training=False):
  class DropPath (line 28) | class DropPath(nn.Module):
    method __init__ (line 39) | def __init__(self, drop_prob=0.1):
    method forward (line 43) | def forward(self, x):
  class Dropout (line 48) | class Dropout(nn.Dropout):
    method __init__ (line 59) | def __init__(self, drop_prob=0.5, inplace=False):
  function build_dropout (line 63) | def build_dropout(cfg, default_args=None):

FILE: lavis/common/annotator/uniformer/mmcv/cnn/bricks/generalized_attention.py
  class GeneralizedAttention (line 14) | class GeneralizedAttention(nn.Module):
    method __init__ (line 47) | def __init__(self,
    method get_position_embedding (line 166) | def get_position_embedding(self,
    method forward (line 216) | def forward(self, x_input):
    method init_weights (line 403) | def init_weights(self):

FILE: lavis/common/annotator/uniformer/mmcv/cnn/bricks/hsigmoid.py
  class HSigmoid (line 8) | class HSigmoid(nn.Module):
    method __init__ (line 23) | def __init__(self, bias=1.0, divisor=2.0, min_value=0.0, max_value=1.0):
    method forward (line 31) | def forward(self, x):

FILE: lavis/common/annotator/uniformer/mmcv/cnn/bricks/hswish.py
  class HSwish (line 8) | class HSwish(nn.Module):
    method __init__ (line 24) | def __init__(self, inplace=False):
    method forward (line 28) | def forward(self, x):

FILE: lavis/common/annotator/uniformer/mmcv/cnn/bricks/non_local.py
  class _NonLocalNd (line 12) | class _NonLocalNd(nn.Module, metaclass=ABCMeta):
    method __init__ (line 35) | def __init__(self,
    method init_weights (line 99) | def init_weights(self, std=0.01, zeros_init=True):
    method gaussian (line 116) | def gaussian(self, theta_x, phi_x):
    method embedded_gaussian (line 124) | def embedded_gaussian(self, theta_x, phi_x):
    method dot_product (line 135) | def dot_product(self, theta_x, phi_x):
    method concatenation (line 143) | def concatenation(self, theta_x, phi_x):
    method forward (line 160) | def forward(self, x):
  class NonLocal1d (line 214) | class NonLocal1d(_NonLocalNd):
    method __init__ (line 226) | def __init__(self,
  class NonLocal2d (line 246) | class NonLocal2d(_NonLocalNd):
    method __init__ (line 260) | def __init__(self,
  class NonLocal3d (line 279) | class NonLocal3d(_NonLocalNd):
    method __init__ (line 291) | def __init__(self,

FILE: lavis/common/annotator/uniformer/mmcv/cnn/bricks/norm.py
  function infer_abbr (line 23) | def infer_abbr(class_type):
  function build_norm_layer (line 72) | def build_norm_layer(cfg, num_features, postfix=''):
  function is_norm (line 122) | def is_norm(layer, exclude=None):

FILE: lavis/common/annotator/uniformer/mmcv/cnn/bricks/padding.py
  function build_padding_layer (line 11) | def build_padding_layer(cfg, *args, **kwargs):

FILE: lavis/common/annotator/uniformer/mmcv/cnn/bricks/plugin.py
  function infer_abbr (line 12) | def infer_abbr(class_type):
  function build_plugin_layer (line 55) | def build_plugin_layer(cfg, postfix='', **kwargs):

FILE: lavis/common/annotator/uniformer/mmcv/cnn/bricks/scale.py
  class Scale (line 6) | class Scale(nn.Module):
    method __init__ (line 16) | def __init__(self, scale=1.0):
    method forward (line 20) | def forward(self, x):

FILE: lavis/common/annotator/uniformer/mmcv/cnn/bricks/swish.py
  class Swish (line 9) | class Swish(nn.Module):
    method __init__ (line 21) | def __init__(self):
    method forward (line 24) | def forward(self, x):

FILE: lavis/common/annotator/uniformer/mmcv/cnn/bricks/transformer.py
  function build_positional_encoding (line 33) | def build_positional_encoding(cfg, default_args=None):
  function build_attention (line 38) | def build_attention(cfg, default_args=None):
  function build_feedforward_network (line 43) | def build_feedforward_network(cfg, default_args=None):
  function build_transformer_layer (line 48) | def build_transformer_layer(cfg, default_args=None):
  function build_transformer_layer_sequence (line 53) | def build_transformer_layer_sequence(cfg, default_args=None):
  class MultiheadAttention (line 59) | class MultiheadAttention(BaseModule):
    method __init__ (line 81) | def __init__(self,
    method forward (line 112) | def forward(self,
  class FFN (line 206) | class FFN(BaseModule):
    method __init__ (line 234) | def __init__(self,
    method forward (line 269) | def forward(self, x, identity=None):
  class BaseTransformerLayer (line 283) | class BaseTransformerLayer(BaseModule):
    method __init__ (line 319) | def __init__(self,
    method forward (line 412) | def forward(self,
  class TransformerLayerSequence (line 514) | class TransformerLayerSequence(BaseModule):
    method __init__ (line 533) | def __init__(self, transformerlayers=None, num_layers=None, init_cfg=N...
    method forward (line 549) | def forward(self,

FILE: lavis/common/annotator/uniformer/mmcv/cnn/bricks/upsample.py
  class PixelShufflePack (line 13) | class PixelShufflePack(nn.Module):
    method __init__ (line 27) | def __init__(self, in_channels, out_channels, scale_factor,
    method init_weights (line 41) | def init_weights(self):
    method forward (line 44) | def forward(self, x):
  function build_upsample_layer (line 50) | def build_upsample_layer(cfg, *args, **kwargs):

FILE: lavis/common/annotator/uniformer/mmcv/cnn/bricks/wrappers.py
  function obsolete_torch_version (line 24) | def obsolete_torch_version(torch_version, version_threshold):
  class NewEmptyTensorOp (line 28) | class NewEmptyTensorOp(torch.autograd.Function):
    method forward (line 31) | def forward(ctx, x, new_shape):
    method backward (line 36) | def backward(ctx, grad):
  class Conv2d (line 42) | class Conv2d(nn.Conv2d):
    method forward (line 44) | def forward(self, x):
  class Conv3d (line 63) | class Conv3d(nn.Conv3d):
    method forward (line 65) | def forward(self, x):
  class ConvTranspose2d (line 86) | class ConvTranspose2d(nn.ConvTranspose2d):
    method forward (line 88) | def forward(self, x):
  class ConvTranspose3d (line 109) | class ConvTranspose3d(nn.ConvTranspose3d):
    method forward (line 111) | def forward(self, x):
  class MaxPool2d (line 129) | class MaxPool2d(nn.MaxPool2d):
    method forward (line 131) | def forward(self, x):
  class MaxPool3d (line 147) | class MaxPool3d(nn.MaxPool3d):
    method forward (line 149) | def forward(self, x):
  class Linear (line 166) | class Linear(torch.nn.Linear):
    method forward (line 168) | def forward(self, x):

FILE: lavis/common/annotator/uniformer/mmcv/cnn/builder.py
  function build_model_from_cfg (line 6) | def build_model_from_cfg(cfg, registry, default_args=None):

FILE: lavis/common/annotator/uniformer/mmcv/cnn/resnet.py
  function conv3x3 (line 10) | def conv3x3(in_planes, out_planes, stride=1, dilation=1):
  class BasicBlock (line 22) | class BasicBlock(nn.Module):
    method __init__ (line 25) | def __init__(self,
    method forward (line 45) | def forward(self, x):
  class Bottleneck (line 64) | class Bottleneck(nn.Module):
    method __init__ (line 67) | def __init__(self,
    method forward (line 110) | def forward(self, x):
  function make_res_layer (line 143) | def make_res_layer(block,
  class ResNet (line 181) | class ResNet(nn.Module):
    method __init__ (line 210) | def __init__(self,
    method init_weights (line 265) | def init_weights(self, pretrained=None):
    method forward (line 279) | def forward(self, x):
    method train (line 295) | def train(self, mode=True):

FILE: lavis/common/annotator/uniformer/mmcv/cnn/utils/flops_counter.py
  function get_model_complexity_info (line 36) | def get_model_complexity_info(model,
  function flops_to_string (line 118) | def flops_to_string(flops, units='GFLOPs', precision=2):
  function params_to_string (line 161) | def params_to_string(num_params, units=None, precision=2):
  function print_model_with_flops (line 198) | def print_model_with_flops(model,
  function get_model_parameters_number (line 307) | def get_model_parameters_number(model):
  function add_flops_counting_methods (line 320) | def add_flops_counting_methods(net_main_module):
  function compute_average_flops_cost (line 337) | def compute_average_flops_cost(self):
  function start_flops_count (line 355) | def start_flops_count(self):
  function stop_flops_count (line 378) | def stop_flops_count(self):
  function reset_flops_count (line 389) | def reset_flops_count(self):
  function empty_flops_counter_hook (line 400) | def empty_flops_counter_hook(module, input, output):
  function upsample_flops_counter_hook (line 404) | def upsample_flops_counter_hook(module, input, output):
  function relu_flops_counter_hook (line 413) | def relu_flops_counter_hook(module, input, output):
  function linear_flops_counter_hook (line 418) | def linear_flops_counter_hook(module, input, output):
  function pool_flops_counter_hook (line 425) | def pool_flops_counter_hook(module, input, output):
  function norm_flops_counter_hook (line 430) | def norm_flops_counter_hook(module, input, output):
  function deconv_flops_counter_hook (line 440) | def deconv_flops_counter_hook(conv_module, input, output):
  function conv_flops_counter_hook (line 467) | def conv_flops_counter_hook(conv_module, input, output):
  function batch_counter_hook (line 498) | def batch_counter_hook(module, input, output):
  function add_batch_counter_variables_or_reset (line 511) | def add_batch_counter_variables_or_reset(module):
  function add_batch_counter_hook_function (line 516) | def add_batch_counter_hook_function(module):
  function remove_batch_counter_hook_function (line 524) | def remove_batch_counter_hook_function(module):
  function add_flops_counter_variable_or_reset (line 530) | def add_flops_counter_variable_or_reset(module):
  function is_supported_instance (line 540) | def is_supported_instance(module):
  function remove_flops_counter_hook_function (line 546) | def remove_flops_counter_hook_function(module):
  function get_modules_mapping (line 553) | def get_modules_mapping():

FILE: lavis/common/annotator/uniformer/mmcv/cnn/utils/fuse_conv_bn.py
  function _fuse_conv_bn (line 6) | def _fuse_conv_bn(conv, bn):
  function fuse_conv_bn (line 27) | def fuse_conv_bn(module):

FILE: lavis/common/annotator/uniformer/mmcv/cnn/utils/sync_bn.py
  class _BatchNormXd (line 6) | class _BatchNormXd(torch.nn.modules.batchnorm._BatchNorm):
    method _check_input_dim (line 17) | def _check_input_dim(self, input):
  function revert_sync_batchnorm (line 21) | def revert_sync_batchnorm(module):

FILE: lavis/common/annotator/uniformer/mmcv/cnn/utils/weight_init.py
  function update_init_info (line 16) | def update_init_info(module, init_info):
  function constant_init (line 48) | def constant_init(module, val, bias=0):
  function xavier_init (line 55) | def xavier_init(module, gain=1, bias=0, distribution='normal'):
  function normal_init (line 66) | def normal_init(module, mean=0, std=1, bias=0):
  function trunc_normal_init (line 73) | def trunc_normal_init(module: nn.Module,
  function uniform_init (line 85) | def uniform_init(module, a=0, b=1, bias=0):
  function kaiming_init (line 92) | def kaiming_init(module,
  function caffe2_xavier_init (line 110) | def caffe2_xavier_init(module, bias=0):
  function bias_init_with_prob (line 122) | def bias_init_with_prob(prior_prob):
  function _get_bases_name (line 128) | def _get_bases_name(m):
  class BaseInit (line 132) | class BaseInit(object):
    method __init__ (line 134) | def __init__(self, *, bias=0, bias_prob=None, layer=None):
    method _get_init_info (line 157) | def _get_init_info(self):
  class ConstantInit (line 163) | class ConstantInit(BaseInit):
    method __init__ (line 175) | def __init__(self, val, **kwargs):
    method __call__ (line 179) | def __call__(self, module):
    method _get_init_info (line 194) | def _get_init_info(self):
  class XavierInit (line 200) | class XavierInit(BaseInit):
    method __init__ (line 217) | def __init__(self, gain=1, distribution='normal', **kwargs):
    method __call__ (line 222) | def __call__(self, module):
    method _get_init_info (line 237) | def _get_init_info(self):
  class NormalInit (line 244) | class NormalInit(BaseInit):
    method __init__ (line 260) | def __init__(self, mean=0, std=1, **kwargs):
    method __call__ (line 265) | def __call__(self, module):
    method _get_init_info (line 280) | def _get_init_info(self):
  class TruncNormalInit (line 287) | class TruncNormalInit(BaseInit):
    method __init__ (line 306) | def __init__(self,
    method __call__ (line 318) | def __call__(self, module: nn.Module) -> None:
    method _get_init_info (line 335) | def _get_init_info(self):
  class UniformInit (line 342) | class UniformInit(BaseInit):
    method __init__ (line 358) | def __init__(self, a=0, b=1, **kwargs):
    method __call__ (line 363) | def __call__(self, module):
    method _get_init_info (line 378) | def _get_init_info(self):
  class KaimingInit (line 385) | class KaimingInit(BaseInit):
    method __init__ (line 411) | def __init__(self,
    method __call__ (line 423) | def __call__(self, module):
    method _get_init_info (line 440) | def _get_init_info(self):
  class Caffe2XavierInit (line 448) | class Caffe2XavierInit(KaimingInit):
    method __init__ (line 451) | def __init__(self, **kwargs):
    method __call__ (line 459) | def __call__(self, module):
  class PretrainedInit (line 464) | class PretrainedInit(object):
    method __init__ (line 478) | def __init__(self, checkpoint, prefix=None, map_location=None):
    method __call__ (line 483) | def __call__(self, module):
    method _get_init_info (line 506) | def _get_init_info(self):
  function _initialize (line 511) | def _initialize(module, cfg, wholemodule=False):
  function _initialize_override (line 520) | def _initialize_override(module, override, cfg):
  function initialize (line 550) | def initialize(module, init_cfg):
  function _no_grad_trunc_normal_ (line 622) | def _no_grad_trunc_normal_(tensor: Tensor, mean: float, std: float, a: f...
  function trunc_normal_ (line 662) | def trunc_normal_(tensor: Tensor,

FILE: lavis/common/annotator/uniformer/mmcv/cnn/vgg.py
  function conv3x3 (line 9) | def conv3x3(in_planes, out_planes, dilation=1):
  function make_vgg_layer (line 19) | def make_vgg_layer(inplanes,
  class VGG (line 37) | class VGG(nn.Module):
    method __init__ (line 61) | def __init__(self,
    method init_weights (line 125) | def init_weights(self, pretrained=None):
    method forward (line 141) | def forward(self, x):
    method train (line 159) | def train(self, mode=True):

FILE: lavis/common/annotator/uniformer/mmcv/engine/test.py
  function single_gpu_test (line 15) | def single_gpu_test(model, data_loader):
  function multi_gpu_test (line 44) | def multi_gpu_test(model, data_loader, tmpdir=None, gpu_collect=False):
  function collect_results_cpu (line 91) | def collect_results_cpu(result_part, size, tmpdir=None):
  function collect_results_gpu (line 155) | def collect_results_gpu(result_part, size):

FILE: lavis/common/annotator/uniformer/mmcv/fileio/file_client.py
  class BaseStorageBackend (line 19) | class BaseStorageBackend(metaclass=ABCMeta):
    method name (line 31) | def name(self):
    method allow_symlink (line 35) | def allow_symlink(self):
    method get (line 39) | def get(self, filepath):
    method get_text (line 43) | def get_text(self, filepath):
  class CephBackend (line 47) | class CephBackend(BaseStorageBackend):
    method __init__ (line 60) | def __init__(self, path_mapping=None):
    method get (line 72) | def get(self, filepath):
    method get_text (line 81) | def get_text(self, filepath, encoding=None):
  class PetrelBackend (line 85) | class PetrelBackend(BaseStorageBackend):
    method __init__ (line 108) | def __init__(self,
    method _map_path (line 121) | def _map_path(self, filepath: Union[str, Path]) -> str:
    method _format_path (line 134) | def _format_path(self, filepath: str) -> str:
    method get (line 147) | def get(self, filepath: Union[str, Path]) -> memoryview:
    method get_text (line 164) | def get_text(self,
    method put (line 179) | def put(self, obj: bytes, filepath: Union[str, Path]) -> None:
    method put_text (line 190) | def put_text(self,
    method remove (line 204) | def remove(self, filepath: Union[str, Path]) -> None:
    method exists (line 220) | def exists(self, filepath: Union[str, Path]) -> bool:
    method isdir (line 240) | def isdir(self, filepath: Union[str, Path]) -> bool:
    method isfile (line 261) | def isfile(self, filepath: Union[str, Path]) -> bool:
    method join_path (line 281) | def join_path(self, filepath: Union[str, Path],
    method get_local_path (line 300) | def get_local_path(self, filepath: Union[str, Path]) -> Iterable[str]:
    method list_dir_or_file (line 331) | def list_dir_or_file(self,
  class MemcachedBackend (line 413) | class MemcachedBackend(BaseStorageBackend):
    method __init__ (line 423) | def __init__(self, server_list_cfg, client_cfg, sys_path=None):
    method get (line 440) | def get(self, filepath):
    method get_text (line 447) | def get_text(self, filepath, encoding=None):
  class LmdbBackend (line 451) | class LmdbBackend(BaseStorageBackend):
    method __init__ (line 469) | def __init__(self,
    method get (line 488) | def get(self, filepath):
    method get_text (line 499) | def get_text(self, filepath, encoding=None):
  class HardDiskBackend (line 503) | class HardDiskBackend(BaseStorageBackend):
    method get (line 508) | def get(self, filepath: Union[str, Path]) -> bytes:
    method get_text (line 521) | def get_text(self,
    method put (line 538) | def put(self, obj: bytes, filepath: Union[str, Path]) -> None:
    method put_text (line 553) | def put_text(self,
    method remove (line 573) | def remove(self, filepath: Union[str, Path]) -> None:
    method exists (line 581) | def exists(self, filepath: Union[str, Path]) -> bool:
    method isdir (line 592) | def isdir(self, filepath: Union[str, Path]) -> bool:
    method isfile (line 605) | def isfile(self, filepath: Union[str, Path]) -> bool:
    method join_path (line 617) | def join_path(self, filepath: Union[str, Path],
    method get_local_path (line 633) | def get_local_path(
    method list_dir_or_file (line 638) | def list_dir_or_file(self,
  class HTTPBackend (line 691) | class HTTPBackend(BaseStorageBackend):
    method get (line 694) | def get(self, filepath):
    method get_text (line 698) | def get_text(self, filepath, encoding='utf-8'):
    method get_local_path (line 703) | def get_local_path(self, filepath: str) -> Iterable[str]:
  class FileClient (line 729) | class FileClient:
    method __new__ (line 787) | def __new__(cls, backend=None, prefix=None, **kwargs):
    method name (line 823) | def name(self):
    method allow_symlink (line 827) | def allow_symlink(self):
    method parse_uri_prefix (line 831) | def parse_uri_prefix(uri: Union[str, Path]) -> Optional[str]:
    method infer_client (line 858) | def infer_client(cls,
    method _register_backend (line 886) | def _register_backend(cls, name, backend, force=False, prefixes=None):
    method register_backend (line 922) | def register_backend(cls, name, backend=None, force=False, prefixes=No...
    method get (line 976) | def get(self, filepath: Union[str, Path]) -> Union[bytes, memoryview]:
    method get_text (line 994) | def get_text(self, filepath: Union[str, Path], encoding='utf-8') -> str:
    method put (line 1007) | def put(self, obj: bytes, filepath: Union[str, Path]) -> None:
    method put_text (line 1020) | def put_text(self, obj: str, filepath: Union[str, Path]) -> None:
    method remove (line 1035) | def remove(self, filepath: Union[str, Path]) -> None:
    method exists (line 1043) | def exists(self, filepath: Union[str, Path]) -> bool:
    method isdir (line 1054) | def isdir(self, filepath: Union[str, Path]) -> bool:
    method isfile (line 1067) | def isfile(self, filepath: Union[str, Path]) -> bool:
    method join_path (line 1079) | def join_path(self, filepath: Union[str, Path],
    method get_local_path (line 1095) | def get_local_path(self, filepath: Union[str, Path]) -> Iterable[str]:
    method list_dir_or_file (line 1123) | def list_dir_or_file(self,

FILE: lavis/common/annotator/uniformer/mmcv/fileio/handlers/base.py
  class BaseFileHandler (line 5) | class BaseFileHandler(metaclass=ABCMeta):
    method load_from_fileobj (line 13) | def load_from_fileobj(self, file, **kwargs):
    method dump_to_fileobj (line 17) | def dump_to_fileobj(self, obj, file, **kwargs):
    method dump_to_str (line 21) | def dump_to_str(self, obj, **kwargs):
    method load_from_path (line 24) | def load_from_path(self, filepath, mode='r', **kwargs):
    method dump_to_path (line 28) | def dump_to_path(self, obj, filepath, mode='w', **kwargs):

FILE: lavis/common/annotator/uniformer/mmcv/fileio/handlers/json_handler.py
  function set_default (line 9) | def set_default(obj):
  class JsonHandler (line 25) | class JsonHandler(BaseFileHandler):
    method load_from_fileobj (line 27) | def load_from_fileobj(self, file):
    method dump_to_fileobj (line 30) | def dump_to_fileobj(self, obj, file, **kwargs):
    method dump_to_str (line 34) | def dump_to_str(self, obj, **kwargs):

FILE: lavis/common/annotator/uniformer/mmcv/fileio/handlers/pickle_handler.py
  class PickleHandler (line 7) | class PickleHandler(BaseFileHandler):
    method load_from_fileobj (line 11) | def load_from_fileobj(self, file, **kwargs):
    method load_from_path (line 14) | def load_from_path(self, filepath, **kwargs):
    method dump_to_str (line 18) | def dump_to_str(self, obj, **kwargs):
    method dump_to_fileobj (line 22) | def dump_to_fileobj(self, obj, file, **kwargs):
    method dump_to_path (line 26) | def dump_to_path(self, obj, filepath, **kwargs):

FILE: lavis/common/annotator/uniformer/mmcv/fileio/handlers/yaml_handler.py
  class YamlHandler (line 12) | class YamlHandler(BaseFileHandler):
    method load_from_fileobj (line 14) | def load_from_fileobj(self, file, **kwargs):
    method dump_to_fileobj (line 18) | def dump_to_fileobj(self, obj, file, **kwargs):
    method dump_to_str (line 22) | def dump_to_str(self, obj, **kwargs):

FILE: lavis/common/annotator/uniformer/mmcv/fileio/io.py
  function load (line 18) | def load(file, file_format=None, file_client_args=None, **kwargs):
  function dump (line 69) | def dump(obj, file=None, file_format=None, file_client_args=None, **kwar...
  function _register_handler (line 126) | def _register_handler(handler, file_formats):
  function register_handler (line 145) | def register_handler(file_formats, **kwargs):

FILE: lavis/common/annotator/uniformer/mmcv/fileio/parse.py
  function list_from_file (line 8) | def list_from_file(filename,
  function dict_from_file (line 55) | def dict_from_file(filename,

FILE: lavis/common/annotator/uniformer/mmcv/image/colorspace.py
  function imconvert (line 6) | def imconvert(img, src, dst):
  function bgr2gray (line 22) | def bgr2gray(img, keepdim=False):
  function rgb2gray (line 39) | def rgb2gray(img, keepdim=False):
  function gray2bgr (line 56) | def gray2bgr(img):
  function gray2rgb (line 70) | def gray2rgb(img):
  function _convert_input_type_range (line 84) | def _convert_input_type_range(img):
  function _convert_output_type_range (line 112) | def _convert_output_type_range(img, dst_type):
  function rgb2ycbcr (line 143) | def rgb2ycbcr(img, y_only=False):
  function bgr2ycbcr (line 177) | def bgr2ycbcr(img, y_only=False):
  function ycbcr2rgb (line 211) | def ycbcr2rgb(img):
  function ycbcr2bgr (line 243) | def ycbcr2bgr(img):
  function convert_color_factory (line 275) | def convert_color_factory(src, dst):

FILE: lavis/common/annotator/uniformer/mmcv/image/geometric.py
  function _scale_size (line 16) | def _scale_size(size, scale):
  function imresize (line 51) | def imresize(img,
  function imresize_to_multiple (line 98) | def imresize_to_multiple(img,
  function imresize_like (line 162) | def imresize_like(img,
  function rescale_size (line 184) | def rescale_size(old_size, scale, return_scale=False):
  function imrescale (line 221) | def imrescale(img,
  function imflip (line 252) | def imflip(img, direction='horizontal'):
  function imflip_ (line 272) | def imflip_(img, direction='horizontal'):
  function imrotate (line 292) | def imrotate(img,
  function bbox_clip (line 342) | def bbox_clip(bboxes, img_shape):
  function bbox_scaling (line 360) | def bbox_scaling(bboxes, scale, clip_shape=None):
  function imcrop (line 386) | def imcrop(img, bboxes, scale=1.0, pad_fill=None):
  function impad (line 440) | def impad(img,
  function impad_to_multiple (line 522) | def impad_to_multiple(img, divisor, pad_val=0):
  function cutout (line 538) | def cutout(img, shape, pad_val=0):
  function _get_shear_matrix (line 593) | def _get_shear_matrix(magnitude, direction='horizontal'):
  function imshear (line 611) | def imshear(img,
  function _get_translate_matrix (line 662) | def _get_translate_matrix(offset, direction='horizontal'):
  function imtranslate (line 680) | def imtranslate(img,

FILE: lavis/common/annotator/uniformer/mmcv/image/io.py
  function use_backend (line 43) | def use_backend(backend):
  function _jpegflag (line 69) | def _jpegflag(flag='color', channel_order='bgr'):
  function _pillow2array (line 85) | def _pillow2array(img, flag='color', channel_order='bgr'):
  function imread (line 140) | def imread(img_or_path, flag='color', channel_order='bgr', backend=None):
  function imfrombytes (line 203) | def imfrombytes(content, flag='color', channel_order='bgr', backend=None):
  function imwrite (line 242) | def imwrite(img, file_path, params=None, auto_mkdir=True):

FILE: lavis/common/annotator/uniformer/mmcv/image/misc.py
  function tensor2imgs (line 12) | def tensor2imgs(tensor, mean=(0, 0, 0), std=(1, 1, 1), to_rgb=True):

FILE: lavis/common/annotator/uniformer/mmcv/image/photometric.py
  function imnormalize (line 9) | def imnormalize(img, mean, std, to_rgb=True):
  function imnormalize_ (line 25) | def imnormalize_(img, mean, std, to_rgb=True):
  function imdenormalize (line 48) | def imdenormalize(img, mean, std, to_bgr=True):
  function iminvert (line 59) | def iminvert(img):
  function solarize (line 71) | def solarize(img, thr=128):
  function posterize (line 85) | def posterize(img, bits):
  function adjust_color (line 100) | def adjust_color(img, alpha=1, beta=None, gamma=0):
  function imequalize (line 131) | def imequalize(img):
  function adjust_brightness (line 176) | def adjust_brightness(img, factor=1.):
  function adjust_contrast (line 208) | def adjust_contrast(img, factor=1.):
  function auto_contrast (line 238) | def auto_contrast(img, cutoff=0):
  function adjust_sharpness (line 294) | def adjust_sharpness(img, factor=1., kernel=None):
  function adjust_lighting (line 338) | def adjust_lighting(img, eigval, eigvec, alphastd=0.1, to_rgb=True):
  function lut_transform (line 381) | def lut_transform(img, lut_table):
  function clahe (line 405) | def clahe(img, clip_limit=40.0, tile_grid_size=(8, 8)):

FILE: lavis/common/annotator/uniformer/mmcv/ops/assign_score_withk.py
  class AssignScoreWithK (line 9) | class AssignScoreWithK(Function):
    method forward (line 29) | def forward(ctx,
    method backward (line 81) | def backward(ctx, grad_out):

FILE: lavis/common/annotator/uniformer/mmcv/ops/ball_query.py
  class BallQuery (line 10) | class BallQuery(Function):
    method forward (line 14) | def forward(ctx, min_radius: float, max_radius: float, sample_num: int,
    method backward (line 51) | def backward(ctx, a=None):

FILE: lavis/common/annotator/uniformer/mmcv/ops/bbox.py
  function bbox_overlaps (line 7) | def bbox_overlaps(bboxes1, bboxes2, mode='iou', aligned=False, offset=0):

FILE: lavis/common/annotator/uniformer/mmcv/ops/border_align.py
  class BorderAlignFunction (line 16) | class BorderAlignFunction(Function):
    method symbolic (line 19) | def symbolic(g, input, boxes, pool_size):
    method forward (line 24) | def forward(ctx, input, boxes, pool_size):
    method backward (line 48) | def backward(ctx, grad_output):
  class BorderAlign (line 65) | class BorderAlign(nn.Module):
    method __init__ (line 88) | def __init__(self, pool_size):
    method forward (line 92) | def forward(self, input, boxes):
    method __repr__ (line 106) | def __repr__(self):

FILE: lavis/common/annotator/uniformer/mmcv/ops/box_iou_rotated.py
  function box_iou_rotated (line 7) | def box_iou_rotated(bboxes1, bboxes2, mode='iou', aligned=False):

FILE: lavis/common/annotator/uniformer/mmcv/ops/carafe.py
  class CARAFENaiveFunction (line 17) | class CARAFENaiveFunction(Function):
    method symbolic (line 20) | def symbolic(g, features, masks, kernel_size, group_size, scale_factor):
    method forward (line 30) | def forward(ctx, features, masks, kernel_size, group_size, scale_factor):
    method backward (line 58) | def backward(ctx, grad_output):
  class CARAFENaive (line 84) | class CARAFENaive(Module):
    method __init__ (line 86) | def __init__(self, kernel_size, group_size, scale_factor):
    method forward (line 95) | def forward(self, features, masks):
  class CARAFEFunction (line 100) | class CARAFEFunction(Function):
    method symbolic (line 103) | def symbolic(g, features, masks, kernel_size, group_size, scale_factor):
    method forward (line 113) | def forward(ctx, features, masks, kernel_size, group_size, scale_factor):
    method backward (line 147) | def backward(ctx, grad_output):
  class CARAFE (line 180) | class CARAFE(Module):
    method __init__ (line 194) | def __init__(self, kernel_size, group_size, scale_factor):
    method forward (line 203) | def forward(self, features, masks):
  class CARAFEPack (line 209) | class CARAFEPack(nn.Module):
    method __init__ (line 230) | def __init__(self,
    method init_weights (line 258) | def init_weights(self):
    method kernel_normalizer (line 264) | def kernel_normalizer(self, mask):
    method feature_reassemble (line 277) | def feature_reassemble(self, x, mask):
    method forward (line 281) | def forward(self, x):

FILE: lavis/common/annotator/uniformer/mmcv/ops/cc_attention.py
  function NEG_INF_DIAG (line 9) | def NEG_INF_DIAG(n, device):
  class CrissCrossAttention (line 19) | class CrissCrossAttention(nn.Module):
    method __init__ (line 44) | def __init__(self, in_channels):
    method forward (line 52) | def forward(self, x):
    method __repr__ (line 80) | def __repr__(self):

FILE: lavis/common/annotator/uniformer/mmcv/ops/contour_expand.py
  function contour_expand (line 10) | def contour_expand(kernel_mask, internal_kernel_label, min_kernel_area,

FILE: lavis/common/annotator/uniformer/mmcv/ops/corner_pool.py
  class TopPoolFunction (line 17) | class TopPoolFunction(Function):
    method symbolic (line 20) | def symbolic(g, input):
    method forward (line 26) | def forward(ctx, input):
    method backward (line 32) | def backward(ctx, grad_output):
  class BottomPoolFunction (line 38) | class BottomPoolFunction(Function):
    method symbolic (line 41) | def symbolic(g, input):
    method forward (line 47) | def forward(ctx, input):
    method backward (line 53) | def backward(ctx, grad_output):
  class LeftPoolFunction (line 59) | class LeftPoolFunction(Function):
    method symbolic (line 62) | def symbolic(g, input):
    method forward (line 68) | def forward(ctx, input):
    method backward (line 74) | def backward(ctx, grad_output):
  class RightPoolFunction (line 80) | class RightPoolFunction(Function):
    method symbolic (line 83) | def symbolic(g, input):
    method forward (line 89) | def forward(ctx, input):
    method backward (line 95) | def backward(ctx, grad_output):
  class CornerPool (line 101) | class CornerPool(nn.Module):
    method __init__ (line 136) | def __init__(self, mode):
    method forward (line 142) | def forward(self, x):

FILE: lavis/common/annotator/uniformer/mmcv/ops/correlation.py
  class CorrelationFunction (line 14) | class CorrelationFunction(Function):
    method forward (line 17) | def forward(ctx,
    method backward (line 63) | def backward(ctx, grad_output):
    method _output_size (line 96) | def _output_size(ctx, input1):
  class Correlation (line 114) | class Correlation(nn.Module):
    method __init__ (line 167) | def __init__(self,
    method forward (line 182) | def forward(self, input1: Tensor, input2: Tensor) -> Tensor:
    method __repr__ (line 188) | def __repr__(self) -> str:

FILE: lavis/common/annotator/uniformer/mmcv/ops/deform_conv.py
  class DeformConv2dFunction (line 22) | class DeformConv2dFunction(Function):
    method symbolic (line 25) | def symbolic(g,
    method forward (line 50) | def forward(ctx,
    method backward (line 114) | def backward(ctx, grad_output):
    method _output_size (line 173) | def _output_size(ctx, input, weight):
  class DeformConv2d (line 192) | class DeformConv2d(nn.Module):
    method __init__ (line 228) | def __init__(self,
    method reset_parameters (line 269) | def reset_parameters(self):
    method forward (line 276) | def forward(self, x: Tensor, offset: Tensor) -> Tensor:
    method __repr__ (line 315) | def __repr__(self):
  class DeformConv2dPack (line 331) | class DeformConv2dPack(DeformConv2d):
    method __init__ (line 358) | def __init__(self, *args, **kwargs):
    method init_offset (line 370) | def init_offset(self):
    method forward (line 374) | def forward(self, x):
    method _load_from_state_dict (line 380) | def _load_from_state_dict(self, state_dict, prefix, local_metadata, st...

FILE: lavis/common/annotator/uniformer/mmcv/ops/deform_roi_pool.py
  class DeformRoIPoolFunction (line 13) | class DeformRoIPoolFunction(Function):
    method symbolic (line 16) | def symbolic(g, input, rois, offset, output_size, spatial_scale,
    method forward (line 30) | def forward(ctx,
    method backward (line 67) | def backward(ctx, grad_output):
  class DeformRoIPool (line 92) | class DeformRoIPool(nn.Module):
    method __init__ (line 94) | def __init__(self,
    method forward (line 105) | def forward(self, input, rois, offset=None):
  class DeformRoIPoolPack (line 111) | class DeformRoIPoolPack(DeformRoIPool):
    method __init__ (line 113) | def __init__(self,
    method forward (line 138) | def forward(self, input, rois):
  class ModulatedDeformRoIPoolPack (line 152) | class ModulatedDeformRoIPoolPack(DeformRoIPool):
    method __init__ (line 154) | def __init__(self,
    method forward (line 190) | def forward(self, input, rois):

FILE: lavis/common/annotator/uniformer/mmcv/ops/deprecated_wrappers.py
  class Conv2d_deprecated (line 9) | class Conv2d_deprecated(Conv2d):
    method __init__ (line 11) | def __init__(self, *args, **kwargs):
  class ConvTranspose2d_deprecated (line 18) | class ConvTranspose2d_deprecated(ConvTranspose2d):
    method __init__ (line 20) | def __init__(self, *args, **kwargs):
  class MaxPool2d_deprecated (line 28) | class MaxPool2d_deprecated(MaxPool2d):
    method __init__ (line 30) | def __init__(self, *args, **kwargs):
  class Linear_deprecated (line 37) | class Linear_deprecated(Linear):
    method __init__ (line 39) | def __init__(self, *args, **kwargs):

FILE: lavis/common/annotator/uniformer/mmcv/ops/focal_loss.py
  class SigmoidFocalLossFunction (line 15) | class SigmoidFocalLossFunction(Function):
    method symbolic (line 18) | def symbolic(g, input, target, gamma, alpha, weight, reduction):
    method forward (line 29) | def forward(ctx,
    method backward (line 66) | def backward(ctx, grad_output):
  class SigmoidFocalLoss (line 88) | class SigmoidFocalLoss(nn.Module):
    method __init__ (line 90) | def __init__(self, gamma, alpha, weight=None, reduction='mean'):
    method forward (line 97) | def forward(self, input, target):
    method __repr__ (line 101) | def __repr__(self):
  class SoftmaxFocalLossFunction (line 109) | class SoftmaxFocalLossFunction(Function):
    method symbolic (line 112) | def symbolic(g, input, target, gamma, alpha, weight, reduction):
    method forward (line 123) | def forward(ctx,
    method backward (line 171) | def backward(ctx, grad_output):
  class SoftmaxFocalLoss (line 194) | class SoftmaxFocalLoss(nn.Module):
    method __init__ (line 196) | def __init__(self, gamma, alpha, weight=None, reduction='mean'):
    method forward (line 203) | def forward(self, input, target):
    method __repr__ (line 207) | def __repr__(self):

FILE: lavis/common/annotator/uniformer/mmcv/ops/furthest_point_sample.py
  class FurthestPointSampling (line 12) | class FurthestPointSampling(Function):
    method forward (line 17) | def forward(ctx, points_xyz: torch.Tensor,
    method backward (line 46) | def backward(xyz, a=None):
  class FurthestPointSamplingWithDist (line 50) | class FurthestPointSamplingWithDist(Function):
    method forward (line 55) | def forward(ctx, points_dist: torch.Tensor,
    method backward (line 78) | def backward(xyz, a=None):

FILE: lavis/common/annotator/uniformer/mmcv/ops/fused_bias_leakyrelu.py
  class FusedBiasLeakyReLUFunctionBackward (line 108) | class FusedBiasLeakyReLUFunctionBackward(Function):
    method forward (line 116) | def forward(ctx, grad_output, out, negative_slope, scale):
    method backward (line 142) | def backward(ctx, gradgrad_input, gradgrad_bias):
  class FusedBiasLeakyReLUFunction (line 160) | class FusedBiasLeakyReLUFunction(Function):
    method forward (line 163) | def forward(ctx, input, bias, negative_slope, scale):
    method backward (line 181) | def backward(ctx, grad_output):
  class FusedBiasLeakyReLU (line 190) | class FusedBiasLeakyReLU(nn.Module):
    method __init__ (line 213) | def __init__(self, num_channels, negative_slope=0.2, scale=2**0.5):
    method forward (line 220) | def forward(self, input):
  function fused_bias_leakyrelu (line 225) | def fused_bias_leakyrelu(input, bias, negative_slope=0.2, scale=2**0.5):
  function bias_leakyrelu_ref (line 257) | def bias_leakyrelu_ref(x, bias, negative_slope=0.2, scale=2**0.5):

FILE: lavis/common/annotator/uniformer/mmcv/ops/gather_points.py
  class GatherPoints (line 10) | class GatherPoints(Function):
    method forward (line 14) | def forward(ctx, features: torch.Tensor,
    method backward (line 40) | def backward(ctx, grad_out):

FILE: lavis/common/annotator/uniformer/mmcv/ops/group_points.py
  class QueryAndGroup (line 16) | class QueryAndGroup(nn.Module):
    method __init__ (line 39) | def __init__(self,
    method forward (line 67) | def forward(self, points_xyz, center_xyz, features=None):
  class GroupAll (line 134) | class GroupAll(nn.Module):
    method __init__ (line 141) | def __init__(self, use_xyz: bool = True):
    method forward (line 145) | def forward(self,
  class GroupingOperation (line 173) | class GroupingOperation(Function):
    method forward (line 177) | def forward(ctx, features: torch.Tensor,
    method backward (line 202) | def backward(ctx,

FILE: lavis/common/annotator/uniformer/mmcv/ops/info.py
  function get_compiler_version (line 10) | def get_compiler_version():
  function get_compiling_cuda_version (line 13) | def get_compiling_cuda_version():
  function get_compiler_version (line 20) | def get_compiler_version():
  function get_compiling_cuda_version (line 23) | def get_compiling_cuda_version():
  function get_onnxruntime_op_path (line 27) | def get_onnxruntime_op_path():

FILE: lavis/common/annotator/uniformer/mmcv/ops/iou3d.py
  function boxes_iou_bev (line 12) | def boxes_iou_bev(boxes_a, boxes_b):
  function nms_bev (line 31) | def nms_bev(boxes, scores, thresh, pre_max_size=None, post_max_size=None):
  function nms_normal_bev (line 65) | def nms_normal_bev(boxes, scores, thresh):

FILE: lavis/common/annotator/uniformer/mmcv/ops/knn.py
  class KNN (line 9) | class KNN(Function):
    method forward (line 18) | def forward(ctx,
    method backward (line 73) | def backward(ctx, a=None):

FILE: lavis/common/annotator/uniformer/mmcv/ops/masked_conv.py
  class MaskedConv2dFunction (line 16) | class MaskedConv2dFunction(Function):
    method symbolic (line 19) | def symbolic(g, features, mask, weight, bias, padding, stride):
    method forward (line 30) | def forward(ctx, features, mask, weight, bias, padding=0, stride=1):
    method backward (line 79) | def backward(ctx, grad_output):
  class MaskedConv2d (line 86) | class MaskedConv2d(nn.Conv2d):
    method __init__ (line 93) | def __init__(self,
    method forward (line 106) | def forward(self, input, mask=None):

FILE: lavis/common/annotator/uniformer/mmcv/ops/merge_cells.py
  class BaseMergeCell (line 11) | class BaseMergeCell(nn.Module):
    method __init__ (line 43) | def __init__(self,
    method _build_input_conv (line 78) | def _build_input_conv(self, channel, conv_cfg, norm_cfg):
    method _binary_op (line 89) | def _binary_op(self, x1, x2):
    method _resize (line 92) | def _resize(self, x, size):
    method forward (line 103) | def forward(self, x1, x2, out_size=None):
  class SumCell (line 121) | class SumCell(BaseMergeCell):
    method __init__ (line 123) | def __init__(self, in_channels, out_channels, **kwargs):
    method _binary_op (line 126) | def _binary_op(self, x1, x2):
  class ConcatCell (line 130) | class ConcatCell(BaseMergeCell):
    method __init__ (line 132) | def __init__(self, in_channels, out_channels, **kwargs):
    method _binary_op (line 136) | def _binary_op(self, x1, x2):
  class GlobalPoolingCell (line 141) | class GlobalPoolingCell(BaseMergeCell):
    method __init__ (line 143) | def __init__(self, in_channels=None, out_channels=None, **kwargs):
    method _binary_op (line 147) | def _binary_op(self, x1, x2):

FILE: lavis/common/annotator/uniformer/mmcv/ops/modulated_deform_conv.py
  class ModulatedDeformConv2dFunction (line 19) | class ModulatedDeformConv2dFunction(Function):
    method symbolic (line 22) | def symbolic(g, input, offset, mask, weight, bias, stride, padding,
    method forward (line 37) | def forward(ctx,
    method backward (line 97) | def backward(ctx, grad_output):
    method _output_size (line 137) | def _output_size(ctx, input, weight):
  class ModulatedDeformConv2d (line 156) | class ModulatedDeformConv2d(nn.Module):
    method __init__ (line 160) | def __init__(self,
    method init_weights (line 192) | def init_weights(self):
    method forward (line 201) | def forward(self, x, offset, mask):
  class ModulatedDeformConv2dPack (line 209) | class ModulatedDeformConv2dPack(ModulatedDeformConv2d):
    method __init__ (line 228) | def __init__(self, *args, **kwargs):
    method init_weights (line 240) | def init_weights(self):
    method forward (line 246) | def forward(self, x):
    method _load_from_state_dict (line 256) | def _load_from_state_dict(self, state_dict, prefix, local_metadata, st...

FILE: lavis/common/annotator/uniformer/mmcv/ops/multi_scale_deform_attn.py
  class MultiScaleDeformableAttnFunction (line 20) | class MultiScaleDeformableAttnFunction(Function):
    method forward (line 23) | def forward(ctx, value, value_spatial_shapes, value_level_start_index,
    method backward (line 61) | def backward(ctx, grad_output):
  function multi_scale_deformable_attn_pytorch (line 94) | def multi_scale_deformable_attn_pytorch(value, value_spatial_shapes,
  class MultiScaleDeformableAttention (line 155) | class MultiScaleDeformableAttention(BaseModule):
    method __init__ (line 182) | def __init__(self,
    method init_weights (line 230) | def init_weights(self):
    method forward (line 252) | def forward(self,

FILE: lavis/common/annotator/uniformer/mmcv/ops/nms.py
  class NMSop (line 14) | class NMSop(torch.autograd.Function):
    method forward (line 17) | def forward(ctx, bboxes, scores, iou_threshold, offset, score_threshold,
    method symbolic (line 36) | def symbolic(g, bboxes, scores, iou_threshold, offset, score_threshold,
  class SoftNMSop (line 82) | class SoftNMSop(torch.autograd.Function):
    method forward (line 85) | def forward(ctx, boxes, scores, iou_threshold, sigma, min_score, method,
    method symbolic (line 100) | def symbolic(g, boxes, scores, iou_threshold, sigma, min_score, method,
  function nms (line 118) | def nms(boxes, scores, iou_threshold, offset=0, score_threshold=0, max_n...
  function soft_nms (line 181) | def soft_nms(boxes,
  function batched_nms (line 260) | def batched_nms(boxes, scores, idxs, nms_cfg, class_agnostic=False):
  function nms_match (line 339) | def nms_match(dets, iou_threshold):
  function nms_rotated (line 376) | def nms_rotated(dets, scores, iou_threshold, labels=None):

FILE: lavis/common/annotator/uniformer/mmcv/ops/pixel_group.py
  function pixel_group (line 10) | def pixel_group(score, mask, embedding, kernel_label, kernel_contour,

FILE: lavis/common/annotator/uniformer/mmcv/ops/point_sample.py
  function bilinear_grid_sample (line 12) | def bilinear_grid_sample(im, grid, align_corners=False):
  function is_in_onnx_export_without_custom_ops (line 87) | def is_in_onnx_export_without_custom_ops():
  function normalize (line 94) | def normalize(grid):
  function denormalize (line 105) | def denormalize(grid):
  function generate_grid (line 116) | def generate_grid(num_grid, size, device):
  function rel_roi_point_to_abs_img_point (line 137) | def rel_roi_point_to_abs_img_point(rois, rel_roi_points):
  function get_shape_from_feature_map (line 168) | def get_shape_from_feature_map(x):
  function abs_img_point_to_rel_img_point (line 186) | def abs_img_point_to_rel_img_point(abs_img_points, img, spatial_scale=1.):
  function rel_roi_point_to_rel_img_point (line 216) | def rel_roi_point_to_rel_img_point(rois,
  function point_sample (line 242) | def point_sample(input, points, align_corners=False, **kwargs):
  class SimpleRoIAlign (line 276) | class SimpleRoIAlign(nn.Module):
    method __init__ (line 278) | def __init__(self, output_size, spatial_scale, aligned=True):
    method forward (line 296) | def forward(self, features, rois):
    method __repr__ (line 332) | def __repr__(self):

FILE: lavis/common/annotator/uniformer/mmcv/ops/points_in_boxes.py
  function points_in_boxes_part (line 11) | def points_in_boxes_part(points, boxes):
  function points_in_boxes_cpu (line 58) | def points_in_boxes_cpu(points, boxes):
  function points_in_boxes_all (line 95) | def points_in_boxes_all(points, boxes):

FILE: lavis/common/annotator/uniformer/mmcv/ops/points_sampler.py
  function calc_square_dist (line 11) | def calc_square_dist(point_feat_a, point_feat_b, norm=True):
  function get_sampler_cls (line 37) | def get_sampler_cls(sampler_type):
  class PointsSampler (line 60) | class PointsSampler(nn.Module):
    method __init__ (line 74) | def __init__(self,
    method forward (line 92) | def forward(self, points_xyz, features):
  class DFPSSampler (line 133) | class DFPSSampler(nn.Module):
    method __init__ (line 136) | def __init__(self):
    method forward (line 139) | def forward(self, points, features, npoint):
  class FFPSSampler (line 145) | class FFPSSampler(nn.Module):
    method __init__ (line 148) | def __init__(self):
    method forward (line 151) | def forward(self, points, features, npoint):
  class FSSampler (line 162) | class FSSampler(nn.Module):
    method __init__ (line 165) | def __init__(self):
    method forward (line 168) | def forward(self, points, features, npoint):

FILE: lavis/common/annotator/uniformer/mmcv/ops/psa_mask.py
  class PSAMaskFunction (line 12) | class PSAMaskFunction(Function):
    method symbolic (line 15) | def symbolic(g, input, psa_type, mask_size):
    method forward (line 23) | def forward(ctx, input, psa_type, mask_size):
    method backward (line 48) | def backward(ctx, grad_output):
  class PSAMask (line 72) | class PSAMask(nn.Module):
    method __init__ (line 74) | def __init__(self, psa_type, mask_size=None):
    method forward (line 85) | def forward(self, input):
    method __repr__ (line 88) | def __repr__(self):

FILE: lavis/common/annotator/uniformer/mmcv/ops/roi_align.py
  class RoIAlignFunction (line 14) | class RoIAlignFunction(Function):
    method symbolic (line 17) | def symbolic(g, input, rois, output_size, spatial_scale, sampling_ratio,
    method forward (line 64) | def forward(ctx,
    method backward (line 110) | def backward(ctx, grad_output):
  class RoIAlign (line 133) | class RoIAlign(nn.Module):
    method __init__ (line 176) | def __init__(self,
    method forward (line 192) | def forward(self, input, rois):
    method __repr__ (line 215) | def __repr__(self):

FILE: lavis/common/annotator/uniformer/mmcv/ops/roi_align_rotated.py
  class RoIAlignRotatedFunction (line 11) | class RoIAlignRotatedFunction(Function):
    method symbolic (line 14) | def symbolic(g, features, rois, out_size, spatial_scale, sample_num,
    method forward (line 39) | def forward(ctx,
    method backward (line 82) | def backward(ctx, grad_output):
  class RoIAlignRotated (line 116) | class RoIAlignRotated(nn.Module):
    method __init__ (line 159) | def __init__(self,
    method forward (line 173) | def forward(self, features, rois):

FILE: lavis/common/annotator/uniformer/mmcv/ops/roi_pool.py
  class RoIPoolFunction (line 14) | class RoIPoolFunction(Function):
    method symbolic (line 17) | def symbolic(g, input, rois, output_size, spatial_scale):
    method forward (line 26) | def forward(ctx, input, rois, output_size, spatial_scale=1.0):
    method backward (line 52) | def backward(ctx, grad_output):
  class RoIPool (line 71) | class RoIPool(nn.Module):
    method __init__ (line 73) | def __init__(self, output_size, spatial_scale=1.0):
    method forward (line 79) | def forward(self, input, rois):
    method __repr__ (line 82) | def __repr__(self):

FILE: lavis/common/annotator/uniformer/mmcv/ops/roiaware_pool3d.py
  class RoIAwarePool3d (line 13) | class RoIAwarePool3d(nn.Module):
    method __init__ (line 28) | def __init__(self, out_size, max_pts_per_voxel=128, mode='max'):
    method forward (line 37) | def forward(self, rois, pts, pts_feature):
  class RoIAwarePool3dFunction (line 54) | class RoIAwarePool3dFunction(Function):
    method forward (line 57) | def forward(ctx, rois, pts, pts_feature, out_size, max_pts_per_voxel,
    method backward (line 105) | def backward(ctx, grad_out):

FILE: lavis/common/annotator/uniformer/mmcv/ops/roipoint_pool3d.py
  class RoIPointPool3d (line 9) | class RoIPointPool3d(nn.Module):
    method __init__ (line 20) | def __init__(self, num_sampled_points=512):
    method forward (line 24) | def forward(self, points, point_features, boxes3d):
  class RoIPointPool3dFunction (line 41) | class RoIPointPool3dFunction(Function):
    method forward (line 44) | def forward(ctx, points, point_features, boxes3d, num_sampled_points=5...
    method backward (line 76) | def backward(ctx, grad_out):

FILE: lavis/common/annotator/uniformer/mmcv/ops/saconv.py
  class SAConv2d (line 12) | class SAConv2d(ConvAWS2d):
    method __init__ (line 37) | def __init__(self,
    method init_weights (line 81) | def init_weights(self):
    method forward (line 90) | def forward(self, x):

FILE: lavis/common/annotator/uniformer/mmcv/ops/scatter_points.py
  class _DynamicScatter (line 13) | class _DynamicScatter(Function):
    method forward (line 16) | def forward(ctx, feats, coors, reduce_type='max'):
    method backward (line 44) | def backward(ctx, grad_voxel_feats, grad_voxel_coors=None):
  class DynamicScatter (line 59) | class DynamicScatter(nn.Module):
    method __init__ (line 75) | def __init__(self, voxel_size, point_cloud_range, average_points: bool):
    method forward_single (line 82) | def forward_single(self, points, coors):
    method forward (line 98) | def forward(self, points, coors):
    method __repr__ (line 129) | def __repr__(self):

FILE: lavis/common/annotator/uniformer/mmcv/ops/sync_bn.py
  class SyncBatchNormFunction (line 19) | class SyncBatchNormFunction(Function):
    method symbolic (line 22) | def symbolic(g, input, running_mean, running_var, weight, bias, momentum,
    method forward (line 38) | def forward(self, input, running_mean, running_var, weight, bias, mome...
    method backward (line 129) | def backward(self, grad_output):
  class SyncBatchNorm (line 159) | class SyncBatchNorm(Module):
    method __init__ (line 193) | def __init__(self,
    method reset_running_stats (line 230) | def reset_running_stats(self):
    method reset_parameters (line 236) | def reset_parameters(self):
    method forward (line 242) | def forward(self, input):
    method __repr__ (line 270) | def __repr__(self):

FILE: lavis/common/annotator/uniformer/mmcv/ops/three_interpolate.py
  class ThreeInterpolate (line 12) | class ThreeInterpolate(Function):
    method forward (line 20) | def forward(ctx, features: torch.Tensor, indices: torch.Tensor,
    method backward (line 47) | def backward(

FILE: lavis/common/annotator/uniformer/mmcv/ops/three_nn.py
  class ThreeNN (line 11) | class ThreeNN(Function):
    method forward (line 19) | def forward(ctx, target: torch.Tensor,
    method backward (line 47) | def backward(ctx, a=None, b=None):

FILE: lavis/common/annotator/uniformer/mmcv/ops/tin_shift.py
  class TINShiftFunction (line 17) | class TINShiftFunction(Function):
    method forward (line 20) | def forward(ctx, input, shift):
    method backward (line 35) | def backward(ctx, grad_output):
  class TINShift (line 48) | class TINShift(nn.Module):
    method forward (line 58) | def forward(self, input, shift):

FILE: lavis/common/annotator/uniformer/mmcv/ops/upfirdn2d.py
  class UpFirDn2dBackward (line 108) | class UpFirDn2dBackward(Function):
    method forward (line 111) | def forward(ctx, grad_output, kernel, grad_kernel, up, down, pad, g_pad,
    method backward (line 152) | def backward(ctx, gradgrad_input):
  class UpFirDn2d (line 177) | class UpFirDn2d(Function):
    method forward (line 180) | def forward(ctx, input, kernel, up, down, pad):
    method backward (line 225) | def backward(ctx, grad_output):
  function upfirdn2d (line 243) | def upfirdn2d(input, kernel, up=1, down=1, pad=(0, 0)):
  function upfirdn2d_native (line 290) | def upfirdn2d_native(input, kernel, up_x, up_y, down_x, down_y, pad_x0, ...

FILE: lavis/common/annotator/uniformer/mmcv/ops/voxelize.py
  class _Voxelization (line 13) | class _Voxelization(Function):
    method forward (line 16) | def forward(ctx,
  class Voxelization (line 71) | class Voxelization(nn.Module):
    method __init__ (line 89) | def __init__(self,
    method forward (line 116) | def forward(self, input):
    method __repr__ (line 125) | def __repr__(self):

FILE: lavis/common/annotator/uniformer/mmcv/parallel/_functions.py
  function scatter (line 6) | def scatter(input, devices, streams=None):
  function synchronize_stream (line 34) | def synchronize_stream(output, devices, streams):
  function get_input_device (line 51) | def get_input_device(input):
  class Scatter (line 64) | class Scatter:
    method forward (line 67) | def forward(target_gpus, input):

FILE: lavis/common/annotator/uniformer/mmcv/parallel/collate.py
  function collate (line 11) | def collate(batch, samples_per_gpu=1):

FILE: lavis/common/annotator/uniformer/mmcv/parallel/data_container.py
  function assert_tensor_type (line 7) | def assert_tensor_type(func):
  class DataContainer (line 20) | class DataContainer:
    method __init__ (line 37) | def __init__(self,
    method __repr__ (line 50) | def __repr__(self):
    method __len__ (line 53) | def __len__(self):
    method data (line 57) | def data(self):
    method datatype (line 61) | def datatype(self):
    method cpu_only (line 68) | def cpu_only(self):
    method stack (line 72) | def stack(self):
    method padding_value (line 76) | def padding_value(self):
    method pad_dims (line 80) | def pad_dims(self):
    method size (line 84) | def size(self, *args, **kwargs):
    method dim (line 88) | def dim(self):

FILE: lavis/common/annotator/uniformer/mmcv/parallel/data_parallel.py
  class MMDataParallel (line 9) | class MMDataParallel(DataParallel):
    method __init__ (line 26) | def __init__(self, *args, dim=0, **kwargs):
    method forward (line 30) | def forward(self, *inputs, **kwargs):
    method scatter (line 44) | def scatter(self, inputs, kwargs, device_ids):
    method train_step (line 47) | def train_step(self, *inputs, **kwargs):
    method val_step (line 69) | def val_step(self, *inputs, **kwargs):

FILE: lavis/common/annotator/uniformer/mmcv/parallel/distributed.py
  class MMDistributedDataParallel (line 11) | class MMDistributedDataParallel(DistributedDataParallel):
    method to_kwargs (line 21) | def to_kwargs(self, inputs, kwargs, device_id):
    method scatter (line 26) | def scatter(self, inputs, kwargs, device_ids):
    method train_step (line 29) | def train_step(self, *inputs, **kwargs):
    method val_step (line 72) | def val_step(self, *inputs, **kwargs):

FILE: lavis/common/annotator/uniformer/mmcv/parallel/distributed_deprecated.py
  class MMDistributedDataParallel (line 14) | class MMDistributedDataParallel(nn.Module):
    method __init__ (line 16) | def __init__(self,
    method _dist_broadcast_coalesced (line 29) | def _dist_broadcast_coalesced(self, tensors, buffer_size):
    method _sync_params (line 37) | def _sync_params(self):
    method scatter (line 52) | def scatter(self, inputs, kwargs, device_ids):
    method forward (line 55) | def forward(self, *inputs, **kwargs):
    method train_step (line 60) | def train_step(self, *inputs, **kwargs):
    method val_step (line 66) | def val_step(self, *inputs, **kwargs):

FILE: lavis/common/annotator/uniformer/mmcv/parallel/scatter_gather.py
  function scatter (line 9) | def scatter(inputs, target_gpus, dim=0):
  function scatter_kwargs (line 49) | def scatter_kwargs(inputs, kwargs, target_gpus, dim=0):

FILE: lavis/common/annotator/uniformer/mmcv/parallel/utils.py
  function is_module_wrapper (line 5) | def is_module_wrapper(module):

FILE: lavis/common/annotator/uniformer/mmcv/runner/base_module.py
  class BaseModule (line 14) | class BaseModule(nn.Module, metaclass=ABCMeta):
    method __init__ (line 33) | def __init__(self, init_cfg=None):
    method is_init (line 53) | def is_init(self):
    method init_weights (line 56) | def init_weights(self):
    method _dump_init_info (line 137) | def _dump_init_info(self, logger_name):
    method __repr__ (line 166) | def __repr__(self):
  class Sequential (line 173) | class Sequential(BaseModule, nn.Sequential):
    method __init__ (line 180) | def __init__(self, *args, init_cfg=None):
  class ModuleList (line 185) | class ModuleList(BaseModule, nn.ModuleList):
    method __init__ (line 193) | def __init__(self, modules=None, init_cfg=None):

FILE: lavis/common/annotator/uniformer/mmcv/runner/base_runner.py
  class BaseRunner (line 21) | class BaseRunner(metaclass=ABCMeta):
    method __init__ (line 51) | def __init__(self,
    method model_name (line 139) | def model_name(self):
    method rank (line 144) | def rank(self):
    method world_size (line 149) | def world_size(self):
    method hooks (line 155) | def hooks(self):
    method epoch (line 160) | def epoch(self):
    method iter (line 165) | def iter(self):
    method inner_iter (line 170) | def inner_iter(self):
    method max_epochs (line 175) | def max_epochs(self):
    method max_iters (line 180) | def max_iters(self):
    method train (line 185) | def train(self):
    method val (line 189) | def val(self):
    method run (line 193) | def run(self, data_loaders, workflow, **kwargs):
    method save_checkpoint (line 197) | def save_checkpoint(self,
    method current_lr (line 205) | def current_lr(self):
    method current_momentum (line 224) | def current_momentum(self):
    method register_hook (line 255) | def register_hook(self, hook, priority='NORMAL'):
    method register_hook_from_cfg (line 283) | def register_hook_from_cfg(self, hook_cfg):
    method call_hook (line 299) | def call_hook(self, fn_name):
    method get_hook_info (line 309) | def get_hook_info(self):
    method load_checkpoint (line 332) | def load_checkpoint(self,
    method resume (line 345) | def resume(self,
    method register_lr_hook (line 399) | def register_lr_hook(self, lr_config):
    method register_momentum_hook (line 420) | def register_momentum_hook(self, momentum_config):
    method register_optimizer_hook (line 441) | def register_optimizer_hook(self, optimizer_config):
    method register_checkpoint_hook (line 451) | def register_checkpoint_hook(self, checkpoint_config):
    method register_logger_hooks (line 461) | def register_logger_hooks(self, log_config):
    method register_timer_hook (line 470) | def register_timer_hook(self, timer_config):
    method register_custom_hooks (line 480) | def register_custom_hooks(self, custom_config):
    method register_profiler_hook (line 493) | def register_profiler_hook(self, profiler_config):
    method register_training_hooks (line 503) | def register_training_hooks(self,

FILE: lavis/common/annotator/uniformer/mmcv/runner/builder.py
  function build_runner_constructor (line 10) | def build_runner_constructor(cfg):
  function build_runner (line 14) | def build_runner(cfg, default_args=None):

FILE: lavis/common/annotator/uniformer/mmcv/runner/checkpoint.py
  function _get_mmcv_home (line 30) | def _get_mmcv_home():
  function load_state_dict (line 41) | def load_state_dict(module, state_dict, strict=False, logger=None):
  function get_torchvision_models (line 109) | def get_torchvision_models():
  function get_external_models (line 121) | def get_external_models():
  function get_mmcls_models (line 135) | def get_mmcls_models():
  function get_deprecated_model_names (line 142) | def get_deprecated_model_names():
  function _process_mmcls_checkpoint (line 151) | def _process_mmcls_checkpoint(checkpoint):
  class CheckpointLoader (line 162) | class CheckpointLoader:
    method _register_scheme (line 168) | def _register_scheme(cls, prefixes, loader, force=False):
    method register_scheme (line 185) | def register_scheme(cls, prefixes, loader=None, force=False):
    method _get_checkpoint_loader (line 211) | def _get_checkpoint_loader(cls, path):
    method load_checkpoint (line 227) | def load_checkpoint(cls, filename, map_location=None, logger=None):
  function load_from_local (line 249) | def load_from_local(filename, map_location):
  function load_from_http (line 267) | def load_from_http(filename, map_location=None, model_dir=None):
  function load_from_pavi (line 295) | def load_from_pavi(filename, map_location=None):
  function load_from_ceph (line 327) | def load_from_ceph(filename, map_location=None, backend='petrel'):
  function load_from_torchvision (line 368) | def load_from_torchvision(filename, map_location=None):
  function load_from_openmmlab (line 391) | def load_from_openmmlab(filename, map_location=None):
  function load_from_mmcls (line 431) | def load_from_mmcls(filename, map_location=None):
  function _load_checkpoint (line 450) | def _load_checkpoint(filename, map_location=None, logger=None):
  function _load_checkpoint_with_prefix (line 470) | def _load_checkpoint_with_prefix(prefix, filename, map_location=None):
  function load_checkpoint (line 503) | def load_checkpoint(model,
  function weights_to_cpu (line 553) | def weights_to_cpu(state_dict):
  function _save_to_state_dict (line 570) | def _save_to_state_dict(module, destination, prefix, keep_vars):
  function get_state_dict (line 590) | def get_state_dict(module, destination=None, prefix='', keep_vars=False):
  function save_checkpoint (line 634) | def save_checkpoint(model,

FILE: lavis/common/annotator/uniformer/mmcv/runner/default_constructor.py
  class DefaultRunnerConstructor (line 5) | class DefaultRunnerConstructor:
    method __init__ (line 36) | def __init__(self, runner_cfg, default_args=None):
    method __call__ (line 43) | def __call__(self):

FILE: lavis/common/annotator/uniformer/mmcv/runner/dist_utils.py
  function init_dist (line 14) | def init_dist(launcher, backend='nccl', **kwargs):
  function _init_dist_pytorch (line 27) | def _init_dist_pytorch(backend, **kwargs):
  function _init_dist_mpi (line 35) | def _init_dist_mpi(backend, **kwargs):
  function _init_dist_slurm (line 43) | def _init_dist_slurm(backend, port=None):
  function get_dist_info (line 78) | def get_dist_info():
  function master_only (line 88) | def master_only(func):
  function allreduce_params (line 99) | def allreduce_params(params, coalesce=True, bucket_size_mb=-1):
  function allreduce_grads (line 121) | def allreduce_grads(params, coalesce=True, bucket_size_mb=-1):
  function _allreduce_coalesced (line 145) | def _allreduce_coalesced(tensors, world_size, bucket_size_mb=-1):

FILE: lavis/common/annotator/uniformer/mmcv/runner/epoch_based_runner.py
  class EpochBasedRunner (line 18) | class EpochBasedRunner(BaseRunner):
    method run_iter (line 24) | def run_iter(self, data_batch, train_mode, **kwargs):
    method train (line 40) | def train(self, data_loader, **kwargs):
    method val (line 58) | def val(self, data_loader, **kwargs):
    method run (line 72) | def run(self, data_loaders, workflow, max_epochs=None, **kwargs):
    method save_checkpoint (line 132) | def save_checkpoint(self,
  class Runner (line 181) | class Runner(EpochBasedRunner):
    method __init__ (line 184) | def __init__(self, *args, **kwargs):

FILE: lavis/common/annotator/uniformer/mmcv/runner/fp16_utils.py
  function cast_tensor_type (line 24) | def cast_tensor_type(inputs, src_type, dst_type):
  function auto_fp16 (line 55) | def auto_fp16(apply_to=None, out_fp32=False):
  function force_fp32 (line 141) | def force_fp32(apply_to=None, out_fp16=False):
  function allreduce_grads (line 227) | def allreduce_grads(params, coalesce=True, bucket_size_mb=-1):
  function wrap_fp16_model (line 234) | def wrap_fp16_model(model):
  function patch_norm_fp32 (line 263) | def patch_norm_fp32(module):
  function patch_forward_method (line 283) | def patch_forward_method(func, src_type, dst_type, convert_output=True):
  class LossScaler (line 306) | class LossScaler:
    method __init__ (line 335) | def __init__(self,
    method has_overflow (line 349) | def has_overflow(self, params):
    method _has_inf_or_nan (line 358) | def _has_inf_or_nan(x):
    method update_scale (line 372) | def update_scale(self, overflow):
    method state_dict (line 385) | def state_dict(self):
    method load_state_dict (line 395) | def load_state_dict(self, state_dict):
    method loss_scale (line 409) | def loss_scale(self):

FILE: lavis/common/annotator/uniformer/mmcv/runner/hooks/checkpoint.py
  class CheckpointHook (line 11) | class CheckpointHook(Hook):
    method __init__ (line 51) | def __init__(self,
    method before_run (line 71) | def before_run(self, runner):
    method after_train_epoch (line 102) | def after_train_epoch(self, runner):
    method _save_checkpoint (line 119) | def _save_checkpoint(self, runner):
    method after_train_iter (line 153) | def after_train_iter(self, runner):

FILE: lavis/common/annotator/uniformer/mmcv/runner/hooks/closure.py
  class ClosureHook (line 6) | class ClosureHook(Hook):
    method __init__ (line 8) | def __init__(self, fn_name, fn):

FILE: lavis/common/annotator/uniformer/mmcv/runner/hooks/ema.py
  class EMAHook (line 7) | class EMAHook(Hook):
    method __init__ (line 29) | def __init__(self,
    method before_run (line 41) | def before_run(self, runner):
    method after_train_iter (line 60) | def after_train_iter(self, runner):
    method after_train_epoch (line 73) | def after_train_epoch(self, runner):
    method before_train_epoch (line 78) | def before_train_epoch(self, runner):
    method _swap_ema_parameters (line 83) | def _swap_ema_parameters(self):

FILE: lavis/common/annotator/uniformer/mmcv/runner/hooks/evaluation.py
  class EvalHook (line 16) | class EvalHook(Hook):
    method __init__ (line 85) | def __init__(self,
    method _init_rule (line 153) | def _init_rule(self, rule, key_indicator):
    method before_run (line 202) | def before_run(self, runner):
    method before_train_iter (line 228) | def before_train_iter(self, runner):
    method before_train_epoch (line 236) | def before_train_epoch(self, runner):
    method after_train_iter (line 244) | def after_train_iter(self, runner):
    method after_train_epoch (line 264) | def after_train_epoch(self, runner):
    method _do_evaluate (line 269) | def _do_evaluate(self, runner):
    method _should_evaluate (line 279) | def _should_evaluate(self, runner):
    method _save_ckpt (line 314) | def _save_ckpt(self, runner, key_score):
    method evaluate (line 354) | def evaluate(self, runner, results):
  class DistEvalHook (line 387) | class DistEvalHook(EvalHook):
    method __init__ (line 439) | def __init__(self,
    method _do_evaluate (line 478) | def _do_evaluate(self, runner):

FILE: lavis/common/annotator/uniformer/mmcv/runner/hooks/hook.py
  class Hook (line 7) | class Hook:
    method before_run (line 13) | def before_run(self, runner):
    method after_run (line 16) | def after_run(self, runner):
    method before_epoch (line 19) | def before_epoch(self, runner):
    method after_epoch (line 22) | def after_epoch(self, runner):
    method before_iter (line 25) | def before_iter(self, runner):
    method after_iter (line 28) | def after_iter(self, runner):
    method before_train_epoch (line 31) | def before_train_epoch(self, runner):
    method before_val_epoch (line 34) | def before_val_epoch(self, runner):
    method after_train_epoch (line 37) | def after_train_epoch(self, runner):
    method after_val_epoch (line 40) | def after_val_epoch(self, runner):
    method before_train_iter (line 43) | def before_train_iter(self, runner):
    method before_val_iter (line 46) | def before_val_iter(self, runner):
    method after_train_iter (line 49) | def after_train_iter(self, runner):
    method after_val_iter (line 52) | def after_val_iter(self, runner):
    method every_n_epochs (line 55) | def every_n_epochs(self, runner, n):
    method every_n_inner_iters (line 58) | def every_n_inner_iters(self, runner, n):
    method every_n_iters (line 61) | def every_n_iters(self, runner, n):
    method end_of_epoch (line 64) | def end_of_epoch(self, runner):
    method is_last_epoch (line 67) | def is_last_epoch(self, runner):
    method is_last_iter (line 70) | def is_last_iter(self, runner):
    method get_triggered_stages (line 73) | def get_triggered_stages(self):

FILE: lavis/common/annotator/uniformer/mmcv/runner/hooks/iter_timer.py
  class IterTimerHook (line 8) | class IterTimerHook(Hook):
    method before_epoch (line 10) | def before_epoch(self, runner):
    method before_iter (line 13) | def before_iter(self, runner):
    method after_iter (line 16) | def after_iter(self, runner):

FILE: lavis/common/annotator/uniformer/mmcv/runner/hooks/logger/base.py
  class LoggerHook (line 11) | class LoggerHook(Hook):
    method __init__ (line 24) | def __init__(self,
    method log (line 35) | def log(self, runner):
    method is_scalar (line 39) | def is_scalar(val, include_np=True, include_torch=True):
    method get_mode (line 59) | def get_mode(self, runner):
    method get_epoch (line 72) | def get_epoch(self, runner):
    method get_iter (line 84) | def get_iter(self, runner, inner_iter=False):
    method get_lr_tags (line 92) | def get_lr_tags(self, runner):
    method get_momentum_tags (line 102) | def get_momentum_tags(self, runner):
    method get_loggable_tags (line 112) | def get_loggable_tags(self,
    method before_run (line 133) | def before_run(self, runner):
    method before_epoch (line 139) | def before_epoch(self, runner):
    method after_train_iter (line 142) | def after_train_iter(self, runner):
    method after_train_epoch (line 156) | def after_train_epoch(self, runner):
    method after_val_epoch (line 162) | def after_val_epoch(self, runner):

FILE: lavis/common/annotator/uniformer/mmcv/runner/hooks/logger/dvclive.py
  class DvcliveLoggerHook (line 8) | class DvcliveLoggerHook(LoggerHook):
    method __init__ (line 29) | def __init__(self,
    method import_dvclive (line 41) | def import_dvclive(self):
    method before_run (line 50) | def before_run(self, runner):
    method log (line 54) | def log(self, runner):

FILE: lavis/common/annotator/uniformer/mmcv/runner/hooks/logger/mlflow.py
  class MlflowLoggerHook (line 8) | class MlflowLoggerHook(LoggerHook):
    method __init__ (line 10) | def __init__(self,
    method import_mlflow (line 51) | def import_mlflow(self):
    method before_run (line 62) | def before_run(self, runner):
    method log (line 70) | def log(self, runner):
    method after_run (line 76) | def after_run(self, runner):

FILE: lavis/common/annotator/uniformer/mmcv/runner/hooks/logger/neptune.py
  class NeptuneLoggerHook (line 8) | class NeptuneLoggerHook(LoggerHook):
    method __init__ (line 38) | def __init__(self,
    method import_neptune (line 52) | def import_neptune(self):
    method before_run (line 62) | def before_run(self, runner):
    method log (line 69) | def log(self, runner):
    method after_run (line 81) | def after_run(self, runner):

FILE: lavis/common/annotator/uniformer/mmcv/runner/hooks/logger/pavi.py
  class PaviLoggerHook (line 17) | class PaviLoggerHook(LoggerHook):
    method __init__ (line 19) | def __init__(self,
    method before_run (line 36) | def before_run(self, runner):
    method get_step (line 74) | def get_step(self, runner):
    method log (line 82) | def log(self, runner):
    method after_run (line 89) | def after_run(self, runner):
    method before_epoch (line 107) | def before_epoch(self, runner):

FILE: lavis/common/annotator/uniformer/mmcv/runner/hooks/logger/tensorboard.py
  class TensorboardLoggerHook (line 11) | class TensorboardLoggerHook(LoggerHook):
    method __init__ (line 13) | def __init__(self,
    method before_run (line 24) | def before_run(self, runner):
    method log (line 47) | def log(self, runner):
    method after_run (line 56) | def after_run(self, runner):

FILE: lavis/common/annotator/uniformer/mmcv/runner/hooks/logger/text.py
  class TextLoggerHook (line 18) | class TextLoggerHook(LoggerHook):
    method __init__ (line 55) | def __init__(self,
    method before_run (line 89) | def before_run(self, runner):
    method _get_max_memory (line 109) | def _get_max_memory(self, runner):
    method _log_info (line 119) | def _log_info(self, log_dict, runner):
    method _dump_log (line 185) | def _dump_log(self, log_dict, runner):
    method _round_float (line 196) | def _round_float(self, items):
    method log (line 204) | def log(self, runner):
    method after_run (line 238) | def after_run(self, runner):

FILE: lavis/common/annotator/uniformer/mmcv/runner/hooks/logger/wandb.py
  class WandbLoggerHook (line 8) | class WandbLoggerHook(LoggerHook):
    method __init__ (line 10) | def __init__(self,
    method import_wandb (line 25) | def import_wandb(self):
    method before_run (line 34) | def before_run(self, runner):
    method log (line 44) | def log(self, runner):
    method after_run (line 55) | def after_run(self, runner):

FILE: lavis/common/annotator/uniformer/mmcv/runner/hooks/lr_updater.py
  class LrUpdaterHook (line 9) | class LrUpdaterHook(Hook):
    method __init__ (line 25) | def __init__(self,
    method _set_lr (line 58) | def _set_lr(self, runner, lr_groups):
    method get_lr (line 68) | def get_lr(self, runner, base_lr):
    method get_regular_lr (line 71) | def get_regular_lr(self, runner):
    method get_warmup_lr (line 85) | def get_warmup_lr(self, cur_iters):
    method before_run (line 107) | def before_run(self, runner):
    method before_train_epoch (line 126) | def before_train_epoch(self, runner):
    method before_train_iter (line 137) | def before_train_iter(self, runner):
  class FixedLrUpdaterHook (line 157) | class FixedLrUpdaterHook(LrUpdaterHook):
    method __init__ (line 159) | def __init__(self, **kwargs):
    method get_lr (line 162) | def get_lr(self, runner, base_lr):
  class StepLrUpdaterHook (line 167) | class StepLrUpdaterHook(LrUpdaterHook):
    method __init__ (line 180) | def __init__(self, step, gamma=0.1, min_lr=None, **kwargs):
    method get_lr (line 193) | def get_lr(self, runner, base_lr):
  class ExpLrUpdaterHook (line 214) | class ExpLrUpdaterHook(LrUpdaterHook):
    method __init__ (line 216) | def __init__(self, gamma, **kwargs):
    method get_lr (line 220) | def get_lr(self, runner, base_lr):
  class PolyLrUpdaterHook (line 226) | class PolyLrUpdaterHook(LrUpdaterHook):
    method __init__ (line 228) | def __init__(self, power=1., min_lr=0., **kwargs):
    method get_lr (line 233) | def get_lr(self, runner, base_lr):
  class InvLrUpdaterHook (line 245) | class InvLrUpdaterHook(LrUpdaterHook):
    method __init__ (line 247) | def __init__(self, gamma, power=1., **kwargs):
    method get_lr (line 252) | def get_lr(self, runner, base_lr):
  class CosineAnnealingLrUpdaterHook (line 258) | class CosineAnnealingLrUpdaterHook(LrUpdaterHook):
    method __init__ (line 260) | def __init__(self, min_lr=None, min_lr_ratio=None, **kwargs):
    method get_lr (line 266) | def get_lr(self, runner, base_lr):
  class FlatCosineAnnealingLrUpdaterHook (line 282) | class FlatCosineAnnealingLrUpdaterHook(LrUpdaterHook):
    method __init__ (line 298) | def __init__(self,
    method get_lr (line 314) | def get_lr(self, runner, base_lr):
  class CosineRestartLrUpdaterHook (line 336) | class CosineRestartLrUpdaterHook(LrUpdaterHook):
    method __init__ (line 349) | def __init__(self,
    method get_lr (line 368) | def get_lr(self, runner, base_lr):
  function get_position_from_periods (line 388) | def get_position_from_periods(iteration, cumulative_periods):
  class CyclicLrUpdaterHook (line 412) | class CyclicLrUpdaterHook(LrUpdaterHook):
    method __init__ (line 434) | def __init__(self,
    method before_run (line 472) | def before_run(self, runner):
    method get_lr (line 485) | def get_lr(self, runner, base_lr):
  class OneCycleLrUpdaterHook (line 498) | class OneCycleLrUpdaterHook(LrUpdaterHook):
    method __init__ (line 532) | def __init__(self,
    method before_run (line 575) | def before_run(self, runner):
    method get_lr (line 614) | def get_lr(self, runner, base_lr):
  function annealing_cos (line 627) | def annealing_cos(start, end, factor, weight=1):
  function annealing_linear (line 645) | def annealing_linear(start, end, factor):
  function format_param (line 659) | def format_param(name, optim, param):

FILE: lavis/common/annotator/uniformer/mmcv/runner/hooks/memory.py
  class EmptyCacheHook (line 8) | class EmptyCacheHook(Hook):
    method __init__ (line 10) | def __init__(self, before_epoch=False, after_epoch=True, after_iter=Fa...
    method after_iter (line 15) | def after_iter(self, runner):
    method before_epoch (line 19) | def before_epoch(self, runner):
    method after_epoch (line 23) | def after_epoch(self, runner):

FILE: lavis/common/annotator/uniformer/mmcv/runner/hooks/momentum_updater.py
  class MomentumUpdaterHook (line 7) | class MomentumUpdaterHook(Hook):
    method __init__ (line 9) | def __init__(self,
    method _set_momentum (line 35) | def _set_momentum(self, runner, momentum_groups):
    method get_momentum (line 52) | def get_momentum(self, runner, base_momentum):
    method get_regular_momentum (line 55) | def get_regular_momentum(self, runner):
    method get_warmup_momentum (line 71) | def get_warmup_momentum(self, cur_iters):
    method before_run (line 101) | def before_run(self, runner):
    method before_train_epoch (line 128) | def before_train_epoch(self, runner):
    method before_train_iter (line 134) | def before_train_iter(self, runner):
  class StepMomentumUpdaterHook (line 154) | class StepMomentumUpdaterHook(MomentumUpdaterHook):
    method __init__ (line 168) | def __init__(self, step, gamma=0.5, min_momentum=None, **kwargs):
    method get_momentum (line 181) | def get_momentum(self, runner, base_momentum):
  class CosineAnnealingMomentumUpdaterHook (line 202) | class CosineAnnealingMomentumUpdaterHook(MomentumUpdaterHook):
    method __init__ (line 204) | def __init__(self, min_momentum=None, min_momentum_ratio=None, **kwargs):
    method get_momentum (line 210) | def get_momentum(self, runner, base_momentum):
  class CyclicMomentumUpdaterHook (line 226) | class CyclicMomentumUpdaterHook(MomentumUpdaterHook):
    method __init__ (line 244) | def __init__(self,
    method before_run (line 273) | def before_run(self, runner):
    method get_momentum (line 286) | def get_momentum(self, runner, base_momentum):
  class OneCycleMomentumUpdaterHook (line 299) | class OneCycleMomentumUpdaterHook(MomentumUpdaterHook):
    method __init__ (line 333) | def __init__(self,
    method before_run (line 371) | def before_run(self, runner):
    method _set_momentum (line 448) | def _set_momentum(self, runner, momentum_groups):
    method get_momentum (line 465) | def get_momentum(self, runner, param_group):
    method get_regular_momentum (line 479) | def get_regular_momentum(self, runner):

FILE: lavis/common/annotator/uniformer/mmcv/runner/hooks/optimizer.py
  class OptimizerHook (line 22) | class OptimizerHook(Hook):
    method __init__ (line 24) | def __init__(self, grad_clip=None):
    method clip_grads (line 27) | def clip_grads(self, params):
    method after_train_iter (line 33) | def after_train_iter(self, runner):
  class GradientCumulativeOptimizerHook (line 46) | class GradientCumulativeOptimizerHook(OptimizerHook):
    method __init__ (line 64) | def __init__(self, cumulative_iters=1, **kwargs):
    method has_batch_norm (line 76) | def has_batch_norm(self, module):
    method _init (line 84) | def _init(self, runner):
    method after_train_iter (line 105) | def after_train_iter(self, runner):
  class Fp16OptimizerHook (line 134) | class Fp16OptimizerHook(OptimizerHook):
    method __init__ (line 162) | def __init__(self,
    method before_run (line 184) | def before_run(self, runner):
    method copy_grads_to_fp32 (line 193) | def copy_grads_to_fp32(self, fp16_net, fp32_weights):
    method copy_params_to_fp16 (line 203) | def copy_params_to_fp16(self, fp16_net, fp32_weights):
    method after_train_iter (line 209) | def after_train_iter(self, runner):
  class GradientCumulativeFp16OptimizerHook (line 242) | class GradientCumulativeFp16OptimizerHook(GradientCumulativeOptimizerHook,
    method __init__ (line 251) | def __init__(self, *args, **kwargs):
    method after_train_iter (line 255) | def after_train_iter(self, runner):
  class Fp16OptimizerHook (line 297) | class Fp16OptimizerHook(OptimizerHook):
    method __init__ (line 318) | def __init__(self,
    method before_run (line 339) | def before_run(self, runner):
    method copy_grads_to_fp32 (line 367) | def copy_grads_to_fp32(self, fp16_net, fp32_weights):
    method copy_params_to_fp16 (line 377) | def copy_params_to_fp16(self, fp16_net, fp32_weights):
    method after_train_iter (line 383) | def after_train_iter(self, runner):
  class GradientCumulativeFp16OptimizerHook (line 439) | class GradientCumulativeFp16OptimizerHook(GradientCumulativeOptimizerHook,
    method __init__ (line 444) | def __init__(self, *args, **kwargs):
    method after_train_iter (line 448) | def after_train_iter(self, runner):

FILE: lavis/common/annotator/uniformer/mmcv/runner/hooks/profiler.py
  class ProfilerHook (line 12) | class ProfilerHook(Hook):
    method __init__ (line 55) | def __init__(self,
    method before_run (line 107) | def before_run(self, runner):
    method after_train_epoch (line 166) | def after_train_epoch(self, runner):
    method after_train_iter (line 174) | def after_train_iter(self, runner):

FILE: lavis/common/annotator/uniformer/mmcv/runner/hooks/sampler_seed.py
  class DistSamplerSeedHook (line 6) | class DistSamplerSeedHook(Hook):
    method before_epoch (line 14) | def before_epoch(self, runner):

FILE: lavis/common/annotator/uniformer/mmcv/runner/hooks/sync_buffer.py
  class SyncBuffersHook (line 7) | class SyncBuffersHook(Hook):
    method __init__ (line 16) | def __init__(self, distributed=True):
    method after_epoch (line 19) | def after_epoch(self, runner):

FILE: lavis/common/annotator/uniformer/mmcv/runner/iter_based_runner.py
  class IterLoader (line 19) | class IterLoader:
    method __init__ (line 21) | def __init__(self, dataloader):
    method epoch (line 27) | def epoch(self):
    method __next__ (line 30) | def __next__(self):
    method __len__ (line 43) | def __len__(self):
  class IterBasedRunner (line 48) | class IterBasedRunner(BaseRunner):
    method train (line 54) | def train(self, data_loader, **kwargs):
    method val (line 72) | def val(self, data_loader, **kwargs):
    method run (line 87) | def run(self, data_loaders, workflow, max_iters=None, **kwargs):
    method resume (line 140) | def resume(self,
    method save_checkpoint (line 179) | def save_checkpoint(self,
    method register_training_hooks (line 224) | def register_training_hooks(self,

FILE: lavis/common/annotator/uniformer/mmcv/runner/log_buffer.py
  class LogBuffer (line 7) | class LogBuffer:
    method __init__ (line 9) | def __init__(self):
    method clear (line 15) | def clear(self):
    method clear_output (line 20) | def clear_output(self):
    method update (line 24) | def update(self, vars, count=1):
    method average (line 33) | def average(self, n=0):

FILE: lavis/common/annotator/uniformer/mmcv/runner/optimizer/builder.py
  function register_torch_optimizers (line 13) | def register_torch_optimizers():
  function build_optimizer_constructor (line 29) | def build_optimizer_constructor(cfg):
  function build_optimizer (line 33) | def build_optimizer(model, cfg):

FILE: lavis/common/annotator/uniformer/mmcv/runner/optimizer/default_constructor.py
  class DefaultOptimizerConstructor (line 13) | class DefaultOptimizerConstructor:
    method __init__ (line 95) | def __init__(self, optimizer_cfg, paramwise_cfg=None):
    method _validate_cfg (line 105) | def _validate_cfg(self):
    method _is_in (line 128) | def _is_in(self, param_group, param_group_list):
    method add_params (line 137) | def add_params(self, params, module, prefix='', is_dcn_module=None):
    method __call__ (line 234) | def __call__(self, model):

FILE: lavis/common/annotator/uniformer/mmcv/runner/priority.py
  class Priority (line 5) | class Priority(Enum):
  function get_priority (line 42) | def get_priority(priority):

FILE: lavis/common/annotator/uniformer/mmcv/runner/utils.py
  function get_host_info (line 16) | def get_host_info():
  function get_time_str (line 31) | def get_time_str():
  function obj_from_dict (line 35) | def obj_from_dict(info, parent=None, default_args=None):
  function set_random_seed (line 70) | def set_random_seed(seed, deterministic=False, use_rank_shift=False):

FILE: lavis/common/annotator/uniformer/mmcv/utils/config.py
  class ConfigDict (line 33) | class ConfigDict(Dict):
    method __missing__ (line 35) | def __missing__(self, name):
    method __getattr__ (line 38) | def __getattr__(self, name):
  function add_args (line 51) | def add_args(parser, cfg, prefix=''):
  class Config (line 70) | class Config:
    method _validate_py_syntax (line 96) | def _validate_py_syntax(filename):
    method _substitute_predefined_vars (line 107) | def _substitute_predefined_vars(filename, temp_config_name):
    method _pre_substitute_base_vars (line 128) | def _pre_substitute_base_vars(filename, temp_config_name):
    method _substitute_base_vars (line 147) | def _substitute_base_vars(cfg, base_var_dict, base_cfg):
    method _file2dict (line 179) | def _file2dict(filename, use_predefined_variables=True):
    method _merge_a_into_b (line 274) | def _merge_a_into_b(a, b, allow_list_keys=False):
    method fromfile (line 328) | def fromfile(filename,
    method fromstring (line 338) | def fromstring(cfg_str, file_format):
    method auto_argparser (line 366) | def auto_argparser(description=None):
    method __init__ (line 377) | def __init__(self, cfg_dict=None, cfg_text=None, filename=None):
    method filename (line 399) | def filename(self):
    method text (line 403) | def text(self):
    method pretty_text (line 407) | def pretty_text(self):
    method __repr__ (line 500) | def __repr__(self):
    method __len__ (line 503) | def __len__(self):
    method __getattr__ (line 506) | def __getattr__(self, name):
    method __getitem__ (line 509) | def __getitem__(self, name):
    method __setattr__ (line 512) | def __setattr__(self, name, value):
    method __setitem__ (line 517) | def __setitem__(self, name, value):
    method __iter__ (line 522) | def __iter__(self):
    method __getstate__ (line 525) | def __getstate__(self):
    method __setstate__ (line 528) | def __setstate__(self, state):
    method dump (line 534) | def dump(self, file=None):
    method merge_from_dict (line 550) | def merge_from_dict(self, options, allow_list_keys=True):
  class DictAction (line 597) | class DictAction(Action):
    method _parse_int_float_bool (line 607) | def _parse_int_float_bool(val):
    method _parse_iterable (line 621) | def _parse_iterable(val):
    method __call__ (line 683) | def __call__(self, parser, namespace, values, option_string=None):

FILE: lavis/common/annotator/uniformer/mmcv/utils/env.py
  function collect_env (line 16) | def collect_env():

FILE: lavis/common/annotator/uniformer/mmcv/utils/ext_loader.py
  function load_ext (line 12) | def load_ext(name, funcs):
  function get_fake_func (line 41) | def get_fake_func(name, e):
  function load_ext (line 49) | def load_ext(name, funcs):
  function check_ops_exist (line 69) | def check_ops_exist():

FILE: lavis/common/annotator/uniformer/mmcv/utils/logging.py
  function get_logger (line 9) | def get_logger(name, log_file=None, log_level=logging.INFO, file_mode='w'):
  function print_log (line 85) | def print_log(msg, logger=None, level=logging.INFO):

FILE: lavis/common/annotator/uniformer/mmcv/utils/misc.py
  function _ntuple (line 14) | def _ntuple(n):
  function is_str (line 31) | def is_str(x):
  function import_modules_from_strings (line 39) | def import_modules_from_strings(imports, allow_failed_imports=False):
  function iter_cast (line 87) | def iter_cast(inputs, dst_type, return_type=None):
  function list_cast (line 112) | def list_cast(inputs, dst_type):
  function tuple_cast (line 120) | def tuple_cast(inputs, dst_type):
  function is_seq_of (line 128) | def is_seq_of(seq, expected_type, seq_type=None):
  function is_list_of (line 152) | def is_list_of(seq, expected_type):
  function is_tuple_of (line 160) | def is_tuple_of(seq, expected_type):
  function slice_list (line 168) | def slice_list(in_list, lens):
  function concat_list (line 194) | def concat_list(in_list):
  function check_prerequisites (line 206) | def check_prerequisites(
  function _check_py_package (line 244) | def _check_py_package(package):
  function _check_executable (line 253) | def _check_executable(cmd):
  function requires_package (line 260) | def requires_package(prerequisites):
  function requires_executable (line 276) | def requires_executable(prerequisites):
  function deprecated_api_warning (line 288) | def deprecated_api_warning(name_dict, cls_name=None):
  function is_method_overridden (line 348) | def is_method_overridden(method, base_class, derived_class):
  function has_method (line 367) | def has_method(obj: object, method: str) -> bool:

FILE: lavis/common/annotator/uniformer/mmcv/utils/parrots_jit.py
  function jit (line 12) | def jit(func=None,
  function skip_no_elena (line 36) | def skip_no_elena(func):

FILE: lavis/common/annotator/uniformer/mmcv/utils/parrots_wrapper.py
  function is_rocm_pytorch (line 9) | def is_rocm_pytorch() -> bool:
  function _get_cuda_home (line 21) | def _get_cuda_home():
  function get_build_config (line 33) | def get_build_config():
  function _get_conv (line 41) | def _get_conv():
  function _get_dataloader (line 49) | def _get_dataloader():
  function _get_extension (line 58) | def _get_extension():
  function _get_pool (line 69) | def _get_pool():
  function _get_norm (line 81) | def _get_norm():
  class SyncBatchNorm (line 99) | class SyncBatchNorm(SyncBatchNorm_):
    method _check_input_dim (line 101) | def _check_input_dim(self, input):

FILE: lavis/common/annotator/uniformer/mmcv/utils/path.py
  function is_filepath (line 9) | def is_filepath(x):
  function fopen (line 13) | def fopen(filepath, *args, **kwargs):
  function check_file_exist (line 21) | def check_file_exist(filename, msg_tmpl='file "{}" does not exist'):
  function mkdir_or_exist (line 26) | def mkdir_or_exist(dir_name, mode=0o777):
  function symlink (line 33) | def symlink(src, dst, overwrite=True, **kwargs):
  function scandir (line 39) | def scandir(dir_path, suffix=None, recursive=False, case_sensitive=True):
  function find_vcs_root (line 83) | def find_vcs_root(path, markers=('.git', )):

FILE: lavis/common/annotator/uniformer/mmcv/utils/progressbar.py
  class ProgressBar (line 10) | class ProgressBar:
    method __init__ (line 13) | def __init__(self, task_num=0, bar_width=50, start=True, file=sys.stdo...
    method terminal_width (line 22) | def terminal_width(self):
    method start (line 26) | def start(self):
    method update (line 35) | def update(self, num_tasks=1):
  function track_progress (line 64) | def track_progress(func, tasks, bar_width=50, file=sys.stdout, **kwargs):
  function init_pool (line 98) | def init_pool(process_num, initializer=None, initargs=None):
  function track_parallel_progress (line 109) | def track_parallel_progress(func,
  function track_iter_progress (line 179) | def track_iter_progress(tasks, bar_width=50, file=sys.stdout):

FILE: lavis/common/annotator/uniformer/mmcv/utils/registry.py
  function build_from_cfg (line 9) | def build_from_cfg(cfg, registry, default_args=None):
  class Registry (line 58) | class Registry:
    method __init__ (line 88) | def __init__(self, name, build_func=None, parent=None, scope=None):
    method __len__ (line 112) | def __len__(self):
    method __contains__ (line 115) | def __contains__(self, key):
    method __repr__ (line 118) | def __repr__(self):
    method infer_scope (line 125) | def infer_scope():
    method split_scope_key (line 149) | def split_scope_key(key):
    method name (line 171) | def name(self):
    method scope (line 175) | def scope(self):
    method module_dict (line 179) | def module_dict(self):
    method children (line 183) | def children(self):
    method get (line 186) | def get(self, key):
    method build (line 211) | def build(self, *args, **kwargs):
    method _add_children (line 214) | def _add_children(self, registry):
    method _register_module (line 235) | def _register_module(self, module_class, module_name=None, force=False):
    method deprecated_register_module (line 250) | def deprecated_register_module(self, cls=None, force=False):
    method register_module (line 260) | def register_module(self, name=None, force=False, module=None):

FILE: lavis/common/annotator/uniformer/mmcv/utils/testing.py
  function check_python_script (line 10) | def check_python_script(cmd):
  function _any (line 25) | def _any(judge_result):
  function assert_dict_contains_subset (line 42) | def assert_dict_contains_subset(dict_obj: Dict[Any, Any],
  function assert_attrs_equal (line 61) | def assert_attrs_equal(obj: Any, expected_attrs: Dict[str, Any]) -> bool:
  function assert_dict_has_keys (line 77) | def assert_dict_has_keys(obj: Dict[str, Any],
  function assert_keys_equal (line 92) | def assert_keys_equal(result_keys: List[str], target_keys: List[str]) ->...
  function assert_is_norm_layer (line 105) | def assert_is_norm_layer(module) -> bool:
  function assert_params_all_zeros (line 120) | def assert_params_all_zeros(module) -> bool:

FILE: lavis/common/annotator/uniformer/mmcv/utils/timer.py
  class TimerError (line 5) | class TimerError(Exception):
    method __init__ (line 7) | def __init__(self, message):
  class Timer (line 12) | class Timer:
    method __init__ (line 38) | def __init__(self, start=True, print_tmpl=None):
    method is_running (line 45) | def is_running(self):
    method __enter__ (line 49) | def __enter__(self):
    method __exit__ (line 53) | def __exit__(self, type, value, traceback):
    method start (line 57) | def start(self):
    method since_start (line 64) | def since_start(self):
    method since_last_check (line 74) | def since_last_check(self):
  function check_time (line 92) | def check_time(timer_id):

FILE: lavis/common/annotator/uniformer/mmcv/utils/trace.py
  function is_jit_tracing (line 8) | def is_jit_tracing() -> bool:

FILE: lavis/common/annotator/uniformer/mmcv/utils/version_utils.py
  function digit_version (line 9) | def digit_version(version_str: str, length: int = 4):
  function _minimal_ext_cmd (line 50) | def _minimal_ext_cmd(cmd):
  function get_git_hash (line 66) | def get_git_hash(fallback='unknown', digits=None):

FILE: lavis/common/annotator/uniformer/mmcv/version.py
  function parse_version_info (line 5) | def parse_version_info(version_str: str, length: int = 4) -> tuple:

FILE: lavis/common/annotator/uniformer/mmcv/video/io.py
  class Cache (line 14) | class Cache:
    method __init__ (line 16) | def __init__(self, capacity):
    method capacity (line 23) | def capacity(self):
    method size (line 27) | def size(self):
    method put (line 30) | def put(self, key, val):
    method get (line 37) | def get(self, key, default=None):
  class VideoReader (line 42) | class VideoReader:
    method __init__ (line 64) | def __init__(self, filename, cache_capacity=10):
    method vcap (line 80) | def vcap(self):
    method opened (line 85) | def opened(self):
    method width (line 90) | def width(self):
    method height (line 95) | def height(self):
    method resolution (line 100) | def resolution(self):
    method fps (line 105) | def fps(self):
    method frame_cnt (line 110) | def frame_cnt(self):
    method fourcc (line 115) | def fourcc(self):
    method position (line 120) | def position(self):
    method _get_real_position (line 124) | def _get_real_position(self):
    method _set_real_position (line 127) | def _set_real_position(self, frame_id):
    method read (line 134) | def read(self):
    method get_frame (line 160) | def get_frame(self, frame_id):
    method current_frame (line 187) | def current_frame(self):
    method cvt2frames (line 198) | def cvt2frames(self,
    method __len__ (line 240) | def __len__(self):
    method __getitem__ (line 243) | def __getitem__(self, index):
    method __iter__ (line 256) | def __iter__(self):
    method __next__ (line 260) | def __next__(self):
    method __enter__ (line 269) | def __enter__(self):
    method __exit__ (line 272) | def __exit__(self, exc_type, exc_value, traceback):
  function frames2video (line 276) | def frames2video(frame_dir,

FILE: lavis/common/annotator/uniformer/mmcv/video/optflow.py
  function flowread (line 12) | def flowread(flow_or_path, quantize=False, concat_axis=0, *args, **kwargs):
  function flowwrite (line 61) | def flowwrite(flow, filename, quantize=False, concat_axis=0, *args, **kw...
  function quantize_flow (line 91) | def quantize_flow(flow, max_val=0.02, norm=True):
  function dequantize_flow (line 119) | def dequantize_flow(dx, dy, max_val=0.02, denorm=True):
  function flow_warp (line 143) | def flow_warp(img, flow, filling_value=0, interpolate_mode='nearest'):
  function flow_from_bytes (line 204) | def flow_from_bytes(content):
  function sparse_flow_from_bytes (line 234) | def sparse_flow_from_bytes(content):

FILE: lavis/common/annotator/uniformer/mmcv/video/processing.py
  function convert_video (line 11) | def convert_video(in_file,
  function resize_video (line 55) | def resize_video(in_file,
  function cut_video (line 93) | def cut_video(in_file,
  function concat_video (line 128) | def concat_video(video_list,

FILE: lavis/common/annotator/uniformer/mmcv/visualization/color.py
  class Color (line 9) | class Color(Enum):
  function color_val (line 24) | def color_val(color):

FILE: lavis/common/annotator/uniformer/mmcv/visualization/image.py
  function imshow (line 9) | def imshow(img, win_name='', wait_time=0):
  function imshow_bboxes (line 30) | def imshow_bboxes(img,
  function imshow_det_bboxes (line 84) | def imshow_det_bboxes(img,

FILE: lavis/common/annotator/uniformer/mmcv/visualization/optflow.py
  function flowshow (line 11) | def flowshow(flow, win_name='', wait_time=0):
  function flow2rgb (line 24) | def flow2rgb(flow, color_wheel=None, unknown_thr=1e6):
  function make_color_wheel (line 76) | def make_color_wheel(bins=None):

FILE: lavis/common/annotator/uniformer/mmcv_custom/checkpoint.py
  function _get_mmcv_home (line 30) | def _get_mmcv_home():
  function load_state_dict (line 41) | def load_state_dict(module, state_dict, strict=False, logger=None):
  function load_url_dist (line 109) | def load_url_dist(url, model_dir=None):
  function load_pavimodel_dist (line 123) | def load_pavimodel_dist(model_path, map_location=None):
  function load_fileclient_dist (line 151) | def load_fileclient_dist(filename, backend, map_location):
  function get_torchvision_models (line 172) | def get_torchvision_models():
  function get_external_models (line 184) | def get_external_models():
  function get_mmcls_models (line 198) | def get_mmcls_models():
  function get_deprecated_model_names (line 205) | def get_deprecated_model_names():
  function _process_mmcls_checkpoint (line 214) | def _process_mmcls_checkpoint(checkpoint):
  function _load_checkpoint (line 225) | def _load_checkpoint(filename, map_location=None):
  function load_checkpoint (line 286) | def load_checkpoint(model,
  function weights_to_cpu (line 359) | def weights_to_cpu(state_dict):
  function _save_to_state_dict (line 374) | def _save_to_state_dict(module, destination, prefix, keep_vars):
  function get_state_dict (line 394) | def get_state_dict(module, destination=None, prefix='', keep_vars=False):
  function save_checkpoint (line 438) | def save_checkpoint(model, filename, optimizer=None, meta=None):

FILE: lavis/common/annotator/uniformer/mmseg/apis/inference.py
  function init_segmentor (line 11) | def init_segmentor(config, checkpoint=None, device='cuda:0'):
  class LoadImage (line 42) | class LoadImage:
    method __call__ (line 45) | def __call__(self, results):
  function inference_segmentor (line 69) | def inference_segmentor(model, img):
  function show_result_pyplot (line 101) | def show_result_pyplot(model,

FILE: lavis/common/annotator/uniformer/mmseg/apis/test.py
  function np2tmp (line 14) | def np2tmp(array, temp_file_name=None):
  function single_gpu_test (line 34) | def single_gpu_test(model,
  function multi_gpu_test (line 106) | def multi_gpu_test(model,
  function collect_results_cpu (line 164) | def collect_results_cpu(result_part, size, tmpdir=None):
  function collect_results_gpu (line 207) | def collect_results_gpu(result_part, size):

FILE: lavis/common/annotator/uniformer/mmseg/apis/train.py
  function set_random_seed (line 14) | def set_random_seed(seed, deterministic=False):
  function train_segmentor (line 33) | def train_segmentor(model,

FILE: lavis/common/annotator/uniformer/mmseg/core/evaluation/class_names.py
  function cityscapes_classes (line 4) | def cityscapes_classes():
  function ade_classes (line 14) | def ade_classes():
  function voc_classes (line 44) | def voc_classes():
  function cityscapes_palette (line 54) | def cityscapes_palette():
  function ade_palette (line 63) | def ade_palette():
  function voc_palette (line 105) | def voc_palette():
  function get_classes (line 121) | def get_classes(dataset):
  function get_palette (line 138) | def get_palette(dataset):

FILE: lavis/common/annotator/uniformer/mmseg/core/evaluation/eval_hooks.py
  class EvalHook (line 7) | class EvalHook(_EvalHook):
    method __init__ (line 22) | def __init__(self, *args, by_epoch=False, efficient_test=False, **kwar...
    method after_train_iter (line 26) | def after_train_iter(self, runner):
    method after_train_epoch (line 42) | def after_train_epoch(self, runner):
  class DistEvalHook (line 55) | class DistEvalHook(_DistEvalHook):
    method __init__ (line 70) | def __init__(self, *args, by_epoch=False, efficient_test=False, **kwar...
    method after_train_iter (line 74) | def after_train_iter(self, runner):
    method after_train_epoch (line 93) | def after_train_epoch(self, runner):

FILE: lavis/common/annotator/uniformer/mmseg/core/evaluation/metrics.py
  function f_score (line 8) | def f_score(precision, recall, beta=1):
  function intersect_and_union (line 25) | def intersect_and_union(pred_label,
  function total_intersect_and_union (line 88) | def total_intersect_and_union(results,
  function mean_iou (line 133) | def mean_iou(results,
  function mean_dice (line 172) | def mean_dice(results,
  function mean_fscore (line 212) | def mean_fscore(results,
  function eval_metrics (line 257) | def eval_metrics(results,

FILE: lavis/common/annotator/uniformer/mmseg/core/seg/builder.py
  function build_pixel_sampler (line 6) | def build_pixel_sampler(cfg, **default_args):

FILE: lavis/common/annotator/uniformer/mmseg/core/seg/sampler/base_pixel_sampler.py
  class BasePixelSampler (line 4) | class BasePixelSampler(metaclass=ABCMeta):
    method __init__ (line 7) | def __init__(self, **kwargs):
    method sample (line 11) | def sample(self, seg_logit, seg_label):

FILE: lavis/common/annotator/uniformer/mmseg/core/seg/sampler/ohem_pixel_sampler.py
  class OHEMPixelSampler (line 9) | class OHEMPixelSampler(BasePixelSampler):
    method __init__ (line 23) | def __init__(self, context, thresh=None, min_kept=100000):
    method sample (line 30) | def sample(self, seg_logit, seg_label):

FILE: lavis/common/annotator/uniformer/mmseg/core/utils/misc.py
  function add_prefix (line 1) | def add_prefix(inputs, prefix):

FILE: lavis/common/annotator/uniformer/mmseg/datasets/ade.py
  class ADE20KDataset (line 6) | class ADE20KDataset(CustomDataset):
    method __init__ (line 79) | def __init__(self, **kwargs):

FILE: lavis/common/annotator/uniformer/mmseg/datasets/builder.py
  function _concat_dataset (line 25) | def _concat_dataset(cfg, default_args=None):
  function build_dataset (line 61) | def build_dataset(cfg, default_args=None):
  function build_dataloader (line 78) | def build_dataloader(dataset,
  function worker_init_fn (line 155) | def worker_init_fn(worker_id, num_workers, rank, seed):

FILE: lavis/common/annotator/uniformer/mmseg/datasets/chase_db1.py
  class ChaseDB1Dataset (line 8) | class ChaseDB1Dataset(CustomDataset):
    method __init__ (line 21) | def __init__(self, **kwargs):

FILE: lavis/common/annotator/uniformer/mmseg/datasets/cityscapes.py
  class CityscapesDataset (line 14) | class CityscapesDataset(CustomDataset):
    method __init__ (line 32) | def __init__(self, **kwargs):
    method _convert_to_label_id (line 39) | def _convert_to_label_id(result):
    method results2img (line 50) | def results2img(self, results, imgfile_prefix, to_label_id):
    method format_results (line 91) | def format_results(self, results, imgfile_prefix=None, to_label_id=True):
    method evaluate (line 124) | def evaluate(self,
    method _evaluate_cityscapes (line 164) | def _evaluate_cityscapes(self, results, logger, imgfile_prefix):

FILE: lavis/common/annotator/uniformer/mmseg/datasets/custom.py
  class CustomDataset (line 19) | class CustomDataset(Dataset):
    method __init__ (line 75) | def __init__(self,
    method __len__ (line 116) | def __len__(self):
    method load_annotations (line 120) | def load_annotations(self, img_dir, img_suffix, ann_dir, seg_map_suffix,
    method get_ann_info (line 158) | def get_ann_info(self, idx):
    method pre_pipeline (line 170) | def pre_pipeline(self, results):
    method __getitem__ (line 178) | def __getitem__(self, idx):
    method prepare_train_img (line 194) | def prepare_train_img(self, idx):
    method prepare_test_img (line 211) | def prepare_test_img(self, idx):
    method format_results (line 227) | def format_results(self, results, **kwargs):
    method get_gt_seg_maps (line 230) | def get_gt_seg_maps(self, efficient_test=False):
    method get_classes_and_palette (line 243) | def get_classes_and_palette(self, classes=None, palette=None):
    method get_palette_for_custom_classes (line 287) | def get_palette_for_custom_classes(self, class_names, palette=None):
    method evaluate (line 306) | def evaluate(self,

FILE: lavis/common/annotator/uniformer/mmseg/datasets/dataset_wrappers.py
  class ConcatDataset (line 7) | class ConcatDataset(_ConcatDataset):
    method __init__ (line 17) | def __init__(self, datasets):
  class RepeatDataset (line 24) | class RepeatDataset(object):
    method __init__ (line 37) | def __init__(self, dataset, times):
    method __getitem__ (line 44) | def __getitem__(self, idx):
    method __len__ (line 48) | def __len__(self):

FILE: lavis/common/annotator/uniformer/mmseg/datasets/drive.py
  class DRIVEDataset (line 8) | class DRIVEDataset(CustomDataset):
    method __init__ (line 21) | def __init__(self, **kwargs):

FILE: lavis/common/annotator/uniformer/mmseg/datasets/hrf.py
  class HRFDataset (line 8) | class HRFDataset(CustomDataset):
    method __init__ (line 21) | def __init__(self, **kwargs):

FILE: lavis/common/annotator/uniformer/mmseg/datasets/pascal_context.py
  class PascalContextDataset (line 8) | class PascalContextDataset(CustomDataset):
    method __init__ (line 47) | def __init__(self, split, **kwargs):
  class PascalContextDataset59 (line 58) | class PascalContextDataset59(CustomDataset):
    method __init__ (line 96) | def __init__(self, split, **kwargs):

FILE: lavis/common/annotator/uniformer/mmseg/datasets/pipelines/compose.py
  class Compose (line 9) | class Compose(object):
    method __init__ (line 17) | def __init__(self, transforms):
    method __call__ (line 29) | def __call__(self, data):
    method __repr__ (line 45) | def __repr__(self):

FILE: lavis/common/annotator/uniformer/mmseg/datasets/pipelines/formating.py
  function to_tensor (line 11) | def to_tensor(data):
  class ToTensor (line 37) | class ToTensor(object):
    method __init__ (line 44) | def __init__(self, keys):
    method __call__ (line 47) | def __call__(self, results):
    method __repr__ (line 62) | def __repr__(self):
  class ImageToTensor (line 67) | class ImageToTensor(object):
    method __init__ (line 78) | def __init__(self, keys):
    method __call__ (line 81) | def __call__(self, results):
    method __repr__ (line 100) | def __repr__(self):
  class Transpose (line 105) | class Transpose(object):
    method __init__ (line 113) | def __init__(self, keys, order):
    method __call__ (line 117) | def __call__(self, results):
    method __repr__ (line 133) | def __repr__(self):
  class ToDataContainer (line 139) | class ToDataContainer(object):
    method __init__ (line 150) | def __init__(self,
    method __call__ (line 155) | def __call__(self, results):
    method __repr__ (line 173) | def __repr__(self):
  class DefaultFormatBundle (line 178) | class DefaultFormatBundle(object):
    method __call__ (line 189) | def __call__(self, results):
    method __repr__ (line 214) | def __repr__(self):
  class Collect (line 219) | class Collect(object):
    method __init__ (line 256) | def __init__(self,
    method __call__ (line 264) | def __call__(self, results):
    method __repr__ (line 286) | def __repr__(self):

FILE: lavis/common/annotator/uniformer/mmseg/datasets/pipelines/loading.py
  class LoadImageFromFile (line 10) | class LoadImageFromFile(object):
    method __init__ (line 31) | def __init__(self,
    method __call__ (line 42) | def __call__(self, results):
    method __repr__ (line 81) | def __repr__(self):
  class LoadAnnotations (line 90) | class LoadAnnotations(object):
    method __init__ (line 104) | def __init__(self,
    method __call__ (line 113) | def __call__(self, results):
    method __repr__ (line 149) | def __repr__(self):

FILE: lavis/common/annotator/uniformer/mmseg/datasets/pipelines/test_time_aug.py
  class MultiScaleFlipAug (line 10) | class MultiScaleFlipAug(object):
    method __init__ (line 53) | def __init__(self,
    method __call__ (line 93) | def __call__(self, results):
    method __repr__ (line 128) | def __repr__(self):

FILE: lavis/common/annotator/uniformer/mmseg/datasets/pipelines/transforms.py
  class Resize (line 10) | class Resize(object):
    method __init__ (line 41) | def __init__(self,
    method random_select (line 68) | def random_select(img_scales):
    method random_sample (line 86) | def random_sample(img_scales):
    method random_sample_ratio (line 113) | def random_sample_ratio(img_scale, ratio_range):
    method _random_scale (line 139) | def _random_scale(self, results):
    method _resize_img (line 177) | def _resize_img(self, results):
    method _resize_seg (line 199) | def _resize_seg(self, results):
    method __call__ (line 210) | def __call__(self, results):
    method __repr__ (line 228) | def __repr__(self):
  class RandomFlip (line 238) | class RandomFlip(object):
    method __init__ (line 252) | def __init__(self, prob=None, direction='horizontal'):
    method __call__ (line 259) | def __call__(self, results):
    method __repr__ (line 288) | def __repr__(self):
  class Pad (line 293) | class Pad(object):
    method __init__ (line 308) | def __init__(self,
    method _pad_img (line 321) | def _pad_img(self, results):
    method _pad_seg (line 334) | def _pad_seg(self, results):
    method __call__ (line 342) | def __call__(self, results):
    method __repr__ (line 356) | def __repr__(self):
  class Normalize (line 364) | class Normalize(object):
    method __init__ (line 376) | def __init__(self, mean, std, to_rgb=True):
    method __call__ (line 381) | def __call__(self, results):
    method __repr__ (line 398) | def __repr__(self):
  class Rerange (line 406) | class Rerange(object):
    method __init__ (line 416) | def __init__(self, min_value=0, max_value=255):
    method __call__ (line 423) | def __call__(self, results):
    method __repr__ (line 445) | def __repr__(self):
  class CLAHE (line 452) | class CLAHE(object):
    method __init__ (line 465) | def __init__(self, clip_limit=40.0, tile_grid_size=(8, 8)):
    method __call__ (line 472) | def __call__(self, results):
    method __repr__ (line 489) | def __repr__(self):
  class RandomCrop (line 497) | class RandomCrop(object):
    method __init__ (line 506) | def __init__(self, crop_size, cat_max_ratio=1., ignore_index=255):
    method get_crop_bbox (line 512) | def get_crop_bbox(self, img):
    method crop (line 523) | def crop(self, img, crop_bbox):
    method __call__ (line 529) | def __call__(self, results):
    method __repr__ (line 565) | def __repr__(self):
  class RandomRotate (line 570) | class RandomRotate(object):
    method __init__ (line 588) | def __init__(self,
    method __call__ (line 609) | def __call__(self, results):
    method __repr__ (line 641) | def __repr__(self):
  class RGB2Gray (line 653) | class RGB2Gray(object):
    method __init__ (line 668) | def __init__(self, out_channels=None, weights=(0.299, 0.587, 0.114)):
    method __call__ (line 676) | def __call__(self, results):
    method __repr__ (line 700) | def __repr__(self):
  class AdjustGamma (line 708) | class AdjustGamma(object):
    method __init__ (line 716) | def __init__(self, gamma=1.0):
    method __call__ (line 724) | def __call__(self, results):
    method __repr__ (line 739) | def __repr__(self):
  class SegRescale (line 744) | class SegRescale(object):
    method __init__ (line 751) | def __init__(self, scale_factor=1):
    method __call__ (line 754) | def __call__(self, results):
    method __repr__ (line 769) | def __repr__(self):
  class PhotoMetricDistortion (line 774) | class PhotoMetricDistortion(object):
    method __init__ (line 794) | def __init__(self,
    method convert (line 804) | def convert(self, img, alpha=1, beta=0):
    method brightness (line 810) | def brightness(self, img):
    method contrast (line 819) | def contrast(self, img):
    method saturation (line 827) | def saturation(self, img):
    method hue (line 838) | def hue(self, img):
    method __call__ (line 848) | def __call__(self, results):
    method __repr__ (line 881) | def __repr__(self):

FILE: lavis/common/annotator/uniformer/mmseg/datasets/stare.py
  class STAREDataset (line 8) | class STAREDataset(CustomDataset):
    method __init__ (line 21) | def __init__(self, **kwargs):

FILE: lavis/common/annotator/uniformer/mmseg/datasets/voc.py
  class PascalVOCDataset (line 8) | class PascalVOCDataset(CustomDataset):
    method __init__ (line 26) | def __init__(self, split, **kwargs):

FILE: lavis/common/annotator/uniformer/mmseg/models/backbones/cgnet.py
  class GlobalContextExtractor (line 13) | class GlobalContextExtractor(nn.Module):
    method __init__ (line 26) | def __init__(self, channel, reduction=16, with_cp=False):
    method forward (line 37) | def forward(self, x):
  class ContextGuidedBlock (line 53) | class ContextGuidedBlock(nn.Module):
    method __init__ (line 78) | def __init__(self,
    method forward (line 142) | def forward(self, x):
  class InputInjection (line 170) | class InputInjection(nn.Module):
    method __init__ (line 173) | def __init__(self, num_downsampling):
    method forward (line 179) | def forward(self, x):
  class CGNet (line 186) | class CGNet(nn.Module):
    method __init__ (line 215) | def __init__(self,
    method forward (line 309) | def forward(self, x):
    method init_weights (line 338) | def init_weights(self, pretrained=None):
    method train (line 359) | def train(self, mode=True):

FILE: lavis/common/annotator/uniformer/mmseg/models/backbones/fast_scnn.py
  class LearningToDownsample (line 13) | class LearningToDownsample(nn.Module):
    method __init__ (line 29) | def __init__(self,
    method forward (line 66) | def forward(self, x):
  class GlobalFeatureExtractor (line 73) | class GlobalFeatureExtractor(nn.Module):
    method __init__ (line 106) | def __init__(self,
    method _make_layer (line 148) | def _make_layer(self,
    method forward (line 172) | def forward(self, x):
  class FeatureFusionModule (line 181) | class FeatureFusionModule(nn.Module):
    method __init__ (line 199) | def __init__(self,
    method forward (line 235) | def forward(self, higher_res_feature, lower_res_feature):
  class FastSCNN (line 250) | class FastSCNN(nn.Module):
    method __init__ (line 296) | def __init__(self,
    method init_weights (line 360) | def init_weights(self, pretrained=None):
    method forward (line 367) | def forward(self, x):

FILE: lavis/common/annotator/uniformer/mmseg/models/backbones/hrnet.py
  class HRModule (line 13) | class HRModule(nn.Module):
    method __init__ (line 20) | def __init__(self,
    method _check_branches (line 46) | def _check_branches(self, num_branches, num_blocks, in_channels,
    method _make_one_branch (line 64) | def _make_one_branch(self,
    method _make_branches (line 109) | def _make_branches(self, num_branches, block, num_blocks, num_channels):
    method _make_fuse_layers (line 119) | def _make_fuse_layers(self):
    method forward (line 185) | def forward(self, x):
  class HRNet (line 212) | class HRNet(nn.Module):
    method __init__ (line 273) | def __init__(self,
    method norm1 (line 362) | def norm1(self):
    method norm2 (line 367) | def norm2(self):
    method _make_transition_layer (line 371) | def _make_transition_layer(self, num_channels_pre_layer,
    method _make_layer (line 418) | def _make_layer(self, block, inplanes, planes, blocks, stride=1):
    method _make_stage (line 454) | def _make_stage(self, layer_config, in_channels, multiscale_output=True):
    method init_weights (line 484) | def init_weights(self, pretrained=None):
    method forward (line 510) | def forward(self, x):
    method train (line 547) | def train(self, mode=True):

FILE: lavis/common/annotator/uniformer/mmseg/models/backbones/mobilenet_v2.py
  class MobileNetV2 (line 13) | class MobileNetV2(nn.Module):
    method __init__ (line 45) | def __init__(self,
    method make_layer (line 107) | def make_layer(self, out_channels, num_blocks, stride, dilation,
    method init_weights (line 136) | def init_weights(self, pretrained=None):
    method forward (line 149) | def forward(self, x):
    method _freeze_stages (line 164) | def _freeze_stages(self):
    method train (line 174) | def train(self, mode=True):

FILE: lavis/common/annotator/uniformer/mmseg/models/backbones/mobilenet_v3.py
  class MobileNetV3 (line 15) | class MobileNetV3(nn.Module):
    method __init__ (line 70) | def __init__(self,
    method _make_layer (line 104) | def _make_layer(self):
    method init_weights (line 220) | def init_weights(self, pretrained=None):
    method forward (line 233) | def forward(self, x):
    method _freeze_stages (line 242) | def _freeze_stages(self):
    method train (line 249) | def train(self, mode=True):

FILE: lavis/common/annotator/uniformer/mmseg/models/backbones/resnest.py
  class RSoftmax (line 15) | class RSoftmax(nn.Module):
    method __init__ (line 23) | def __init__(self, radix, groups):
    method forward (line 28) | def forward(self, x):
  class SplitAttentionConv2d (line 39) | class SplitAttentionConv2d(nn.Module):
    method __init__ (line 58) | def __init__(self,
    method norm0 (line 108) | def norm0(self):
    method norm1 (line 113) | def norm1(self):
    method forward (line 117) | def forward(self, x):
  class Bottleneck (line 146) | class Bottleneck(_Bottleneck):
    method __init__ (line 165) | def __init__(self,
    method forward (line 226) | def forward(self, x):
  class ResNeSt (line 270) | class ResNeSt(ResNetV1d):
    method __init__ (line 291) | def __init__(self,
    method make_res_layer (line 305) | def make_res_layer(self, **kwargs):

FILE: lavis/common/annotator/uniformer/mmseg/models/backbones/resnet.py
  class BasicBlock (line 13) | class BasicBlock(nn.Module):
    method __init__ (line 18) | def __init__(self,
    method norm1 (line 58) | def norm1(self):
    method norm2 (line 63) | def norm2(self):
    method forward (line 67) | def forward(self, x):
  class Bottleneck (line 97) | class Bottleneck(nn.Module):
    method __init__ (line 106) | def __init__(self,
    method make_block_plugins (line 219) | def make_block_plugins(self, in_channels, plugins):
    method forward_plugin (line 242) | def forward_plugin(self, x, plugin_names):
    method norm1 (line 250) | def norm1(self):
    method norm2 (line 255) | def norm2(self):
    method norm3 (line 260) | def norm3(self):
    method forward (line 264) | def forward(self, x):
  class ResNet (line 308) | class ResNet(nn.Module):
    method __init__ (line 373) | def __init__(self,
    method make_stage_plugins (line 470) | def make_stage_plugins(self, plugins, stage_idx):
    method make_res_layer (line 523) | def make_res_layer(self, **kwargs):
    method norm1 (line 528) | def norm1(self):
    method _make_stem_layer (line 532) | def _make_stem_layer(self, in_channels, stem_channels):
    method _freeze_stages (line 581) | def _freeze_stages(self):
    method init_weights (line 600) | def init_weights(self, pretrained=None):
    method forward (line 632) | def forward(self, x):
    method train (line 649) | def train(self, mode=True):
  class ResNetV1c (line 662) | class ResNetV1c(ResNet):
    method __init__ (line 672) | def __init__(self, **kwargs):
  class ResNetV1d (line 678) | class ResNetV1d(ResNet):
    method __init__ (line 686) | def __init__(self, **kwargs):

FILE: lavis/common/annotator/uniformer/mmseg/models/backbones/resnext.py
  class Bottleneck (line 11) | class Bottleneck(_Bottleneck):
    method __init__ (line 18) | def __init__(self,
  class ResNeXt (line 87) | class ResNeXt(ResNet):
    method __init__ (line 134) | def __init__(self, groups=1, base_width=4, **kwargs):
    method make_res_layer (line 139) | def make_res_layer(self, **kwargs):

FILE: lavis/common/annotator/uniformer/mmseg/models/backbones/unet.py
  class BasicConvBlock (line 13) | class BasicConvBlock(nn.Module):
    method __init__ (line 43) | def __init__(self,
    method forward (line 76) | def forward(self, x):
  class DeconvModule (line 87) | class DeconvModule(nn.Module):
    method __init__ (line 105) | def __init__(self,
    method forward (line 137) | def forward(self, x):
  class InterpConv (line 148) | class InterpConv(nn.Module):
    method __init__ (line 179) | def __init__(self,
    method forward (line 211) | def forward(self, x):
  class UNet (line 222) | class UNet(nn.Module):
    method __init__ (line 277) | def __init__(self,
    method forward (line 376) | def forward(self, x):
    method train (line 389) | def train(self, mode=True):
    method _check_input_divisible (line 399) | def _check_input_divisible(self, x):
    method init_weights (line 412) | def init_weights(self, pretrained=None):

FILE: lavis/common/annotator/uniformer/mmseg/models/backbones/uniformer.py
  class Mlp (line 24) | class Mlp(nn.Module):
    method __init__ (line 25) | def __init__(self, in_features, hidden_features=None, out_features=Non...
    method forward (line 34) | def forward(self, x):
  class CMlp (line 43) | class CMlp(nn.Module):
    method __init__ (line 44) | def __init__(self, in_features, hidden_features=None, out_features=Non...
    method forward (line 53) | def forward(self, x):
  class CBlock (line 62) | class CBlock(nn.Module):
    method __init__ (line 63) | def __init__(self, dim, num_heads, mlp_ratio=4., qkv_bias=False, qk_sc...
    method forward (line 77) | def forward(self, x):
  class Attention (line 84) | class Attention(nn.Module):
    method __init__ (line 85) | def __init__(self, dim, num_heads=8, qkv_bias=False, qk_scale=None, at...
    method forward (line 97) | def forward(self, x):
  class SABlock (line 112) | class SABlock(nn.Module):
    method __init__ (line 113) | def __init__(self, dim, num_heads, mlp_ratio=4., qkv_bias=False, qk_sc...
    method forward (line 128) | def forward(self, x):
  function window_partition (line 138) | def window_partition(x, window_size):
  function window_reverse (line 152) | def window_reverse(windows, window_size, H, W):
  class SABlock_Windows (line 168) | class SABlock_Windows(nn.Module):
    method __init__ (line 169) | def __init__(self, dim, num_heads, window_size=14, mlp_ratio=4., qkv_b...
    method forward (line 185) | def forward(self, x):
  class PatchEmbed (line 218) | class PatchEmbed(nn.Module):
    method __init__ (line 221) | def __init__(self, img_size=224, patch_size=16, in_chans=3, embed_dim=...
    method forward (line 232) | def forward(self, x):
  class UniFormer (line 243) | class UniFormer(nn.Module):
    method __init__ (line 248) | def __init__(self, layers=[3, 4, 8, 3], img_size=224, in_chans=3, num_...
    method init_weights (line 358) | def init_weights(self, pretrained):
    method _init_weights (line 363) | def _init_weights(self, m):
    method no_weight_decay (line 373) | def no_weight_decay(self):
    method get_classifier (line 376) | def get_classifier(self):
    method reset_classifier (line 379) | def reset_classifier(self, num_classes, global_pool=''):
    method forward_features (line 383) | def forward_features(self, x):
    method forward (line 420) | def forward(self, x):

FILE: lavis/common/annotator/uniformer/mmseg/models/backbones/vit.py
  class Mlp (line 20) | class Mlp(nn.Module):
    method __init__ (line 36) | def __init__(self,
    method forward (line 50) | def forward(self, x):
  class Attention (line 59) | class Attention(nn.Module):
    method __init__ (line 72) | def __init__(self,
    method forward (line 89) | def forward(self, x):
  class Block (line 105) | class Block(nn.Module):
    method __init__ (line 128) | def __init__(self,
    method forward (line 156) | def forward(self, x):
  class PatchEmbed (line 171) | class PatchEmbed(nn.Module):
    method __init__ (line 183) | def __init__(self,
    method forward (line 201) | def forward(self, x):
  class VisionTransformer (line 206) | class VisionTransformer(nn.Module):
    method __init__ (line 246) | def __init__(self,
    method init_weights (line 314) | def init_weights(self, pretrained=None):
    method _pos_embeding (line 359) | def _pos_embeding(self, img, patched_img, pos_embed):
    method resize_pos_embed (line 392) | def resize_pos_embed(pos_embed, input_shpae, pos_shape, patch_size, mo...
    method forward (line 421) | def forward(self, inputs):
    method train (line 454) | def train(self, mode=True):

FILE: lavis/common/annotator/uniformer/mmseg/models/builder.py
  function build_backbone (line 15) | def build_backbone(cfg):
  function build_neck (line 20) | def build_neck(cfg):
  function build_head (line 25) | def build_head(cfg):
  function build_loss (line 30) | def build_loss(cfg):
  function build_segmentor (line 35) | def build_segmentor(cfg, train_cfg=None, test_cfg=None):

FILE: lavis/common/annotator/uniformer/mmseg/models/decode_heads/ann_head.py
  class PPMConcat (line 10) | class PPMConcat(nn.ModuleList):
    method __init__ (line 18) | def __init__(self, pool_scales=(1, 3, 6, 8)):
    method forward (line 22) | def forward(self, feats):
  class SelfAttentionBlock (line 32) | class SelfAttentionBlock(_SelfAttentionBlock):
    method __init__ (line 52) | def __init__(self, low_in_channels, high_in_channels, channels,
  class AFNB (line 79) | class AFNB(nn.Module):
    method __init__ (line 99) | def __init__(self, low_in_channels, high_in_channels, channels,
    method forward (line 125) | def forward(self, low_feats, high_feats):
  class APNB (line 133) | class APNB(nn.Module):
    method __init__ (line 150) | def __init__(self, in_channels, channels, out_channels, query_scales,
    method forward (line 175) | def forward(self, feats):
  class ANNHead (line 184) | class ANNHead(BaseDecodeHead):
    method __init__ (line 198) | def __init__(self,
    method forward (line 236) | def forward(self, inputs):

FILE: lavis/common/annotator/uniformer/mmseg/models/decode_heads/apc_head.py
  class ACM (line 11) | class ACM(nn.Module):
    method __init__ (line 25) | def __init__(self, pool_scale, fusion, in_channels, channels, conv_cfg,
    method forward (line 78) | def forward(self, x):
  class APCHead (line 110) | class APCHead(BaseDecodeHead):
    method __init__ (line 124) | def __init__(self, pool_scales=(1, 2, 3, 6), fusion=True, **kwargs):
    method forward (line 149) | def forward(self, inputs):

FILE: lavis/common/annotator/uniformer/mmseg/models/decode_heads/aspp_head.py
  class ASPPModule (line 10) | class ASPPModule(nn.ModuleList):
    method __init__ (line 22) | def __init__(self, dilations, in_channels, channels, conv_cfg, norm_cfg,
    method forward (line 43) | def forward(self, x):
  class ASPPHead (line 53) | class ASPPHead(BaseDecodeHead):
    method __init__ (line 64) | def __init__(self, dilations=(1, 6, 12, 18), **kwargs):
    method forward (line 93) | def forward(self, inputs):

FILE: lavis/common/annotator/uniformer/mmseg/models/decode_heads/cascade_decode_head.py
  class BaseCascadeDecodeHead (line 6) | class BaseCascadeDecodeHead(BaseDecodeHead, metaclass=ABCMeta):
    method __init__ (line 10) | def __init__(self, *args, **kwargs):
    method forward (line 14) | def forward(self, inputs, prev_output):
    method forward_train (line 18) | def forward_train(self, inputs, prev_output, img_metas, gt_semantic_seg,
    method forward_test (line 41) | def forward_test(self, inputs, prev_output, img_metas, test_cfg):

FILE: lavis/common/annotator/uniformer/mmseg/models/decode_heads/cc_head.py
  class CCHead (line 13) | class CCHead(FCNHead):
    method __init__ (line 24) | def __init__(self, recurrence=2, **kwargs):
    method forward (line 32) | def forward(self, inputs):

FILE: lavis/common/annotator/uniformer/mmseg/models/decode_heads/da_head.py
  class PAM (line 12) | class PAM(_SelfAttentionBlock):
    method __init__ (line 20) | def __init__(self, in_channels, channels):
    method forward (line 41) | def forward(self, x):
  class CAM (line 49) | class CAM(nn.Module):
    method __init__ (line 52) | def __init__(self):
    method forward (line 56) | def forward(self, x):
  class DAHead (line 75) | class DAHead(BaseDecodeHead):
    method __init__ (line 85) | def __init__(self, pam_channels, **kwargs):
    method pam_cls_seg (line 128) | def pam_cls_seg(self, feat):
    method cam_cls_seg (line 135) | def cam_cls_seg(self, feat):
    method forward (line 142) | def forward(self, inputs):
    method forward_test (line 160) | def forward_test(self, inputs, img_metas, test_cfg):
    method losses (line 164) | def losses(self, seg_logit, seg_label):

FILE: lavis/common/annotator/uniformer/mmseg/models/decode_heads/decode_head.py
  class BaseDecodeHead (line 14) | class BaseDecodeHead(nn.Module, metaclass=ABCMeta):
    method __init__ (line 46) | def __init__(self,
    method extra_repr (line 88) | def extra_repr(self):
    method _init_inputs (line 95) | def _init_inputs(self, in_channels, in_index, input_transform):
    method init_weights (line 133) | def init_weights(self):
    method _transform_inputs (line 137) | def _transform_inputs(self, inputs):
    method forward (line 166) | def forward(self, inputs):
    method forward_train (line 170) | def forward_train(self, inputs, img_metas, gt_semantic_seg, train_cfg):
    method forward_test (line 190) | def forward_test(self, inputs, img_metas, test_cfg):
    method cls_seg (line 207) | def cls_seg(self, feat):
    method losses (line 215) | def losses(self, seg_logit, seg_label):

FILE: lavis/common/annotator/uniformer/mmseg/models/decode_heads/dm_head.py
  class DCM (line 10) | class DCM(nn.Module):
    method __init__ (line 24) | def __init__(self, filter_size, fusion, in_channels, channels, conv_cfg,
    method forward (line 60) | def forward(self, x):
  class DMHead (line 92) | class DMHead(BaseDecodeHead):
    method __init__ (line 106) | def __init__(self, filter_sizes=(1, 3, 5, 7), fusion=False, **kwargs):
    method forward (line 131) | def forward(self, inputs):

FILE: lavis/common/annotator/uniformer/mmseg/models/decode_heads/dnl_head.py
  class DisentangledNonLocal2d (line 9) | class DisentangledNonLocal2d(NonLocal2d):
    method __init__ (line 16) | def __init__(self, *arg, temperature, **kwargs):
    method embedded_gaussian (line 21) | def embedded_gaussian(self, theta_x, phi_x):
    method forward (line 33) | def forward(self, x):
  class DNLHead (line 87) | class DNLHead(FCNHead):
    method __init__ (line 102) | def __init__(self,
    method forward (line 122) | def forward(self, inputs):

FILE: lavis/common/annotator/uniformer/mmseg/models/decode_heads/ema_head.py
  function reduce_mean (line 13) | def reduce_mean(tensor):
  class EMAModule (line 22) | class EMAModule(nn.Module):
    method __init__ (line 31) | def __init__(self, channels, num_bases, num_stages, momentum):
    method forward (line 44) | def forward(self, feats):
  class EMAHead (line 79) | class EMAHead(BaseDecodeHead):
    method __init__ (line 94) | def __init__(self,
    method forward (line 154) | def forward(self, inputs):

FILE: lavis/common/annotator/uniformer/mmseg/models/decode_heads/enc_head.py
  class EncModule (line 11) | class EncModule(nn.Module):
    method __init__ (line 22) | def __init__(self, in_channels, num_codes, conv_cfg, norm_cfg, act_cfg):
    method forward (line 50) | def forward(self, x):
  class EncHead (line 62) | class EncHead(BaseDecodeHead):
    method __init__ (line 78) | def __init__(self,
    method forward (line 129) | def forward(self, inputs):
    method forward_test (line 151) | def forward_test(self, inputs, img_metas, test_cfg):
    method _convert_to_onehot_labels (line 159) | def _convert_to_onehot_labels(seg_label, num_classes):
    method losses (line 178) | def losses(self, seg_logit, seg_label):

FILE: lavis/common/annotator/uniformer/mmseg/models/decode_heads/fcn_head.py
  class FCNHead (line 10) | class FCNHead(BaseDecodeHead):
    method __init__ (line 23) | def __init__(self,
    method forward (line 74) | def forward(self, inputs):

FILE: lavis/common/annotator/uniformer/mmseg/models/decode_heads/fpn_head.py
  class FPNHead (line 11) | class FPNHead(BaseDecodeHead):
    method __init__ (line 23) | def __init__(self, feature_strides, **kwargs):
    method forward (line 54) | def forward(self, inputs):

FILE: lavis/common/annotator/uniformer/mmseg/models/decode_heads/gc_head.py
  class GCHead (line 9) | class GCHead(FCNHead):
    method __init__ (line 23) | def __init__(self,
    method forward (line 38) | def forward(self, inputs):

FILE: lavis/common/annotator/uniformer/mmseg/models/decode_heads/lraspp_head.py
  class LRASPPHead (line 12) | class LRASPPHead(BaseDecodeHead):
    method __init__ (line 23) | def __init__(self, branch_channels=(32, 64), **kwargs):
    method forward (line 68) | def forward(self, inputs):

FILE: lavis/common/annotator/uniformer/mmseg/models/decode_heads/nl_head.py
  class NLHead (line 9) | class NLHead(FCNHead):
    method __init__ (line 23) | def __init__(self,
    method forward (line 40) | def forward(self, inputs):

FILE: lavis/common/annotator/uniformer/mmseg/models/decode_heads/ocr_head.py
  class SpatialGatherModule (line 12) | class SpatialGatherModule(nn.Module):
    method __init__ (line 19) | def __init__(self, scale):
    method forward (line 23) | def forward(self, feats, probs):
  class ObjectAttentionBlock (line 39) | class ObjectAttentionBlock(_SelfAttentionBlock):
    method __init__ (line 42) | def __init__(self, in_channels, channels, scale, conv_cfg, norm_cfg,
    method forward (line 73) | def forward(self, query_feats, key_feats):
  class OCRHead (line 85) | class OCRHead(BaseCascadeDecodeHead):
    method __init__ (line 97) | def __init__(self, ocr_channels, scale=1, **kwargs):
    method forward (line 119) | def forward(self, inputs, prev_output):

FILE: lavis/common/annotator/uniformer/mmseg/models/decode_heads/point_head.py
  function calculate_uncertainty (line 14) | def calculate_uncertainty(seg_logits):
  class PointHead (line 35) | class PointHead(BaseCascadeDecodeHead):
    method __init__ (line 60) | def __init__(self,
    method init_weights (line 104) | def init_weights(self):
    method cls_seg (line 108) | def cls_seg(self, feat):
    method forward (line 115) | def forward(self, fine_grained_point_feats, coarse_point_feats):
    method _get_fine_grained_point_feats (line 123) | def _get_fine_grained_point_feats(self, x, points):
    method _get_coarse_point_feats (line 147) | def _get_coarse_point_feats(self, prev_output, points):
    method forward_train (line 165) | def forward_train(self, inputs, prev_output, img_metas, gt_semantic_seg,
    method forward_test (line 203) | def forward_test(self, inputs, prev_output, img_metas, test_cfg):
    method losses (line 248) | def losses(self, point_logits, point_label):
    method get_points_train (line 256) | def get_points_train(self, seg_logits, uncertainty_func, cfg):
    method get_points_test (line 310) | def get_points_test(self, seg_logits, uncertainty_func, cfg):

FILE: lavis/common/annotator/uniformer/mmseg/models/decode_heads/psa_head.py
  class PSAHead (line 17) | class PSAHead(BaseDecodeHead):
    method __init__ (line 35) | def __init__(self,
    method forward (line 113) | def forward(self, inputs):

FILE: lavis/common/annotator/uniformer/mmseg/models/decode_heads/psp_head.py
  class PPM (line 10) | class PPM(nn.ModuleList):
    method __init__ (line 24) | def __init__(self, pool_scales, in_channels, channels, conv_cfg, norm_...
    method forward (line 46) | def forward(self, x):
  class PSPHead (line 61) | class PSPHead(BaseDecodeHead):
    method __init__ (line 72) | def __init__(self, pool_scales=(1, 2, 3, 6), **kwargs):
    method forward (line 93) | def forward(self, inputs):

FILE: lavis/common/annotator/uniformer/mmseg/models/decode_heads/sep_aspp_head.py
  class DepthwiseSeparableASPPModule (line 10) | class DepthwiseSeparableASPPModule(ASPPModule):
    method __init__ (line 14) | def __init__(self, **kwargs):
  class DepthwiseSeparableASPPHead (line 29) | class DepthwiseSeparableASPPHead(ASPPHead):
    method __init__ (line 42) | def __init__(self, c1_in_channels, c1_channels, **kwargs):
    method forward (line 78) | def forward(self, inputs):

FILE: lavis/common/annotator/uniformer/mmseg/models/decode_heads/sep_fcn_head.py
  class DepthwiseSeparableFCNHead (line 8) | class DepthwiseSeparableFCNHead(FCNHead):
    method __init__ (line 29) | def __init__(self, **kwargs):

FILE: lavis/common/annotator/uniformer/mmseg/models/decode_heads/uper_head.py
  class UPerHead (line 12) | class UPerHead(BaseDecodeHead):
    method __init__ (line 23) | def __init__(self, pool_scales=(1, 2, 3, 6), **kwargs):
    method psp_forward (line 76) | def psp_forward(self, inputs):
    method forward (line 86) | def forward(self, inputs):

FILE: lavis/common/annotator/uniformer/mmseg/models/losses/accuracy.py
  function accuracy (line 4) | def accuracy(pred, target, topk=1, thresh=None):
  class Accuracy (line 52) | class Accuracy(nn.Module):
    method __init__ (line 55) | def __init__(self, topk=(1, ), thresh=None):
    method forward (line 68) | def forward(self, pred, target):

FILE: lavis/common/annotator/uniformer/mmseg/models/losses/cross_entropy_loss.py
  function cross_entropy (line 9) | def cross_entropy(pred,
  function _expand_onehot_labels (line 35) | def _expand_onehot_labels(labels, label_weights, target_shape, ignore_in...
  function binary_cross_entropy (line 57) | def binary_cross_entropy(pred,
  function mask_cross_entropy (line 100) | def mask_cross_entropy(pred,
  class CrossEntropyLoss (line 139) | class CrossEntropyLoss(nn.Module):
    method __init__ (line 154) | def __init__(self,
    method forward (line 175) | def forward(self,

FILE: lavis/common/annotator/uniformer/mmseg/models/losses/dice_loss.py
  function dice_loss (line 12) | def dice_loss(pred,
  function binary_dice_loss (line 37) | def binary_dice_loss(pred, target, valid_mask, smooth=1, exponent=2, **k...
  class DiceLoss (line 50) | class DiceLoss(nn.Module):
    method __init__ (line 72) | def __init__(self,
    method forward (line 88) | def forward(self,

FILE: lavis/common/annotator/uniformer/mmseg/models/losses/lovasz_loss.py
  function lovasz_grad (line 14) | def lovasz_grad(gt_sorted):
  function flatten_binary_logits (line 29) | def flatten_binary_logits(logits, labels, ignore_index=None):
  function flatten_probs (line 42) | def flatten_probs(probs, labels, ignore_index=None):
  function lovasz_hinge_flat (line 59) | def lovasz_hinge_flat(logits, labels):
  function lovasz_hinge (line 83) | def lovasz_hinge(logits,
  function lovasz_softmax_flat (line 128) | def lovasz_softmax_flat(probs, labels, classes='present', class_weight=N...
  function lovasz_softmax (line 171) | def lovasz_softmax(probs,
  class LovaszLoss (line 225) | class LovaszLoss(nn.Module):
    method __init__ (line 248) | def __init__(self,
    method forward (line 274) | def forward(self,

FILE: lavis/common/annotator/uniformer/mmseg/models/losses/utils.py
  function get_class_weight (line 8) | def get_class_weight(class_weight):
  function reduce_loss (line 26) | def reduce_loss(loss, reduction):
  function weight_reduce_loss (line 46) | def weight_reduce_loss(loss, weight=None, reduction='mean', avg_factor=N...
  function weighted_loss (line 78) | def weighted_loss(loss_func):

FILE: lavis/common/annotator/uniformer/mmseg/models/necks/fpn.py
  class FPN (line 9) | class FPN(nn.Module):
    method __init__ (line 63) | def __init__(self,
    method init_weights (line 157) | def init_weights(self):
    method forward (line 162) | def forward(self, inputs):

FILE: lavis/common/annotator/uniformer/mmseg/models/necks/multilevel_neck.py
  class MultiLevelNeck (line 9) | class MultiLevelNeck(nn.Module):
    method __init__ (line 22) | def __init__(self,
    method forward (line 55) | def forward(self, inputs):

FILE: lavis/common/annotator/uniformer/mmseg/models/segmentors/base.py
  class BaseSegmentor (line 14) | class BaseSegmentor(nn.Module):
    method __init__ (line 19) | def __init__(self):
    method with_neck (line 24) | def with_neck(self):
    method with_auxiliary_head (line 29) | def with_auxiliary_head(self):
    method with_decode_head (line 35) | def with_decode_head(self):
    method extract_feat (line 40) | def extract_feat(self, imgs):
    method encode_decode (line 45) | def encode_decode(self, img, img_metas):
    method forward_train (line 51) | def forward_train(self, imgs, img_metas, **kwargs):
    method simple_test (line 56) | def simple_test(self, img, img_meta, **kwargs):
    method aug_test (line 61) | def aug_test(self, imgs, img_metas, **kwargs):
    method init_weights (line 65) | def init_weights(self, pretrained=None):
    method forward_test (line 76) | def forward_test(self, imgs, img_metas, **kwargs):
    method forward (line 111) | def forward(self, img, img_metas, return_loss=True, **kwargs):
    method train_step (line 126) | def train_step(self, data_batch, optimizer, **kwargs):
    method val_step (line 162) | def val_step(self, data_batch, **kwargs):
    method _parse_losses (line 173) | def _parse_losses(losses):
    method show_result (line 208) | def show_result(self,

FILE: lavis/common/annotator/uniformer/mmseg/models/segmentors/cascade_encoder_decoder.py
  class CascadeEncoderDecoder (line 11) | class CascadeEncoderDecoder(EncoderDecoder):
    method __init__ (line 19) | def __init__(self,
    method _init_decode_head (line 38) | def _init_decode_head(self, decode_head):
    method init_weights (line 48) | def init_weights(self, pretrained=None):
    method encode_decode (line 65) | def encode_decode(self, img, img_metas):
    method _decode_head_forward_train (line 80) | def _decode_head_forward_train(self, x, img_metas, gt_semantic_seg):

FILE: lavis/common/annotator/uniformer/mmseg/models/segmentors/encoder_decoder.py
  class EncoderDecoder (line 13) | class EncoderDecoder(BaseSegmentor):
    method __init__ (line 21) | def __init__(self,
    method _init_decode_head (line 43) | def _init_decode_head(self, decode_head):
    method _init_auxiliary_head (line 49) | def _init_auxiliary_head(self, auxiliary_head):
    method init_weights (line 59) | def init_weights(self, pretrained=None):
    method extract_feat (line 77) | def extract_feat(self, img):
    method encode_decode (line 84) | def encode_decode(self, img, img_metas):
    method _decode_head_forward_train (line 96) | def _decode_head_forward_train(self, x, img_metas, gt_semantic_seg):
    method _decode_head_forward_test (line 107) | def _decode_head_forward_test(self, x, img_metas):
    method _auxiliary_head_forward_train (line 113) | def _auxiliary_head_forward_train(self, x, img_metas, gt_semantic_seg):
    method forward_dummy (line 130) | def forward_dummy(self, img):
    method forward_train (line 136) | def forward_train(self, img, img_metas, gt_semantic_seg):
    method slide_inference (line 169) | def slide_inference(self, img, img_meta, rescale):
    method whole_inference (line 214) | def whole_inference(self, img, img_meta, rescale):
    method inference (line 233) | def inference(self, img, img_meta, rescale):
    method simple_test (line 268) | def simple_test(self, img, img_meta, rescale=True):
    method aug_test (line 281) | def aug_test(self, imgs, img_metas, rescale=True):

FILE: lavis/common/annotator/uniformer/mmseg/models/utils/drop.py
  class DropPath (line 8) | class DropPath(nn.Module):
    method __init__ (line 17) | def __init__(self, drop_prob=0.):
    method forward (line 22) | def forward(self, x):

FILE: lavis/common/annotator/uniformer/mmseg/models/utils/inverted_residual.py
  class InvertedResidual (line 8) | class InvertedResidual(nn.Module):
    method __init__ (line 31) | def __init__(self,
    method forward (line 81) | def forward(self, x):
  class InvertedResidualV3 (line 97) | class InvertedResidualV3(nn.Module):
    method __init__ (line 124) | def __init__(self,
    method forward (line 183) | def forward(self, x):

FILE: lavis/common/annotator/uniformer/mmseg/models/utils/make_divisible.py
  function make_divisible (line 1) | def make_divisible(value, divisor, min_value=None, min_ratio=0.9):

FILE: lavis/common/annotator/uniformer/mmseg/models/utils/res_layer.py
  class ResLayer (line 5) | class ResLayer(nn.Sequential):
    method __init__ (line 26) | def __init__(self,

FILE: lavis/common/annotator/uniformer/mmseg/models/utils/se_layer.py
  class SELayer (line 8) | class SELayer(nn.Module):
    method __init__ (line 26) | def __init__(self,
    method forward (line 53) | def forward(self, x):

FILE: lavis/common/annotator/uniformer/mmseg/models/utils/self_attention_block.py
  class SelfAttentionBlock (line 7) | class SelfAttentionBlock(nn.Module):
    method __init__ (line 32) | def __init__(self, key_in_channels, query_in_channels, channels,
    method init_weights (line 93) | def init_weights(self):
    method build_project (line 99) | def build_project(self, in_channels, channels, num_convs, use_conv_mod...
    method forward (line 131) | def forward(self, query_feats, key_feats):

FILE: lavis/common/annotator/uniformer/mmseg/models/utils/up_conv_block.py
  class UpConvBlock (line 6) | class UpConvBlock(nn.Module):
    method __init__ (line 44) | def __init__(self,
    method forward (line 94) | def forward(self, skip, x):

FILE: lavis/common/annotator/uniformer/mmseg/models/utils/weight_init.py
  function _no_grad_trunc_normal_ (line 10) | def _no_grad_trunc_normal_(tensor, mean, std, a, b):
  function trunc_normal_ (line 48) | def trunc_normal_(tensor, mean=0., std=1., a=-2., b=2.):

FILE: lavis/common/annotator/uniformer/mmseg/ops/encoding.py
  class Encoding (line 6) | class Encoding(nn.Module):
    method __init__ (line 17) | def __init__(self, channels, num_codes):
    method scaled_l2 (line 33) | def scaled_l2(x, codewords, scale):
    method aggregate (line 46) | def aggregate(assignment_weights, x, codewords):
    method forward (line 57) | def forward(self, x):
    method __repr__ (line 70) | def __repr__(self):

FILE: lavis/common/annotator/uniformer/mmseg/ops/wrappers.py
  function resize (line 7) | def resize(input,
  class Upsample (line 29) | class Upsample(nn.Module):
    method __init__ (line 31) | def __init__(self,
    method forward (line 45) | def forward(self, x):

FILE: lavis/common/annotator/uniformer/mmseg/utils/collect_env.py
  function collect_env (line 7) | def collect_env():

FILE: lavis/common/annotator/uniformer/mmseg/utils/logger.py
  function get_root_logger (line 6) | def get_root_logger(log_file=None, log_level=logging.INFO):

FILE: lavis/common/annotator/util.py
  function HWC3 (line 9) | def HWC3(x):
  function resize_image (line 28) | def resize_image(input_image, resolution):

FILE: lavis/common/config.py
  class Config (line 16) | class Config:
    method __init__ (line 17) | def __init__(self, args):
    method _validate_runner_config (line 43) | def _validate_runner_config(self, runner_config):
    method _build_opt_list (line 52) | def _build_opt_list(self, opts):
    method build_model_config (line 57) | def build_model_config(config, **kwargs):
    method build_runner_config (line 84) | def build_runner_config(config):
    method build_dataset_config (line 88) | def build_dataset_config(config):
    method _convert_to_dot_list (line 114) | def _convert_to_dot_list(self, opts):
    method get_config (line 128) | def get_config(self):
    method run_cfg (line 132) | def run_cfg(self):
    method datasets_cfg (line 136) | def datasets_cfg(self):
    method model_cfg (line 140) | def model_cfg(self):
    method pretty_print (line 143) | def pretty_print(self):
    method _convert_node_to_json (line 161) | def _convert_node_to_json(self, node):
    method to_dict (line 165) | def to_dict(self):
  function node_to_dict (line 169) | def node_to_dict(node):
  class ConfigValidator (line 173) | class ConfigValidator:
    class _Argument (line 187) | class _Argument:
      method __init__ (line 188) | def __init__(self, name, choices=None, type=None, help=None):
      method __str__ (line 195) | def __str__(self):
    method __init__ (line 205) | def __init__(self, description):
    method __getitem__ (line 212) | def __getitem__(self, key):
    method __str__ (line 217) | def __str__(self) -> str:
    method add_argument (line 220) | def add_argument(self, *args, **kwargs):
    method validate (line 226) | def validate(self, config=None):
    method format_arguments (line 248) | def format_arguments(self):
    method format_help (line 251) | def format_help(self):
    method print_help (line 256) | def print_help(self):
  function create_runner_config_validator (line 261) | def create_runner_config_validator():

FILE: lavis/common/dist_utils.py
  function setup_for_distributed (line 17) | def setup_for_distributed(is_master):
  function is_dist_avail_and_initialized (line 33) | def is_dist_avail_and_initialized():
  function get_world_size (line 41) | def get_world_size():
  function get_rank (line 47) | def get_rank():
  function is_main_process (line 53) | def is_main_process():
  function init_distributed_mode (line 57) | def init_distributed_mode(args):
  function get_dist_info (line 93) | def get_dist_info():
  function main_process (line 107) | def main_process(func):
  function download_cached_file (line 117) | def download_cached_file(url, check_hash=True, progress=False):

FILE: lavis/common/gradcam.py
  function getAttMap (line 7) | def getAttMap(img, attMap, blur=True, overlap=True):

FILE: lavis/common/logger.py
  class SmoothedValue (line 19) | class SmoothedValue(object):
    method __init__ (line 24) | def __init__(self, window_size=20, fmt=None):
    method update (line 32) | def update(self, value, n=1):
    method synchronize_between_processes (line 37) | def synchronize_between_processes(self):
    method median (line 51) | def median(self):
    method avg (line 56) | def avg(self):
    method global_avg (line 61) | def global_avg(self):
    method max (line 65) | def max(self):
    method value (line 69) | def value(self):
    method __str__ (line 72) | def __str__(self):
  class MetricLogger (line 82) | class MetricLogger(object):
    method __init__ (line 83) | def __init__(self, delimiter="\t"):
    method update (line 87) | def update(self, **kwargs):
    method __getattr__ (line 94) | def __getattr__(self, attr):
    method __str__ (line 103) | def __str__(self):
    method global_avg (line 109) | def global_avg(self):
    method synchronize_between_processes (line 115) | def synchronize_between_processes(self):
    method add_meter (line 119) | def add_meter(self, name, meter):
    method log_every (line 122) | def log_every(self, iterable, print_freq, header=None):
  class AttrDict (line 184) | class AttrDict(dict):
    method __init__ (line 185) | def __init__(self, *args, **kwargs):
  function setup_logger (line 190) | def setup_logger():

FILE: lavis/common/optims.py
  class LinearWarmupStepLRScheduler (line 14) | class LinearWarmupStepLRScheduler:
    method __init__ (line 15) | def __init__(
    method step (line 37) | def step(self, cur_epoch, cur_step):
  class LinearWarmupCosineLRScheduler (line 57) | class LinearWarmupCosineLRScheduler:
    method __init__ (line 58) | def __init__(
    method step (line 77) | def step(self, cur_epoch, cur_step):
  class ConstantLRScheduler (line 98) | class ConstantLRScheduler:
    method __init__ (line 99) | def __init__(self, optimizer, init_lr, warmup_start_lr=-1, warmup_step...
    method step (line 105) | def step(self, cur_epoch, cur_step):
  function cosine_lr_schedule (line 119) | def cosine_lr_schedule(optimizer, epoch, max_epoch, init_lr, min_lr):
  function warmup_lr_schedule (line 128) | def warmup_lr_schedule(optimizer, step, max_step, init_lr, max_lr):
  function step_lr_schedule (line 135) | def step_lr_schedule(optimizer, epoch, init_lr, min_lr, decay_rate):

FILE: lavis/common/registry.py
  class Registry (line 9) | class Registry:
    method register_builder (line 22) | def register_builder(cls, name):
    method register_task (line 54) | def register_task(cls, name):
    method register_model (line 83) | def register_model(cls, name):
    method register_processor (line 112) | def register_processor(cls, name):
    method register_lr_scheduler (line 141) | def register_lr_scheduler(cls, name):
    method register_runner (line 165) | def register_runner(cls, name):
    method register_path (line 189) | def register_path(cls, name, path):
    method register (line 205) | def register(cls, name, obj):
    method get_builder_class (line 232) | def get_builder_class(cls, name):
    method get_model_class (line 236) | def get_model_class(cls, name):
    method get_task_class (line 240) | def get_task_class(cls, name):
    method get_processor_class (line 244) | def get_processor_class(cls, name):
    method get_lr_scheduler_class (line 248) | def get_lr_scheduler_class(cls, name):
    method get_runner_class (line 252) | def get_runner_class(cls, name):
    method list_runners (line 256) | def list_runners(cls):
    method list_models (line 260) | def list_models(cls):
    method list_tasks (line 264) | def list_tasks(cls):
    method list_processors (line 268) | def list_processors(cls):
    method list_lr_schedulers (line 272) | def list_lr_schedulers(cls):
    method list_datasets (line 276) | def list_datasets(cls):
    method get_path (line 280) | def get_path(cls, name):
    method get (line 284) | def get(cls, name, default=None, no_warning=False):
    method unregister (line 315) | def unregister(cls, name):

FILE: lavis/common/utils.py
  function now (line 37) | def now():
  function is_url (line 43) | def is_url(url_or_filename):
  function get_cache_path (line 48) | def get_cache_path(rel_path):
  function get_abs_path (line 52) | def get_abs_path(rel_path):
  function load_json (line 56) | def load_json(filename):
  function makedir (line 66) | def makedir(dir_path):
  function get_redirected_url (line 80) | def get_redirected_url(url: str):
  function to_google_drive_download_url (line 95) | def to_google_drive_download_url(view_url: str) -> str:
  function download_google_drive_url (line 110) | def download_google_drive_url(url: str, output_path: str, output_file_na...
  function _get_google_drive_file_id (line 143) | def _get_google_drive_file_id(url: str) -> Optional[str]:
  function _urlretrieve (line 156) | def _urlretrieve(url: str, filename: str, chunk_size: int = 1024) -> None:
  function download_url (line 169) | def download_url(
  function download_and_extract_archive (line 223) | def download_and_extract_archive(
  function cache_url (line 244) | def cache_url(url: str, cache_dir: str) -> str:
  function create_file_symlink (line 263) | def create_file_symlink(file1, file2):
  function save_file (line 277) | def save_file(data, filename, append_to_json=True, verbose=True):
  function load_file (line 315) | def load_file(filename, mmap_mode=None, verbose=True, allow_pickle=False):
  function abspath (line 376) | def abspath(resource_path: str):
  function makedir (line 388) | def makedir(dir_path):
  function is_url (line 402) | def is_url(input_url):
  function download_and_untar (line 410) | def download_and_untar(url):
  function cleanup_dir (line 426) | def cleanup_dir(dir):
  function get_file_size (line 437) | def get_file_size(filename):
  function is_serializable (line 444) | def is_serializable(value):
  function is_convertible_to_int (line 454) | def is_convertible_to_int(value):

FILE: lavis/common/vqa_tools/vqa.py
  class VQA (line 31) | class VQA:
    method __init__ (line 32) | def __init__(self, annotation_file=None, question_file=None):
    method createIndex (line 53) | def createIndex(self):
    method info (line 71) | def info(self):
    method getQuesIds (line 79) | def getQuesIds(self, imgIds=[], quesTypes=[], ansTypes=[]):
    method getImgIds (line 114) | def getImgIds(self, quesIds=[], quesTypes=[], ansTypes=[]):
    method loadQA (line 148) | def loadQA(self, ids=[]):
    method showQA (line 159) | def showQA(self, anns):
    method loadRes (line 173) | def loadRes(self, resFile, quesFile):

FILE: lavis/common/vqa_tools/vqa_eval.py
  class VQAEval (line 18) | class VQAEval:
    method __init__ (line 19) | def __init__(self, vqa=None, vqaRes=None, n=2):
    method evaluate (line 193) | def evaluate(self, quesIds=None):
    method processPunctuation (line 249) | def processPunctuation(self, inText):
    method processDigitArticle (line 261) | def processDigitArticle(self, inText):
    method setAccuracy (line 276) | def setAccuracy(self, accQA, accQuesType, accAnsType):
    method setEvalQA (line 292) | def setEvalQA(self, quesId, acc):
    method setEvalQuesType (line 295) | def setEvalQuesType(self, quesId, quesType, acc):
    method setEvalAnsType (line 300) | def setEvalAnsType(self, quesId, ansType, acc):
    method updateProgress (line 305) | def updateProgress(self, progress):

FILE: lavis/datasets/builders/__init__.py
  function load_dataset (line 230) | def load_dataset(name, cfg_path=None, vis_path=None, data_type=None):
  class DatasetZoo (line 268) | class DatasetZoo:
    method __init__ (line 269) | def __init__(self) -> None:
    method get_names (line 275) | def get_names(self):

FILE: lavis/datasets/builders/audio_caption_builder.py
  class AudioCapBuilder (line 27) | class AudioCapBuilder(MultiModalDatasetBuilder):
    method build (line 36) | def build(self):
  class AudioSetBuilder (line 49) | class AudioSetBuilder(AudioCapBuilder):
  class AudioSetInstructBuilder (line 58) | class AudioSetInstructBuilder(AudioCapBuilder):
  class AudioCapsCapBuilder (line 67) | class AudioCapsCapBuilder(AudioCapBuilder):
  class AudioCapsInstructCapBuilder (line 76) | class AudioCapsInstructCapBuilder(AudioCapBuilder):
  class ClothoCapInstructBuilder (line 85) | class ClothoCapInstructBuilder(MultiModalDatasetBuilder):
  class ClothoCapInstructBuilder (line 94) | class ClothoCapInstructBuilder(MultiModalDatasetBuilder):
  class WavCapsCapBuilder (line 104) | class WavCapsCapBuilder(AudioCapBuilder):
  class WavCapsCapInstructBuilder (line 115) | class WavCapsCapInstructBuilder(AudioCapBuilder):

FILE: lavis/datasets/builders/audio_qa_builder.py
  class AudioCapsQABuilder (line 13) | class AudioCapsQABuilder(AudioCapBuilder):
  class ClothoQABuilder (line 22) | class ClothoQABuilder(AudioCapBuilder):

FILE: lavis/datasets/builders/base_dataset_builder.py
  class BaseDatasetBuilder (line 23) | class BaseDatasetBuilder:
    method __init__ (line 26) | def __init__(self, cfg=None):
    method build_datasets (line 46) | def build_datasets(self):
    method build_processors (line 62) | def build_processors(self):
    method _build_proc_from_cfg (line 86) | def _build_proc_from_cfg(cfg):
    method default_config_path (line 94) | def default_config_path(cls, type="default"):
    method _download_data (line 97) | def _download_data(self):
    method _download_ann (line 101) | def _download_ann(self):
    method _download_vis (line 158) | def _download_vis(self):
    method build (line 172) | def build(self):
  class MultiModalDatasetBuilder (line 238) | class MultiModalDatasetBuilder(BaseDatasetBuilder):
    method __init__ (line 247) | def __init__(self, cfg=None):
    method _build_processor (line 252) | def _build_processor(self, cfg_name):
    method build_processors (line 261) | def build_processors(self):
    method _download_multimodal (line 274) | def _download_multimodal(self, modality):
    method _download_data (line 279) | def _download_data(self):
    method _get_absolute_path (line 284) | def _get_absolute_path(self, path):
    method build (line 289) | def build(self):
    method _get_dataset_args (line 306) | def _get_dataset_args(self, info, is_train):
  function load_dataset_config (line 325) | def load_dataset_config(cfg_path):

FILE: lavis/datasets/builders/caption_builder.py
  class COCOCapBuilder (line 40) | class COCOCapBuilder(BaseDatasetBuilder):
  class COCOCapInstructBuilder (line 49) | class COCOCapInstructBuilder(BaseDatasetBuilder):
  class Flickr30kCapBuilder (line 59) | class Flickr30kCapBuilder(BaseDatasetBuilder):
  class Flickr30kCapInstructBuilder (line 67) | class Flickr30kCapInstructBuilder(BaseDatasetBuilder):
  class COCOCapBuilder (line 75) | class COCOCapBuilder(BaseDatasetBuilder):
  class VSRCapBuilder (line 83) | class VSRCapBuilder(BaseDatasetBuilder):
  class VSRCapInstructBuilder (line 92) | class VSRCapInstructBuilder(BaseDatasetBuilder):
  class TextCapsCapBuilder (line 101) | class TextCapsCapBuilder(BaseDatasetBuilder):
  class TextCapsCapInstructBuilder (line 110) | class TextCapsCapInstructBuilder(BaseDatasetBuilder):
  class CapFiltCapBuilder (line 120) | class CapFiltCapBuilder(BaseDatasetBuilder):
  class CapFiltCapBuilder (line 128) | class CapFiltCapBuilder(BaseDatasetBuilder):
  class MSRVTTCapBuilder (line 137) | class MSRVTTCapBuilder(BaseDatasetBuilder):
  class MSVDCapBuilder (line 147) | class MSVDCapBuilder(BaseDatasetBuilder):
  class VATEXCapBuilder (line 157) | class VATEXCapBuilder(MultiModalDatasetBuilder):
  class MSRVTTCapInstructBuilder (line 166) | class MSRVTTCapInstructBuilder(BaseDatasetBuilder):
  class MSVDCapInstructBuilder (line 175) | class MSVDCapInstructBuilder(BaseDatasetBuilder):
  class VATEXCapInstructBuilder (line 186) | class VATEXCapInstructBuilder(MultiModalDatasetBuilder):
  class WebVid2MCapBuilder (line 196) | class WebVid2MCapBuilder(BaseDatasetBuilder):
  class WebVid2MCapInstructBuilder (line 204) | class WebVid2MCapInstructBuilder(BaseDatasetBuilder):
  class ViolinCapBuilder (line 212) | class ViolinCapBuilder(BaseDatasetBuilder):
  class ViolinCapInstructBuilder (line 222) | class ViolinCapInstructBuilder(BaseDatasetBuilder):
  class VALORCaptionBuilder (line 231) | class VALORCaptionBuilder(MultiModalDatasetBuilder):
  class VALORCaptionInstructBuilder (line 240) | class VALORCaptionInstructBuilder(MultiModalDatasetBuilder):
  class VlepCaptionBuilder (line 249) | class VlepCaptionBuilder(BaseDatasetBuilder):
  class VlepCaptionInstructBuilder (line 259) | class VlepCaptionInstructBuilder(BaseDatasetBuilder):
  class YouCookCaptionBuilder (line 268) | class YouCookCaptionBuilder(BaseDatasetBuilder):
  class YouCookCaptionInstructBuilder (line 277) | class YouCookCaptionInstructBuilder(BaseDatasetBuilder):
  class COINCaptionBuilder (line 286) | class COINCaptionBuilder(BaseDatasetBuilder):
  class COINCaptionInstructBuilder (line 296) | class COINCaptionInstructBuilder(BaseDatasetBuilder):
  class CharadeCaptionBuilder (line 306) | class CharadeCaptionBuilder(BaseDatasetBuilder):
  class CharadeCaptionInstructBuilder (line 315) | class CharadeCaptionInstructBuilder(BaseDatasetBuilder):

FILE: lavis/datasets/builders/classification_builder.py
  class ViolinEntailmentBuilder (line 16) | class ViolinEntailmentBuilder(BaseDatasetBuilder):
  class ViolinEntailmentInstructBuilder (line 26) | class ViolinEntailmentInstructBuilder(BaseDatasetBuilder):
  class NLVRBuilder (line 35) | class NLVRBuilder(BaseDatasetBuilder):
  class SNLIVisualEntailmentBuilder (line 43) | class SNLIVisualEntailmentBuilder(BaseDatasetBuilder):
  class SNLIVisualEntailmentInstructBuilder (line 50) | class SNLIVisualEntailmentInstructBuilder(BaseDatasetBuilder):
  class VSRClassificationBuilder (line 58) | class VSRClassificationBuilder(BaseDatasetBuilder):
  class SNLIVisualEntailmentInstructBuilder (line 65) | class SNLIVisualEntailmentInstructBuilder(BaseDatasetBuilder):
  class ESC50ClassificationBuilder (line 72) | class ESC50ClassificationBuilder(MultiModalDatasetBuilder):

FILE: lavis/datasets/builders/dialogue_builder.py
  class AVSDDialBuilder (line 26) | class AVSDDialBuilder(BaseDatasetBuilder):
  class VisDialBuilder (line 33) | class VisDialBuilder(BaseDatasetBuilder):
  class VisDialInstructBuilder (line 40) | class VisDialInstructBuilder(BaseDatasetBuilder):
  class AVSDDialInstructBuilder (line 47) | class AVSDDialInstructBuilder(MultiModalDatasetBuilder):
  class LLaVA150kDialInstructBuilder (line 54) | class LLaVA150kDialInstructBuilder(BaseDatasetBuilder):
  class YT8MDialBuilder (line 61) | class YT8MDialBuilder(MultiModalDatasetBuilder):

FILE: lavis/datasets/builders/discrn_builders.py
  class DiscrnImagePcBuilder (line 15) | class DiscrnImagePcBuilder(MultiModalDatasetBuilder):
  class DiscrnAudioVideoBuilder (line 23) | class DiscrnAudioVideoBuilder(MultiModalDatasetBuilder):

FILE: lavis/datasets/builders/image_text_pair_builder.py
  class ConceptualCaption3MBuilder (line 16) | class ConceptualCaption3MBuilder(BaseDatasetBuilder):
  class ConceptualCaption3MInstructBuilder (line 24) | class ConceptualCaption3MInstructBuilder(BaseDatasetBuilder):
  class ConceptualCaption12MBuilder (line 33) | class ConceptualCaption12MBuilder(BaseDatasetBuilder):
  class ConceptualCaption12MInstructBuilder (line 41) | class ConceptualCaption12MInstructBuilder(BaseDatasetBuilder):
  class SBUCaptionBuilder (line 49) | class SBUCaptionBuilder(BaseDatasetBuilder):
  class SBUCaptionInstructBuilder (line 56) | class SBUCaptionInstructBuilder(BaseDatasetBuilder):
  class VGCaptionBuilder (line 63) | class VGCaptionBuilder(BaseDatasetBuilder):
  class VGCaptionInstructBuilder (line 70) | class VGCaptionInstructBuilder(BaseDatasetBuilder):
  class Laion2BMultiBuilder (line 78) | class Laion2BMultiBuilder(BaseDatasetBuilder):
    method _download_ann (line 83) | def _download_ann(self):
    method _download_vis (line 86) | def _download_vis(self):
    method build (line 89) | def build(self):
  class Laion400MBuilder (line 109) | class Laion400MBuilder(Laion2BMultiBuilder):
  class Laion400MInstructBuilder (line 116) | class Laion400MInstructBuilder(Laion2BMultiBuilder):

FILE: lavis/datasets/builders/imagefolder_builder.py
  class ImageNetBuilder (line 16) | class ImageNetBuilder(BaseDatasetBuilder):
    method _download_ann (line 22) | def _download_ann(self):
    method build (line 25) | def build(self):

FILE: lavis/datasets/builders/object3d_caption_builder.py
  class ObjaverseCaptionBuilder (line 20) | class ObjaverseCaptionBuilder(MultiModalDatasetBuilder):
    method build (line 28) | def build(self):
  class ObjaverseCaptionInstructBuilder (line 41) | class ObjaverseCaptionInstructBuilder(ObjaverseCaptionBuilder):
  class ShapenetCaptionBuilder (line 50) | class ShapenetCaptionBuilder(ObjaverseCaptionBuilder):
  class ShapenetCaptionInstructBuilder (line 59) | class ShapenetCaptionInstructBuilder(ObjaverseCaptionBuilder):

FILE: lavis/datasets/builders/object3d_classification_builder.py
  class ModelNetClassificationBuilder (line 13) | class ModelNetClassificationBuilder(MultiModalDatasetBuilder):

FILE: lavis/datasets/builders/object3d_qa_builder.py
  class ObjaverseQABuilder (line 13) | class ObjaverseQABuilder(ObjaverseCaptionBuilder):

FILE: lavis/datasets/builders/retrieval_builder.py
  class MSRVTTRetrievalBuilder (line 20) | class MSRVTTRetrievalBuilder(BaseDatasetBuilder):
  class DiDeMoRetrievalBuilder (line 28) | class DiDeMoRetrievalBuilder(BaseDatasetBuilder):
  class COCORetrievalBuilder (line 36) | class COCORetrievalBuilder(BaseDatasetBuilder):
  class Flickr30kBuilder (line 44) | class Flickr30kBuilder(BaseDatasetBuilder):

FILE: lavis/datasets/builders/text_to_image_generation_builder.py
  class BlipDiffusionFinetuneBuilder (line 16) | class BlipDiffusionFinetuneBuilder(BaseDatasetBuilder):
    method _download_ann (line 23) | def _download_ann(self):
    method build (line 26) | def build(self):

FILE: lavis/datasets/builders/video_qa_builder.py
  class VideoQABuilder (line 15) | class VideoQABuilder(BaseDatasetBuilder):
    method build (line 19) | def build(self):
  class MSRVTTQABuilder (line 35) | class MSRVTTQABuilder(VideoQABuilder):
  class MSVDQABuilder (line 42) | class MSVDQABuilder(VideoQABuilder):
  class MSRVTTQAInstructBuilder (line 49) | class MSRVTTQAInstructBuilder(VideoQABuilder):
  class MSVDQAInstructBuilder (line 58) | class MSVDQAInstructBuilder(VideoQABuilder):
  class MusicAVQABuilder (line 66) | class MusicAVQABuilder(MultiModalDatasetBuilder):
  class MusicAVQAInstructBuilder (line 73) | class MusicAVQAInstructBuilder(MultiModalDatasetBuilder):

FILE: lavis/datasets/builders/vqa_builder.py
  class COCOVQABuilder (line 20) | class COCOVQABuilder(BaseDatasetBuilder):
  class COCOVQAInstructBuilder (line 30) | class COCOVQAInstructBuilder(BaseDatasetBuilder):
  class VGVQABuilder (line 40) | class VGVQABuilder(BaseDatasetBuilder):
  class VGVQAInstructBuilder (line 45) | class VGVQAInstructBuilder(BaseDatasetBuilder):
  class OKVQABuilder (line 50) | class OKVQABuilder(COCOVQABuilder):
  class OKVQAInstructBuilder (line 56) | class OKVQAInstructBuilder(COCOVQAInstructBuilder):
  class AOKVQABuilder (line 62) | class AOKVQABuilder(BaseDatasetBuilder):
  class AOKVQAInstructBuilder (line 69) | class AOKVQAInstructBuilder(BaseDatasetBuilder):
  class GQABuilder (line 77) | class GQABuilder(BaseDatasetBuilder):
  class GQAInstructBuilder (line 88) | class GQAInstructBuilder(BaseDatasetBuilder):
  class IconQABuilder (line 99) | class IconQABuilder(BaseDatasetBuilder):
  class IconQAInstructBuilder (line 108) | class IconQAInstructBuilder(BaseDatasetBuilder):
  class ScienceQABuilder (line 117) | class ScienceQABuilder(BaseDatasetBuilder):
  class ScienceQAInstructBuilder (line 124) | class ScienceQAInstructBuilder(BaseDatasetBuilder):
  class OCRVQABuilder (line 131) | class OCRVQABuilder(BaseDatasetBuilder):
  class OCRVQAInstructBuilder (line 138) | class OCRVQAInstructBuilder(BaseDatasetBuilder):
  class VizWizVQABuilder (line 146) | class VizWizVQABuilder(BaseDatasetBuilder):

FILE: lavis/datasets/data_utils.py
  function load_video (line 30) | def load_video(video_path, n_frms=MAX_INT, height=-1, width=-1, sampling...
  function apply_to_sample (line 53) | def apply_to_sample(f, sample):
  function move_to_cuda (line 71) | def move_to_cuda(sample):
  function prepare_sample (line 78) | def prepare_sample(samples, cuda_enabled=True):
  function reorg_datasets_by_split (line 87) | def reorg_datasets_by_split(datasets):
  function concat_datasets (line 113) | def concat_datasets(datasets):
  function extract_archive (line 180) | def extract_archive(from_path, to_path=None, overwrite=False):
  function save_frames_grid (line 262) | def save_frames_grid(img_array, out_path):
  function uniform_frame_sampling (line 289) | def uniform_frame_sampling(video_path, num_frames, target_height, target...
  function head_tail_frame_sampling (line 316) | def head_tail_frame_sampling(video_path, num_frames, target_height, targ...
  function load_clip (line 345) | def load_clip(video_path, num_frames, target_height, target_width, start...

FILE: lavis/datasets/datasets/aok_vqa_datasets.py
  class __DisplMixin (line 19) | class __DisplMixin:
    method displ_item (line 20) | def displ_item(self, index):
  class AOKVQADataset (line 35) | class AOKVQADataset(VQADataset, __DisplMixin):
    method __init__ (line 36) | def __init__(self, vis_processor, text_processor, vis_root, ann_paths):
    method __getitem__ (line 39) | def __getitem__(self, index):
  class AOKVQAInstructDataset (line 67) | class AOKVQAInstructDataset(AOKVQADataset):
    method __getitem__ (line 68) | def __getitem__(self, index):
    method collater (line 74) | def collater(self, samples):
  class AOKVQAEvalDataset (line 80) | class AOKVQAEvalDataset(VQAEvalDataset, __DisplMixin):
    method __init__ (line 81) | def __init__(self, vis_processor, text_processor, vis_root, ann_paths):
    method collater (line 109) | def collater(self, samples):
    method __getitem__ (line 139) | def __getitem__(self, index):

FILE: lavis/datasets/datasets/audio_captioning_datasets.py
  class __DisplMixin (line 22) | class __DisplMixin:
    method displ_item (line 23) | def displ_item(self, index):
  class AudioCaptioningDataset (line 38) | class AudioCaptioningDataset(BaseDataset, __DisplMixin):
    method __init__ (line 39) | def __init__(self, **kwargs):
    method get_audio_path (line 47) | def get_audio_path(self, ann):
    method is_empty_audio (line 50) | def is_empty_audio(self, ann):
    method get_existing_audio_annotations (line 64) | def get_existing_audio_annotations(self):
    method get_existing_video_annotations (line 67) | def get_existing_video_annotations(self):
    method get_existing_images_annotations (line 70) | def get_existing_images_annotations(self):
    method get_video_path (line 73) | def get_video_path(self, ann):
    method get_images_path (line 76) | def get_images_path(self, ann):
    method __len__ (line 79) | def __len__(self):
    method __getitem__ (line 82) | def __getitem__(self, index):
    method _build_templates (line 85) | def _build_templates(self, templates_path):
  class AudioSetDataset (line 93) | class AudioSetDataset(AudioCaptioningDataset):
    method __init__ (line 94) | def __init__(self, **kwargs):
    method get_audio_path (line 113) | def get_audio_path(self, ann):
    method __getitem__ (line 121) | def __getitem__(self, index):
  class AudioSetInstructDataset (line 149) | class AudioSetInstructDataset(AudioSetDataset):
    method __getitem__ (line 150) | def __getitem__(self, index):
  class AudioSetEvalDataset (line 157) | class AudioSetEvalDataset(AudioSetDataset):
    method __getitem__ (line 158) | def __getitem__(self, index):
  class AudioCapsDataset (line 164) | class AudioCapsDataset(AudioCaptioningDataset):
    method __init__ (line 165) | def __init__(self, **kwargs):
    method get_audio_path (line 182) | def get_audio_path(self, ann):
    method get_cached_audio_path (line 188) | def get_cached_audio_path(self, ann):
    method __getitem__ (line 194) | def __getitem__(self, index):
  class AudioCapsInstructDataset (line 219) | class AudioCapsInstructDataset(AudioCapsDataset):
    method __getitem__ (line 220) | def __getitem__(self, index):
  class AudioCapsEvalDataset (line 227) | class AudioCapsEvalDataset(AudioCapsDataset):
    method __init__ (line 228) | def __init__(self, **kwargs):
    method __getitem__ (line 233) | def __getitem__(self, index):
  class ClothoV2Dataset (line 239) | class ClothoV2Dataset(BaseDataset, __DisplMixin):
    method __init__ (line 240) | def __init__(self, **kwargs):
    method __getitem__ (line 260) | def __getitem__(self, index):
  class ClothoV2InstructDataset (line 269) | class ClothoV2InstructDataset(ClothoV2Dataset):
    method __getitem__ (line 270) | def __getitem__(self, index):
  class ClothoV2EvalDataset (line 277) | class ClothoV2EvalDataset(ClothoV2Dataset):
    method __getitem__ (line 278) | def __getitem__(self, index):
  class AudioLanguagePretrainDataset (line 313) | class AudioLanguagePretrainDataset(BaseDataset, __DisplMixin):
    method __init__ (line 314) | def __init__(self, **kwargs):
    method _load_json_file (line 326) | def _load_json_file(self, files, audio_root, blacklist=None):
    method __len__ (line 366) | def __len__(self):
    method __getitem__ (line 369) | def __getitem__(self, index):
    method _build_templates (line 391) | def _build_templates(self, templates_path):

Condensed preview — 1383 files, each showing path, character count, and a content snippet. Download the .json file for the full structured content (34,443K chars).

[
  {
    "path": ".github/workflows/docs.yaml",
    "chars": 888,
    "preview": "name: docs\n\non:\n  push:\n    branches: [ main ]\n  pull_request:\n    branches: [ main ]\n  release:\n    types: [ published "
  },
  {
    "path": ".gitignore",
    "chars": 2472,
    "preview": "# Byte-compiled / optimized / DLL files\n__pycache__/\n*.py[cod]\n*$py.class\n\n# C extensions\n*.so\n\n# Distribution / packagi"
  },
  {
    "path": ".pre-commit-config.yaml",
    "chars": 638,
    "preview": "repos:\n-   repo: https://github.com/pre-commit/pre-commit-hooks\n    rev: v4.1.0\n    hooks:\n    -   id: trailing-whitespa"
  },
  {
    "path": "CODEOWNERS",
    "chars": 139,
    "preview": "# Comment line immediately above ownership line is reserved for related gus information. Please be careful while editing"
  },
  {
    "path": "CODE_OF_CONDUCT.md",
    "chars": 5154,
    "preview": "# Salesforce Open Source Community Code of Conduct\n\n## About the Code of Conduct\n\nEquality is a core value at Salesforce"
  },
  {
    "path": "LICENSE.txt",
    "chars": 1502,
    "preview": "BSD 3-Clause License\n\nCopyright (c) 2022 Salesforce, Inc.\nAll rights reserved.\n\nRedistribution and use in source and bin"
  },
  {
    "path": "MANIFEST.in",
    "chars": 267,
    "preview": "recursive-include lavis/configs *.yaml *.json\nrecursive-include lavis/projects *.yaml *.json\n\nrecursive-exclude lavis/da"
  },
  {
    "path": "README.md",
    "chars": 20113,
    "preview": "<p align=\"center\">\n    <br>\n    <img src=\"docs/_static/logo_final.png\" width=\"400\"/>\n    <br>\n<p>\n\n<div align=\"center\">\n"
  },
  {
    "path": "SECURITY.md",
    "chars": 400,
    "preview": "## Security\n\nPlease report any security issue to [security@salesforce.com](mailto:security@salesforce.com)\nas soon as it"
  },
  {
    "path": "app/__init__.py",
    "chars": 666,
    "preview": "\"\"\"\n # Copyright (c) 2022, salesforce.com, inc.\n # All rights reserved.\n # SPDX-License-Identifier: BSD-3-Clause\n # For "
  },
  {
    "path": "app/calculate_coco_features.py",
    "chars": 2380,
    "preview": "\"\"\"\n # Copyright (c) 2022, salesforce.com, inc.\n # All rights reserved.\n # SPDX-License-Identifier: BSD-3-Clause\n # For "
  },
  {
    "path": "app/caption.py",
    "chars": 2771,
    "preview": "\"\"\"\n # Copyright (c) 2022, salesforce.com, inc.\n # All rights reserved.\n # SPDX-License-Identifier: BSD-3-Clause\n # For "
  },
  {
    "path": "app/classification.py",
    "chars": 8076,
    "preview": "\"\"\"\n # Copyright (c) 2022, salesforce.com, inc.\n # All rights reserved.\n # SPDX-License-Identifier: BSD-3-Clause\n # For "
  },
  {
    "path": "app/dataset_browser.py",
    "chars": 7375,
    "preview": "\"\"\"\n # Copyright (c) 2022, salesforce.com, inc.\n # All rights reserved.\n # SPDX-License-Identifier: BSD-3-Clause\n # For "
  },
  {
    "path": "app/image_text_match.py",
    "chars": 2825,
    "preview": "\"\"\"\n # Copyright (c) 2022, salesforce.com, inc.\n # All rights reserved.\n # SPDX-License-Identifier: BSD-3-Clause\n # For "
  },
  {
    "path": "app/main.py",
    "chars": 819,
    "preview": "\"\"\"\n # Copyright (c) 2022, salesforce.com, inc.\n # All rights reserved.\n # SPDX-License-Identifier: BSD-3-Clause\n # For "
  },
  {
    "path": "app/multimodal_search.py",
    "chars": 7818,
    "preview": "\"\"\"\n # Copyright (c) 2022, salesforce.com, inc.\n # All rights reserved.\n # SPDX-License-Identifier: BSD-3-Clause\n # For "
  },
  {
    "path": "app/multipage.py",
    "chars": 1318,
    "preview": "\"\"\"\n # Copyright (c) 2022, salesforce.com, inc.\n # All rights reserved.\n # SPDX-License-Identifier: BSD-3-Clause\n # For "
  },
  {
    "path": "app/text_localization.py",
    "chars": 3457,
    "preview": "\"\"\"\n # Copyright (c) 2022, salesforce.com, inc.\n # All rights reserved.\n # SPDX-License-Identifier: BSD-3-Clause\n # For "
  },
  {
    "path": "app/utils.py",
    "chars": 2226,
    "preview": "\"\"\"\n # Copyright (c) 2022, salesforce.com, inc.\n # All rights reserved.\n # SPDX-License-Identifier: BSD-3-Clause\n # For "
  },
  {
    "path": "app/vqa.py",
    "chars": 1967,
    "preview": "\"\"\"\n # Copyright (c) 2022, salesforce.com, inc.\n # All rights reserved.\n # SPDX-License-Identifier: BSD-3-Clause\n # For "
  },
  {
    "path": "dataset_card/avsd_dialogue.md",
    "chars": 1873,
    "preview": "![Samples from the AVSD dataset (Image credit: \"https://arxiv.org/pdf/1901.09107.pdf\").](imgs/avsd_dialogue.png)(Samples"
  },
  {
    "path": "dataset_card/coco_caption.md",
    "chars": 3848,
    "preview": "![Samples from the COCO Caption dataset (Image credit: \"https://arxiv.org/pdf/1504.00325.pdf\").](imgs/coco_caption.png)("
  },
  {
    "path": "dataset_card/coco_retrieval.md",
    "chars": 4037,
    "preview": "![Samples from the COCO Caption dataset (Image credit: \"https://arxiv.org/pdf/1504.00325.pdf\").](imgs/coco_caption.png)("
  },
  {
    "path": "dataset_card/conceptual_captions.md",
    "chars": 2181,
    "preview": "![From https://arxiv.org/pdf/1505.00468.pdf.](imgs/conceptual_captions.png)\n(image credit: https://ai.google.com/researc"
  },
  {
    "path": "dataset_card/didemo_retrieval.md",
    "chars": 4041,
    "preview": "![Samples from the DiDeMo dataset.](imgs/didemo.png)(Samples from the DiDeMo dataset. Image credit: \"https://www.di.ens."
  },
  {
    "path": "dataset_card/flickr_retrieval.md",
    "chars": 3771,
    "preview": "![Samples from Flickr30k dataset (Image credit: \"https://bryanplummer.com/Flickr30kEntities/\").](imgs/flickr30k.png)Samp"
  },
  {
    "path": "dataset_card/gqa.md",
    "chars": 909,
    "preview": "![From https://arxiv.org/abs/1902.09506.pdf.](imgs/gqa.png)\n\n# GQA Dataset\n\n## Description\n(from https://cs.stanford.edu"
  },
  {
    "path": "dataset_card/msrvtt_qa.md",
    "chars": 3439,
    "preview": "![Samples from MSRVTT-QA dataset.](imgs/msrvtt_qa.png)(Samples from MSRVTT-QA dataset, image credit: http://staff.ustc.e"
  },
  {
    "path": "dataset_card/msrvtt_retrieval.md",
    "chars": 2375,
    "preview": "![Samples from Flickr30k dataset (Image credit: \"https://bryanplummer.com/Flickr30kEntities/\").](imgs/msrvtt.png)\n\n# MSR"
  },
  {
    "path": "dataset_card/msvd_qa.md",
    "chars": 2984,
    "preview": "![Samples from MSVD-QA dataset.](imgs/msvd_qa.png)(Samples from MSVD-QA dataset, image credit: http://staff.ustc.edu.cn/"
  },
  {
    "path": "dataset_card/nlvr2.md",
    "chars": 3441,
    "preview": "![From https://arxiv.org/pdf/1505.00468.pdf.](imgs/NLVR2.png)\n\n# Natural Language for Visual Reasoning for Real (NLVR2)\n"
  },
  {
    "path": "dataset_card/nocaps.md",
    "chars": 3178,
    "preview": "![Samples from the COCO Caption dataset (Image credit: \"https://arxiv.org/pdf/1504.00325.pdf\").](imgs/nocaps.png)\n\n# Noc"
  },
  {
    "path": "dataset_card/sbu_caption.md",
    "chars": 770,
    "preview": "![sbu caption](imgs/sbu_caption.png)\n(image credit: http://tamaraberg.com/papers/generation_nips2011.pdf)\n\n# SBU Caption"
  },
  {
    "path": "dataset_card/snli_visual_entailment.md",
    "chars": 3352,
    "preview": "![From https://github.com/necla-ml/SNLI-VE.](imgs/snli_ve.png)\n\n# SNLI-VE: Visual Entailment Dataset\n\n## Description\n(fr"
  },
  {
    "path": "dataset_card/vqav2.md",
    "chars": 3420,
    "preview": "![From https://arxiv.org/pdf/1505.00468.pdf.](imgs/vqav2.png)\n\n# Microsoft COCO Dataset (VQAv2)\n\n## Description\n(from ht"
  },
  {
    "path": "docs/Makefile",
    "chars": 638,
    "preview": "# Minimal makefile for Sphinx documentation\n#\n\n# You can set these variables from the command line, and also\n# from the "
  },
  {
    "path": "docs/benchmark.rst",
    "chars": 14221,
    "preview": "Benchmark\n############\n\nWe provide scripts for evaluating and training models on task datasets. The following benchmark "
  },
  {
    "path": "docs/build_docs.sh",
    "chars": 3319,
    "preview": "#!/bin/bash\nset -euo pipefail\n\n# Change to root directory of repo\nDIRNAME=$(cd \"$( dirname \"${BASH_SOURCE[0]}\" )\" &> /de"
  },
  {
    "path": "docs/conf.py",
    "chars": 1974,
    "preview": "# Configuration file for the Sphinx documentation builder.\n#\n# This file only contains a selection of the most common op"
  },
  {
    "path": "docs/getting_started.rst",
    "chars": 10301,
    "preview": "Dataset Zoo\n##################\nLAVIS inherently supports a wide variety of common language-vision datasets by providing "
  },
  {
    "path": "docs/index.rst",
    "chars": 731,
    "preview": ".. LAVIS documentation master file, created by\n   sphinx-quickstart on Sun Jul 31 10:32:27 2022.\n   You can adapt this f"
  },
  {
    "path": "docs/intro.rst",
    "chars": 7295,
    "preview": "What is LAVIS?\n####################################\n\nLAVIS is a Python deep learning library for LAnguage-and-VISion res"
  },
  {
    "path": "docs/make.bat",
    "chars": 799,
    "preview": "@ECHO OFF\r\n\r\npushd %~dp0\r\n\r\nREM Command file for Sphinx documentation\r\n\r\nif \"%SPHINXBUILD%\" == \"\" (\r\n\tset SPHINXBUILD=sp"
  },
  {
    "path": "docs/requirements.txt",
    "chars": 91,
    "preview": "GitPython\nipykernel\nnbsphinx==0.8.7\npandoc\nsphinx\nsphinx_autodoc_typehints\nsphinx_rtd_theme"
  },
  {
    "path": "docs/tutorial.configs.rst",
    "chars": 5718,
    "preview": ".. _config:\n\nTraining Models on Task Datasets (Commands and Configurations) \n###########################################"
  },
  {
    "path": "docs/tutorial.datasets.rst",
    "chars": 20543,
    "preview": "Adding Datasets\n################################################\n\nThis is a tutorial on adding a new dataset using ``lav"
  },
  {
    "path": "docs/tutorial.evaluation.rst",
    "chars": 1289,
    "preview": "Evaluating Pre-trained Models on Task Datasets\n###############################################\nLAVIS provides pre-traine"
  },
  {
    "path": "docs/tutorial.models.rst",
    "chars": 10799,
    "preview": "Adding Models\n####################################\n\nThis is a tutorial on adding new models using ``lavis.models`` modul"
  },
  {
    "path": "docs/tutorial.processors.rst",
    "chars": 10575,
    "preview": "Adding Processors\n################################################\n\nThis is a tutorial on adding new processors using ``"
  },
  {
    "path": "docs/tutorial.rst",
    "chars": 225,
    "preview": "Tutorials\n==============================\n\n.. toctree::\n   :maxdepth: 1\n\n   tutorial.evaluation\n   tutorial.training-exam"
  },
  {
    "path": "docs/tutorial.tasks.rst",
    "chars": 6989,
    "preview": "Adding Tasks\n####################################\n\nThis is a tutorial on adding new machine learning tasks using ``lavis"
  },
  {
    "path": "docs/tutorial.training-example.rst",
    "chars": 7134,
    "preview": "Example on Finetuning BLIP on COCO-Captioning\n################################################\n\nTo finetune BLIP model o"
  },
  {
    "path": "evaluate.py",
    "chars": 2393,
    "preview": "\"\"\"\n Copyright (c) 2022, salesforce.com, inc.\n All rights reserved.\n SPDX-License-Identifier: BSD-3-Clause\n For full lic"
  },
  {
    "path": "examples/albef_feature_extraction.ipynb",
    "chars": 508093,
    "preview": "{\n \"cells\": [\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": 1,\n   \"metadata\": {},\n   \"outputs\": [],\n   \"source\": [\n "
  },
  {
    "path": "examples/albef_vqa.ipynb",
    "chars": 1631503,
    "preview": "{\n \"cells\": [\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": 1,\n   \"metadata\": {},\n   \"outputs\": [],\n   \"source\": [\n "
  },
  {
    "path": "examples/albef_zero_shot_classification.ipynb",
    "chars": 507969,
    "preview": "{\n \"cells\": [\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": 1,\n   \"metadata\": {},\n   \"outputs\": [],\n   \"source\": [\n "
  },
  {
    "path": "examples/blip2_feature_extraction.ipynb",
    "chars": 3532,
    "preview": "{\n \"cells\": [\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": null,\n   \"metadata\": {},\n   \"outputs\": [],\n   \"source\": "
  },
  {
    "path": "examples/blip2_image_text_matching.ipynb",
    "chars": 3301,
    "preview": "{\n \"cells\": [\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": null,\n   \"metadata\": {},\n   \"outputs\": [],\n   \"source\": "
  },
  {
    "path": "examples/blip2_instructed_generation.ipynb",
    "chars": 511084,
    "preview": "{\n \"cells\": [\n  {\n   \"cell_type\": \"markdown\",\n   \"metadata\": {},\n   \"source\": [\n    \"#### Large RAM is required to load "
  },
  {
    "path": "examples/blip_feature_extraction.ipynb",
    "chars": 507895,
    "preview": "{\n \"cells\": [\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": 1,\n   \"metadata\": {},\n   \"outputs\": [],\n   \"source\": [\n "
  },
  {
    "path": "examples/blip_image_captioning.ipynb",
    "chars": 507838,
    "preview": "{\n \"cells\": [\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": 1,\n   \"metadata\": {},\n   \"outputs\": [],\n   \"source\": [\n "
  },
  {
    "path": "examples/blip_image_text_matching.ipynb",
    "chars": 507491,
    "preview": "{\n \"cells\": [\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": 17,\n   \"metadata\": {},\n   \"outputs\": [],\n   \"source\": [\n"
  },
  {
    "path": "examples/blip_text_localization.ipynb",
    "chars": 1863151,
    "preview": "{\n \"cells\": [\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": 1,\n   \"metadata\": {},\n   \"outputs\": [],\n   \"source\": [\n "
  },
  {
    "path": "examples/blip_vqa.ipynb",
    "chars": 509204,
    "preview": "{\n \"cells\": [\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": 1,\n   \"metadata\": {},\n   \"outputs\": [],\n   \"source\": [\n "
  },
  {
    "path": "examples/blip_zero_shot_classification.ipynb",
    "chars": 508188,
    "preview": "{\n \"cells\": [\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": 33,\n   \"metadata\": {},\n   \"outputs\": [],\n   \"source\": [\n"
  },
  {
    "path": "examples/clip_feature_extraction.ipynb",
    "chars": 507287,
    "preview": "{\n \"cells\": [\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": 11,\n   \"metadata\": {},\n   \"outputs\": [],\n   \"source\": [\n"
  },
  {
    "path": "examples/clip_zero_shot_classification.ipynb",
    "chars": 508032,
    "preview": "{\n \"cells\": [\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": 11,\n   \"metadata\": {},\n   \"outputs\": [],\n   \"source\": [\n"
  },
  {
    "path": "lavis/__init__.py",
    "chars": 930,
    "preview": "\"\"\"\n Copyright (c) 2022, salesforce.com, inc.\n All rights reserved.\n SPDX-License-Identifier: BSD-3-Clause\n For full lic"
  },
  {
    "path": "lavis/common/annotator/canny/__init__.py",
    "chars": 155,
    "preview": "import cv2\n\n\nclass CannyDetector:\n    def __call__(self, img, low_threshold, high_threshold):\n        return cv2.Canny(i"
  },
  {
    "path": "lavis/common/annotator/ckpts/download.sh",
    "chars": 222,
    "preview": "#! /bin/bash\n\nwget https://huggingface.co/lllyasviel/ControlNet/resolve/main/annotator/ckpts/dpt_hybrid-midas-501f0c75.p"
  },
  {
    "path": "lavis/common/annotator/hed/__init__.py",
    "chars": 6585,
    "preview": "import numpy as np\nimport cv2\nimport os\nimport torch\nfrom einops import rearrange\nfrom annotator.util import annotator_c"
  },
  {
    "path": "lavis/common/annotator/midas/__init__.py",
    "chars": 1400,
    "preview": "import cv2\nimport numpy as np\nimport torch\n\nfrom einops import rearrange\nfrom .api import MiDaSInference\n\n\nclass MidasDe"
  },
  {
    "path": "lavis/common/annotator/midas/api.py",
    "chars": 5229,
    "preview": "# based on https://github.com/isl-org/MiDaS\n\nimport cv2\nimport os\nimport torch\nimport torch.nn as nn\nfrom torchvision.tr"
  },
  {
    "path": "lavis/common/annotator/midas/midas/__init__.py",
    "chars": 0,
    "preview": ""
  },
  {
    "path": "lavis/common/annotator/midas/midas/base_model.py",
    "chars": 367,
    "preview": "import torch\n\n\nclass BaseModel(torch.nn.Module):\n    def load(self, path):\n        \"\"\"Load model from file.\n\n        Arg"
  },
  {
    "path": "lavis/common/annotator/midas/midas/blocks.py",
    "chars": 9242,
    "preview": "import torch\nimport torch.nn as nn\n\nfrom .vit import (\n    _make_pretrained_vitb_rn50_384,\n    _make_pretrained_vitl16_3"
  },
  {
    "path": "lavis/common/annotator/midas/midas/dpt_depth.py",
    "chars": 3154,
    "preview": "import torch\nimport torch.nn as nn\nimport torch.nn.functional as F\n\nfrom .base_model import BaseModel\nfrom .blocks impor"
  },
  {
    "path": "lavis/common/annotator/midas/midas/midas_net.py",
    "chars": 2709,
    "preview": "\"\"\"MidashNet: Network for monocular depth estimation trained by mixing several datasets.\nThis file contains code that is"
  },
  {
    "path": "lavis/common/annotator/midas/midas/midas_net_custom.py",
    "chars": 5207,
    "preview": "\"\"\"MidashNet: Network for monocular depth estimation trained by mixing several datasets.\nThis file contains code that is"
  },
  {
    "path": "lavis/common/annotator/midas/midas/transforms.py",
    "chars": 7869,
    "preview": "import numpy as np\nimport cv2\nimport math\n\n\ndef apply_min_size(sample, size, image_interpolation_method=cv2.INTER_AREA):"
  },
  {
    "path": "lavis/common/annotator/midas/midas/vit.py",
    "chars": 14625,
    "preview": "import torch\nimport torch.nn as nn\nimport timm\nimport types\nimport math\nimport torch.nn.functional as F\n\n\nclass Slice(nn"
  },
  {
    "path": "lavis/common/annotator/midas/utils.py",
    "chars": 4582,
    "preview": "\"\"\"Utils for monoDepth.\"\"\"\nimport sys\nimport re\nimport numpy as np\nimport cv2\nimport torch\n\n\ndef read_pfm(path):\n    \"\"\""
  },
  {
    "path": "lavis/common/annotator/mlsd/__init__.py",
    "chars": 1458,
    "preview": "import cv2\nimport numpy as np\nimport torch\nimport os\n\nfrom einops import rearrange\nfrom .models.mbv2_mlsd_tiny import Mo"
  },
  {
    "path": "lavis/common/annotator/mlsd/models/mbv2_mlsd_large.py",
    "chars": 9678,
    "preview": "import os\nimport sys\nimport torch\nimport torch.nn as nn\nimport torch.utils.model_zoo as model_zoo\nfrom  torch.nn import "
  },
  {
    "path": "lavis/common/annotator/mlsd/models/mbv2_mlsd_tiny.py",
    "chars": 9180,
    "preview": "import os\nimport sys\nimport torch\nimport torch.nn as nn\nimport torch.utils.model_zoo as model_zoo\nfrom  torch.nn import "
  },
  {
    "path": "lavis/common/annotator/mlsd/utils.py",
    "chars": 24049,
    "preview": "'''\nmodified by  lihaoweicv\npytorch version\n'''\n\n'''\nM-LSD\nCopyright 2021-present NAVER Corp.\nApache License v2.0\n'''\n\ni"
  },
  {
    "path": "lavis/common/annotator/openpose/__init__.py",
    "chars": 1957,
    "preview": "import os\nos.environ[\"KMP_DUPLICATE_LIB_OK\"]=\"TRUE\"\n\nimport torch\nimport numpy as np\nfrom . import util\nfrom .body impor"
  },
  {
    "path": "lavis/common/annotator/openpose/body.py",
    "chars": 10994,
    "preview": "import cv2\nimport numpy as np\nimport math\nimport time\nfrom scipy.ndimage.filters import gaussian_filter\nimport matplotli"
  },
  {
    "path": "lavis/common/annotator/openpose/hand.py",
    "chars": 3426,
    "preview": "import cv2\nimport json\nimport numpy as np\nimport math\nimport time\nfrom scipy.ndimage.filters import gaussian_filter\nimpo"
  },
  {
    "path": "lavis/common/annotator/openpose/model.py",
    "chars": 8745,
    "preview": "import torch\nfrom collections import OrderedDict\n\nimport torch\nimport torch.nn as nn\n\ndef make_layers(block, no_relu_lay"
  },
  {
    "path": "lavis/common/annotator/openpose/util.py",
    "chars": 7507,
    "preview": "import math\nimport numpy as np\nimport matplotlib\nimport cv2\n\n\ndef padRightDownCorner(img, stride, padValue):\n    h = img"
  },
  {
    "path": "lavis/common/annotator/uniformer/__init__.py",
    "chars": 1070,
    "preview": "import os\n\nfrom annotator.uniformer.mmseg.apis import init_segmentor, inference_segmentor, show_result_pyplot\nfrom annot"
  },
  {
    "path": "lavis/common/annotator/uniformer/configs/_base_/datasets/ade20k.py",
    "chars": 1844,
    "preview": "# dataset settings\ndataset_type = 'ADE20KDataset'\ndata_root = 'data/ade/ADEChallengeData2016'\nimg_norm_cfg = dict(\n    m"
  },
  {
    "path": "lavis/common/annotator/uniformer/configs/_base_/datasets/chase_db1.py",
    "chars": 1924,
    "preview": "# dataset settings\ndataset_type = 'ChaseDB1Dataset'\ndata_root = 'data/CHASE_DB1'\nimg_norm_cfg = dict(\n    mean=[123.675,"
  },
  {
    "path": "lavis/common/annotator/uniformer/configs/_base_/datasets/cityscapes.py",
    "chars": 1780,
    "preview": "# dataset settings\ndataset_type = 'CityscapesDataset'\ndata_root = 'data/cityscapes/'\nimg_norm_cfg = dict(\n    mean=[123."
  },
  {
    "path": "lavis/common/annotator/uniformer/configs/_base_/datasets/cityscapes_769x769.py",
    "chars": 1281,
    "preview": "_base_ = './cityscapes.py'\nimg_norm_cfg = dict(\n    mean=[123.675, 116.28, 103.53], std=[58.395, 57.12, 57.375], to_rgb="
  },
  {
    "path": "lavis/common/annotator/uniformer/configs/_base_/datasets/drive.py",
    "chars": 1915,
    "preview": "# dataset settings\ndataset_type = 'DRIVEDataset'\ndata_root = 'data/DRIVE'\nimg_norm_cfg = dict(\n    mean=[123.675, 116.28"
  },
  {
    "path": "lavis/common/annotator/uniformer/configs/_base_/datasets/hrf.py",
    "chars": 1915,
    "preview": "# dataset settings\ndataset_type = 'HRFDataset'\ndata_root = 'data/HRF'\nimg_norm_cfg = dict(\n    mean=[123.675, 116.28, 10"
  },
  {
    "path": "lavis/common/annotator/uniformer/configs/_base_/datasets/pascal_context.py",
    "chars": 1998,
    "preview": "# dataset settings\ndataset_type = 'PascalContextDataset'\ndata_root = 'data/VOCdevkit/VOC2010/'\nimg_norm_cfg = dict(\n    "
  },
  {
    "path": "lavis/common/annotator/uniformer/configs/_base_/datasets/pascal_context_59.py",
    "chars": 2024,
    "preview": "# dataset settings\ndataset_type = 'PascalContextDataset59'\ndata_root = 'data/VOCdevkit/VOC2010/'\nimg_norm_cfg = dict(\n  "
  },
  {
    "path": "lavis/common/annotator/uniformer/configs/_base_/datasets/pascal_voc12.py",
    "chars": 1930,
    "preview": "# dataset settings\ndataset_type = 'PascalVOCDataset'\ndata_root = 'data/VOCdevkit/VOC2012'\nimg_norm_cfg = dict(\n    mean="
  },
  {
    "path": "lavis/common/annotator/uniformer/configs/_base_/datasets/pascal_voc12_aug.py",
    "chars": 261,
    "preview": "_base_ = './pascal_voc12.py'\n# dataset settings\ndata = dict(\n    train=dict(\n        ann_dir=['SegmentationClass', 'Segm"
  },
  {
    "path": "lavis/common/annotator/uniformer/configs/_base_/datasets/stare.py",
    "chars": 1917,
    "preview": "# dataset settings\ndataset_type = 'STAREDataset'\ndata_root = 'data/STARE'\nimg_norm_cfg = dict(\n    mean=[123.675, 116.28"
  },
  {
    "path": "lavis/common/annotator/uniformer/configs/_base_/default_runtime.py",
    "chars": 321,
    "preview": "# yapf:disable\nlog_config = dict(\n    interval=50,\n    hooks=[\n        dict(type='TextLoggerHook', by_epoch=False),\n    "
  },
  {
    "path": "lavis/common/annotator/uniformer/configs/_base_/models/ann_r50-d8.py",
    "chars": 1346,
    "preview": "# model settings\nnorm_cfg = dict(type='SyncBN', requires_grad=True)\nmodel = dict(\n    type='EncoderDecoder',\n    pretrai"
  },
  {
    "path": "lavis/common/annotator/uniformer/configs/_base_/models/apcnet_r50-d8.py",
    "chars": 1302,
    "preview": "# model settings\nnorm_cfg = dict(type='SyncBN', requires_grad=True)\nmodel = dict(\n    type='EncoderDecoder',\n    pretrai"
  },
  {
    "path": "lavis/common/annotator/uniformer/configs/_base_/models/ccnet_r50-d8.py",
    "chars": 1258,
    "preview": "# model settings\nnorm_cfg = dict(type='SyncBN', requires_grad=True)\nmodel = dict(\n    type='EncoderDecoder',\n    pretrai"
  },
  {
    "path": "lavis/common/annotator/uniformer/configs/_base_/models/cgnet.py",
    "chars": 1110,
    "preview": "# model settings\nnorm_cfg = dict(type='SyncBN', eps=1e-03, requires_grad=True)\nmodel = dict(\n    type='EncoderDecoder',\n"
  },
  {
    "path": "lavis/common/annotator/uniformer/configs/_base_/models/danet_r50-d8.py",
    "chars": 1261,
    "preview": "# model settings\nnorm_cfg = dict(type='SyncBN', requires_grad=True)\nmodel = dict(\n    type='EncoderDecoder',\n    pretrai"
  },
  {
    "path": "lavis/common/annotator/uniformer/configs/_base_/models/deeplabv3_r50-d8.py",
    "chars": 1273,
    "preview": "# model settings\nnorm_cfg = dict(type='SyncBN', requires_grad=True)\nmodel = dict(\n    type='EncoderDecoder',\n    pretrai"
  },
  {
    "path": "lavis/common/annotator/uniformer/configs/_base_/models/deeplabv3_unet_s5-d16.py",
    "chars": 1499,
    "preview": "# model settings\nnorm_cfg = dict(type='SyncBN', requires_grad=True)\nmodel = dict(\n    type='EncoderDecoder',\n    pretrai"
  },
  {
    "path": "lavis/common/annotator/uniformer/configs/_base_/models/deeplabv3plus_r50-d8.py",
    "chars": 1343,
    "preview": "# model settings\nnorm_cfg = dict(type='SyncBN', requires_grad=True)\nmodel = dict(\n    type='EncoderDecoder',\n    pretrai"
  },
  {
    "path": "lavis/common/annotator/uniformer/configs/_base_/models/dmnet_r50-d8.py",
    "chars": 1302,
    "preview": "# model settings\nnorm_cfg = dict(type='SyncBN', requires_grad=True)\nmodel = dict(\n    type='EncoderDecoder',\n    pretrai"
  },
  {
    "path": "lavis/common/annotator/uniformer/configs/_base_/models/dnl_r50-d8.py",
    "chars": 1316,
    "preview": "# model settings\nnorm_cfg = dict(type='SyncBN', requires_grad=True)\nmodel = dict(\n    type='EncoderDecoder',\n    pretrai"
  },
  {
    "path": "lavis/common/annotator/uniformer/configs/_base_/models/emanet_r50-d8.py",
    "chars": 1329,
    "preview": "# model settings\nnorm_cfg = dict(type='SyncBN', requires_grad=True)\nmodel = dict(\n    type='EncoderDecoder',\n    pretrai"
  },
  {
    "path": "lavis/common/annotator/uniformer/configs/_base_/models/encnet_r50-d8.py",
    "chars": 1435,
    "preview": "# model settings\nnorm_cfg = dict(type='SyncBN', requires_grad=True)\nmodel = dict(\n    type='EncoderDecoder',\n    pretrai"
  },
  {
    "path": "lavis/common/annotator/uniformer/configs/_base_/models/fast_scnn.py",
    "chars": 1761,
    "preview": "# model settings\nnorm_cfg = dict(type='SyncBN', requires_grad=True, momentum=0.01)\nmodel = dict(\n    type='EncoderDecode"
  },
  {
    "path": "lavis/common/annotator/uniformer/configs/_base_/models/fcn_hr18.py",
    "chars": 1646,
    "preview": "# model settings\nnorm_cfg = dict(type='SyncBN', requires_grad=True)\nmodel = dict(\n    type='EncoderDecoder',\n    pretrai"
  },
  {
    "path": "lavis/common/annotator/uniformer/configs/_base_/models/fcn_r50-d8.py",
    "chars": 1285,
    "preview": "# model settings\nnorm_cfg = dict(type='SyncBN', requires_grad=True)\nmodel = dict(\n    type='EncoderDecoder',\n    pretrai"
  },
  {
    "path": "lavis/common/annotator/uniformer/configs/_base_/models/fcn_unet_s5-d16.py",
    "chars": 1512,
    "preview": "# model settings\nnorm_cfg = dict(type='SyncBN', requires_grad=True)\nmodel = dict(\n    type='EncoderDecoder',\n    pretrai"
  },
  {
    "path": "lavis/common/annotator/uniformer/configs/_base_/models/fpn_r50.py",
    "chars": 1056,
    "preview": "# model settings\nnorm_cfg = dict(type='SyncBN', requires_grad=True)\nmodel = dict(\n    type='EncoderDecoder',\n    pretrai"
  },
  {
    "path": "lavis/common/annotator/uniformer/configs/_base_/models/fpn_uniformer.py",
    "chars": 977,
    "preview": "# model settings\nnorm_cfg = dict(type='SyncBN', requires_grad=True)\nmodel = dict(\n    type='EncoderDecoder',\n    backbon"
  },
  {
    "path": "lavis/common/annotator/uniformer/configs/_base_/models/gcnet_r50-d8.py",
    "chars": 1326,
    "preview": "# model settings\nnorm_cfg = dict(type='SyncBN', requires_grad=True)\nmodel = dict(\n    type='EncoderDecoder',\n    pretrai"
  },
  {
    "path": "lavis/common/annotator/uniformer/configs/_base_/models/lraspp_m-v3-d8.py",
    "chars": 766,
    "preview": "# model settings\nnorm_cfg = dict(type='SyncBN', eps=0.001, requires_grad=True)\nmodel = dict(\n    type='EncoderDecoder',\n"
  },
  {
    "path": "lavis/common/annotator/uniformer/configs/_base_/models/nonlocal_r50-d8.py",
    "chars": 1315,
    "preview": "# model settings\nnorm_cfg = dict(type='SyncBN', requires_grad=True)\nmodel = dict(\n    type='EncoderDecoder',\n    pretrai"
  },
  {
    "path": "lavis/common/annotator/uniformer/configs/_base_/models/ocrnet_hr18.py",
    "chars": 2196,
    "preview": "# model settings\nnorm_cfg = dict(type='SyncBN', requires_grad=True)\nmodel = dict(\n    type='CascadeEncoderDecoder',\n    "
  },
  {
    "path": "lavis/common/annotator/uniformer/configs/_base_/models/ocrnet_r50-d8.py",
    "chars": 1385,
    "preview": "# model settings\nnorm_cfg = dict(type='SyncBN', requires_grad=True)\nmodel = dict(\n    type='CascadeEncoderDecoder',\n    "
  },
  {
    "path": "lavis/common/annotator/uniformer/configs/_base_/models/pointrend_r50.py",
    "chars": 1704,
    "preview": "# model settings\nnorm_cfg = dict(type='SyncBN', requires_grad=True)\nmodel = dict(\n    type='CascadeEncoderDecoder',\n    "
  },
  {
    "path": "lavis/common/annotator/uniformer/configs/_base_/models/psanet_r50-d8.py",
    "chars": 1406,
    "preview": "# model settings\nnorm_cfg = dict(type='SyncBN', requires_grad=True)\nmodel = dict(\n    type='EncoderDecoder',\n    pretrai"
  },
  {
    "path": "lavis/common/annotator/uniformer/configs/_base_/models/pspnet_r50-d8.py",
    "chars": 1271,
    "preview": "# model settings\nnorm_cfg = dict(type='SyncBN', requires_grad=True)\nmodel = dict(\n    type='EncoderDecoder',\n    pretrai"
  },
  {
    "path": "lavis/common/annotator/uniformer/configs/_base_/models/pspnet_unet_s5-d16.py",
    "chars": 1497,
    "preview": "# model settings\nnorm_cfg = dict(type='SyncBN', requires_grad=True)\nmodel = dict(\n    type='EncoderDecoder',\n    pretrai"
  },
  {
    "path": "lavis/common/annotator/uniformer/configs/_base_/models/upernet_r50.py",
    "chars": 1301,
    "preview": "# model settings\nnorm_cfg = dict(type='SyncBN', requires_grad=True)\nmodel = dict(\n    type='EncoderDecoder',\n    pretrai"
  },
  {
    "path": "lavis/common/annotator/uniformer/configs/_base_/models/upernet_uniformer.py",
    "chars": 1235,
    "preview": "# model settings\nnorm_cfg = dict(type='BN', requires_grad=True)\nmodel = dict(\n    type='EncoderDecoder',\n    pretrained="
  },
  {
    "path": "lavis/common/annotator/uniformer/configs/_base_/schedules/schedule_160k.py",
    "chars": 382,
    "preview": "# optimizer\noptimizer = dict(type='SGD', lr=0.01, momentum=0.9, weight_decay=0.0005)\noptimizer_config = dict()\n# learnin"
  },
  {
    "path": "lavis/common/annotator/uniformer/configs/_base_/schedules/schedule_20k.py",
    "chars": 379,
    "preview": "# optimizer\noptimizer = dict(type='SGD', lr=0.01, momentum=0.9, weight_decay=0.0005)\noptimizer_config = dict()\n# learnin"
  },
  {
    "path": "lavis/common/annotator/uniformer/configs/_base_/schedules/schedule_40k.py",
    "chars": 379,
    "preview": "# optimizer\noptimizer = dict(type='SGD', lr=0.01, momentum=0.9, weight_decay=0.0005)\noptimizer_config = dict()\n# learnin"
  },
  {
    "path": "lavis/common/annotator/uniformer/configs/_base_/schedules/schedule_80k.py",
    "chars": 379,
    "preview": "# optimizer\noptimizer = dict(type='SGD', lr=0.01, momentum=0.9, weight_decay=0.0005)\noptimizer_config = dict()\n# learnin"
  },
  {
    "path": "lavis/common/annotator/uniformer/exp/upernet_global_small/config.py",
    "chars": 1316,
    "preview": "_base_ = [\n    '../../configs/_base_/models/upernet_uniformer.py', \n    '../../configs/_base_/datasets/ade20k.py',\n    '"
  },
  {
    "path": "lavis/common/annotator/uniformer/exp/upernet_global_small/run.sh",
    "chars": 382,
    "preview": "#!/usr/bin/env bash\n\nwork_path=$(dirname $0)\nPYTHONPATH=\"$(dirname $0)/../../\":$PYTHONPATH \\\npython -m torch.distributed"
  },
  {
    "path": "lavis/common/annotator/uniformer/exp/upernet_global_small/test.sh",
    "chars": 318,
    "preview": "#!/usr/bin/env bash\n\nwork_path=$(dirname $0)\nPYTHONPATH=\"$(dirname $0)/../../\":$PYTHONPATH \\\npython -m torch.distributed"
  },
  {
    "path": "lavis/common/annotator/uniformer/exp/upernet_global_small/test_config_g.py",
    "chars": 1317,
    "preview": "_base_ = [\n    '../../configs/_base_/models/upernet_uniformer.py', \n    '../../configs/_base_/datasets/ade20k.py',\n    '"
  },
  {
    "path": "lavis/common/annotator/uniformer/exp/upernet_global_small/test_config_h32.py",
    "chars": 1339,
    "preview": "_base_ = [\n    '../../configs/_base_/models/upernet_uniformer.py', \n    '../../configs/_base_/datasets/ade20k.py',\n    '"
  },
  {
    "path": "lavis/common/annotator/uniformer/exp/upernet_global_small/test_config_w32.py",
    "chars": 1339,
    "preview": "_base_ = [\n    '../../configs/_base_/models/upernet_uniformer.py', \n    '../../configs/_base_/datasets/ade20k.py',\n    '"
  },
  {
    "path": "lavis/common/annotator/uniformer/mmcv/__init__.py",
    "chars": 352,
    "preview": "# Copyright (c) OpenMMLab. All rights reserved.\n# flake8: noqa\nfrom .arraymisc import *\nfrom .fileio import *\nfrom .imag"
  },
  {
    "path": "lavis/common/annotator/uniformer/mmcv/arraymisc/__init__.py",
    "chars": 133,
    "preview": "# Copyright (c) OpenMMLab. All rights reserved.\nfrom .quantization import dequantize, quantize\n\n__all__ = ['quantize', '"
  },
  {
    "path": "lavis/common/annotator/uniformer/mmcv/arraymisc/quantization.py",
    "chars": 1824,
    "preview": "# Copyright (c) OpenMMLab. All rights reserved.\nimport numpy as np\n\n\ndef quantize(arr, min_val, max_val, levels, dtype=n"
  },
  {
    "path": "lavis/common/annotator/uniformer/mmcv/cnn/__init__.py",
    "chars": 2438,
    "preview": "# Copyright (c) OpenMMLab. All rights reserved.\nfrom .alexnet import AlexNet\n# yapf: disable\nfrom .bricks import (ACTIVA"
  },
  {
    "path": "lavis/common/annotator/uniformer/mmcv/cnn/alexnet.py",
    "chars": 1990,
    "preview": "# Copyright (c) OpenMMLab. All rights reserved.\nimport logging\n\nimport torch.nn as nn\n\n\nclass AlexNet(nn.Module):\n    \"\""
  },
  {
    "path": "lavis/common/annotator/uniformer/mmcv/cnn/bricks/__init__.py",
    "chars": 1732,
    "preview": "# Copyright (c) OpenMMLab. All rights reserved.\nfrom .activation import build_activation_layer\nfrom .context_block impor"
  },
  {
    "path": "lavis/common/annotator/uniformer/mmcv/cnn/bricks/activation.py",
    "chars": 2508,
    "preview": "# Copyright (c) OpenMMLab. All rights reserved.\nimport torch\nimport torch.nn as nn\nimport torch.nn.functional as F\n\nfrom"
  },
  {
    "path": "lavis/common/annotator/uniformer/mmcv/cnn/bricks/context_block.py",
    "chars": 4681,
    "preview": "# Copyright (c) OpenMMLab. All rights reserved.\nimport torch\nfrom torch import nn\n\nfrom ..utils import constant_init, ka"
  },
  {
    "path": "lavis/common/annotator/uniformer/mmcv/cnn/bricks/conv.py",
    "chars": 1446,
    "preview": "# Copyright (c) OpenMMLab. All rights reserved.\nfrom torch import nn\n\nfrom .registry import CONV_LAYERS\n\nCONV_LAYERS.reg"
  },
  {
    "path": "lavis/common/annotator/uniformer/mmcv/cnn/bricks/conv2d_adaptive_padding.py",
    "chars": 2514,
    "preview": "# Copyright (c) OpenMMLab. All rights reserved.\nimport math\n\nfrom torch import nn\nfrom torch.nn import functional as F\n\n"
  },
  {
    "path": "lavis/common/annotator/uniformer/mmcv/cnn/bricks/conv_module.py",
    "chars": 8760,
    "preview": "# Copyright (c) OpenMMLab. All rights reserved.\nimport warnings\n\nimport torch.nn as nn\n\nfrom annotator.uniformer.mmcv.ut"
  },
  {
    "path": "lavis/common/annotator/uniformer/mmcv/cnn/bricks/conv_ws.py",
    "chars": 5417,
    "preview": "# Copyright (c) OpenMMLab. All rights reserved.\nimport torch\nimport torch.nn as nn\nimport torch.nn.functional as F\n\nfrom"
  },
  {
    "path": "lavis/common/annotator/uniformer/mmcv/cnn/bricks/depthwise_separable_conv_module.py",
    "chars": 4142,
    "preview": "# Copyright (c) OpenMMLab. All rights reserved.\nimport torch.nn as nn\n\nfrom .conv_module import ConvModule\n\n\nclass Depth"
  },
  {
    "path": "lavis/common/annotator/uniformer/mmcv/cnn/bricks/drop.py",
    "chars": 2172,
    "preview": "# Copyright (c) OpenMMLab. All rights reserved.\nimport torch\nimport torch.nn as nn\n\nfrom annotator.uniformer.mmcv import"
  },
  {
    "path": "lavis/common/annotator/uniformer/mmcv/cnn/bricks/generalized_attention.py",
    "chars": 15999,
    "preview": "# Copyright (c) OpenMMLab. All rights reserved.\nimport math\n\nimport numpy as np\nimport torch\nimport torch.nn as nn\nimpor"
  },
  {
    "path": "lavis/common/annotator/uniformer/mmcv/cnn/bricks/hsigmoid.py",
    "chars": 1097,
    "preview": "# Copyright (c) OpenMMLab. All rights reserved.\nimport torch.nn as nn\n\nfrom .registry import ACTIVATION_LAYERS\n\n\n@ACTIVA"
  },
  {
    "path": "lavis/common/annotator/uniformer/mmcv/cnn/bricks/hswish.py",
    "chars": 651,
    "preview": "# Copyright (c) OpenMMLab. All rights reserved.\nimport torch.nn as nn\n\nfrom .registry import ACTIVATION_LAYERS\n\n\n@ACTIVA"
  },
  {
    "path": "lavis/common/annotator/uniformer/mmcv/cnn/bricks/non_local.py",
    "chars": 11012,
    "preview": "# Copyright (c) OpenMMLab. All rights reserved.\nfrom abc import ABCMeta\n\nimport torch\nimport torch.nn as nn\n\nfrom ..util"
  },
  {
    "path": "lavis/common/annotator/uniformer/mmcv/cnn/bricks/norm.py",
    "chars": 5154,
    "preview": "# Copyright (c) OpenMMLab. All rights reserved.\nimport inspect\n\nimport torch.nn as nn\n\nfrom annotator.uniformer.mmcv.uti"
  },
  {
    "path": "lavis/common/annotator/uniformer/mmcv/cnn/bricks/padding.py",
    "chars": 1127,
    "preview": "# Copyright (c) OpenMMLab. All rights reserved.\nimport torch.nn as nn\n\nfrom .registry import PADDING_LAYERS\n\nPADDING_LAY"
  },
  {
    "path": "lavis/common/annotator/uniformer/mmcv/cnn/bricks/plugin.py",
    "chars": 2487,
    "preview": "import inspect\nimport platform\n\nfrom .registry import PLUGIN_LAYERS\n\nif platform.system() == 'Windows':\n    import regex"
  },
  {
    "path": "lavis/common/annotator/uniformer/mmcv/cnn/bricks/registry.py",
    "chars": 658,
    "preview": "# Copyright (c) OpenMMLab. All rights reserved.\nfrom annotator.uniformer.mmcv.utils import Registry\n\nCONV_LAYERS = Regis"
  },
  {
    "path": "lavis/common/annotator/uniformer/mmcv/cnn/bricks/scale.py",
    "chars": 577,
    "preview": "# Copyright (c) OpenMMLab. All rights reserved.\nimport torch\nimport torch.nn as nn\n\n\nclass Scale(nn.Module):\n    \"\"\"A le"
  },
  {
    "path": "lavis/common/annotator/uniformer/mmcv/cnn/bricks/swish.py",
    "chars": 485,
    "preview": "# Copyright (c) OpenMMLab. All rights reserved.\nimport torch\nimport torch.nn as nn\n\nfrom .registry import ACTIVATION_LAY"
  },
  {
    "path": "lavis/common/annotator/uniformer/mmcv/cnn/bricks/transformer.py",
    "chars": 24637,
    "preview": "# Copyright (c) OpenMMLab. All rights reserved.\nimport copy\nimport warnings\n\nimport torch\nimport torch.nn as nn\n\nfrom an"
  },
  {
    "path": "lavis/common/annotator/uniformer/mmcv/cnn/bricks/upsample.py",
    "chars": 2880,
    "preview": "# Copyright (c) OpenMMLab. All rights reserved.\nimport torch.nn as nn\nimport torch.nn.functional as F\n\nfrom ..utils impo"
  },
  {
    "path": "lavis/common/annotator/uniformer/mmcv/cnn/bricks/wrappers.py",
    "chars": 6961,
    "preview": "# Copyright (c) OpenMMLab. All rights reserved.\nr\"\"\"Modified from https://github.com/facebookresearch/detectron2/blob/ma"
  },
  {
    "path": "lavis/common/annotator/uniformer/mmcv/cnn/builder.py",
    "chars": 1089,
    "preview": "# Copyright (c) OpenMMLab. All rights reserved.\nfrom ..runner import Sequential\nfrom ..utils import Registry, build_from"
  },
  {
    "path": "lavis/common/annotator/uniformer/mmcv/cnn/resnet.py",
    "chars": 9955,
    "preview": "# Copyright (c) OpenMMLab. All rights reserved.\nimport logging\n\nimport torch.nn as nn\nimport torch.utils.checkpoint as c"
  },
  {
    "path": "lavis/common/annotator/uniformer/mmcv/cnn/utils/__init__.py",
    "chars": 1023,
    "preview": "# Copyright (c) OpenMMLab. All rights reserved.\nfrom .flops_counter import get_model_complexity_info\nfrom .fuse_conv_bn "
  },
  {
    "path": "lavis/common/annotator/uniformer/mmcv/cnn/utils/flops_counter.py",
    "chars": 22104,
    "preview": "# Modified from flops-counter.pytorch by Vladislav Sovrasov\n# original repo: https://github.com/sovrasov/flops-counter.p"
  },
  {
    "path": "lavis/common/annotator/uniformer/mmcv/cnn/utils/fuse_conv_bn.py",
    "chars": 1881,
    "preview": "# Copyright (c) OpenMMLab. All rights reserved.\nimport torch\nimport torch.nn as nn\n\n\ndef _fuse_conv_bn(conv, bn):\n    \"\""
  },
  {
    "path": "lavis/common/annotator/uniformer/mmcv/cnn/utils/sync_bn.py",
    "chars": 2327,
    "preview": "import torch\n\nimport annotator.uniformer.mmcv as mmcv\n\n\nclass _BatchNormXd(torch.nn.modules.batchnorm._BatchNorm):\n    \""
  },
  {
    "path": "lavis/common/annotator/uniformer/mmcv/cnn/utils/weight_init.py",
    "chars": 26006,
    "preview": "# Copyright (c) OpenMMLab. All rights reserved.\nimport copy\nimport math\nimport warnings\n\nimport numpy as np\nimport torch"
  },
  {
    "path": "lavis/common/annotator/uniformer/mmcv/cnn/vgg.py",
    "chars": 6053,
    "preview": "# Copyright (c) OpenMMLab. All rights reserved.\nimport logging\n\nimport torch.nn as nn\n\nfrom .utils import constant_init,"
  },
  {
    "path": "lavis/common/annotator/uniformer/mmcv/engine/__init__.py",
    "chars": 266,
    "preview": "# Copyright (c) OpenMMLab. All rights reserved.\nfrom .test import (collect_results_cpu, collect_results_gpu, multi_gpu_t"
  },
  {
    "path": "lavis/common/annotator/uniformer/mmcv/engine/test.py",
    "chars": 7196,
    "preview": "# Copyright (c) OpenMMLab. All rights reserved.\nimport os.path as osp\nimport pickle\nimport shutil\nimport tempfile\nimport"
  },
  {
    "path": "lavis/common/annotator/uniformer/mmcv/fileio/__init__.py",
    "chars": 478,
    "preview": "# Copyright (c) OpenMMLab. All rights reserved.\nfrom .file_client import BaseStorageBackend, FileClient\nfrom .handlers i"
  },
  {
    "path": "lavis/common/annotator/uniformer/mmcv/fileio/file_client.py",
    "chars": 41933,
    "preview": "# Copyright (c) OpenMMLab. All rights reserved.\nimport inspect\nimport os\nimport os.path as osp\nimport re\nimport tempfile"
  },
  {
    "path": "lavis/common/annotator/uniformer/mmcv/fileio/handlers/__init__.py",
    "chars": 278,
    "preview": "# Copyright (c) OpenMMLab. All rights reserved.\nfrom .base import BaseFileHandler\nfrom .json_handler import JsonHandler\n"
  },
  {
    "path": "lavis/common/annotator/uniformer/mmcv/fileio/handlers/base.py",
    "chars": 993,
    "preview": "# Copyright (c) OpenMMLab. All rights reserved.\nfrom abc import ABCMeta, abstractmethod\n\n\nclass BaseFileHandler(metaclas"
  },
  {
    "path": "lavis/common/annotator/uniformer/mmcv/fileio/handlers/json_handler.py",
    "chars": 1068,
    "preview": "# Copyright (c) OpenMMLab. All rights reserved.\nimport json\n\nimport numpy as np\n\nfrom .base import BaseFileHandler\n\n\ndef"
  },
  {
    "path": "lavis/common/annotator/uniformer/mmcv/fileio/handlers/pickle_handler.py",
    "chars": 817,
    "preview": "# Copyright (c) OpenMMLab. All rights reserved.\nimport pickle\n\nfrom .base import BaseFileHandler\n\n\nclass PickleHandler(B"
  },
  {
    "path": "lavis/common/annotator/uniformer/mmcv/fileio/handlers/yaml_handler.py",
    "chars": 665,
    "preview": "# Copyright (c) OpenMMLab. All rights reserved.\nimport yaml\n\ntry:\n    from yaml import CLoader as Loader, CDumper as Dum"
  },
  {
    "path": "lavis/common/annotator/uniformer/mmcv/fileio/io.py",
    "chars": 5520,
    "preview": "# Copyright (c) OpenMMLab. All rights reserved.\nfrom io import BytesIO, StringIO\nfrom pathlib import Path\n\nfrom ..utils "
  },
  {
    "path": "lavis/common/annotator/uniformer/mmcv/fileio/parse.py",
    "chars": 3458,
    "preview": "# Copyright (c) OpenMMLab. All rights reserved.\n\nfrom io import StringIO\n\nfrom .file_client import FileClient\n\n\ndef list"
  },
  {
    "path": "lavis/common/annotator/uniformer/mmcv/image/__init__.py",
    "chars": 1725,
    "preview": "# Copyright (c) OpenMMLab. All rights reserved.\nfrom .colorspace import (bgr2gray, bgr2hls, bgr2hsv, bgr2rgb, bgr2ycbcr,"
  },
  {
    "path": "lavis/common/annotator/uniformer/mmcv/image/colorspace.py",
    "chars": 9907,
    "preview": "# Copyright (c) OpenMMLab. All rights reserved.\nimport cv2\nimport numpy as np\n\n\ndef imconvert(img, src, dst):\n    \"\"\"Con"
  },
  {
    "path": "lavis/common/annotator/uniformer/mmcv/image/geometric.py",
    "chars": 25196,
    "preview": "# Copyright (c) OpenMMLab. All rights reserved.\nimport numbers\n\nimport cv2\nimport numpy as np\n\nfrom ..utils import to_2t"
  },
  {
    "path": "lavis/common/annotator/uniformer/mmcv/image/io.py",
    "chars": 9572,
    "preview": "# Copyright (c) OpenMMLab. All rights reserved.\nimport io\nimport os.path as osp\nfrom pathlib import Path\n\nimport cv2\nimp"
  },
  {
    "path": "lavis/common/annotator/uniformer/mmcv/image/misc.py",
    "chars": 1410,
    "preview": "# Copyright (c) OpenMMLab. All rights reserved.\nimport numpy as np\n\nimport annotator.uniformer.mmcv as mmcv\n\ntry:\n    im"
  },
  {
    "path": "lavis/common/annotator/uniformer/mmcv/image/photometric.py",
    "chars": 14999,
    "preview": "# Copyright (c) OpenMMLab. All rights reserved.\nimport cv2\nimport numpy as np\n\nfrom ..utils import is_tuple_of\nfrom .col"
  },
  {
    "path": "lavis/common/annotator/uniformer/mmcv/model_zoo/deprecated.json",
    "chars": 217,
    "preview": "{\n  \"resnet50_caffe\": \"detectron/resnet50_caffe\",\n  \"resnet50_caffe_bgr\": \"detectron2/resnet50_caffe_bgr\",\n  \"resnet101_"
  },
  {
    "path": "lavis/common/annotator/uniformer/mmcv/model_zoo/mmcls.json",
    "chars": 3690,
    "preview": "{\n  \"vgg11\": \"https://download.openmmlab.com/mmclassification/v0/vgg/vgg11_batch256_imagenet_20210208-4271cd6c.pth\",\n  \""
  },
  {
    "path": "lavis/common/annotator/uniformer/mmcv/model_zoo/open_mmlab.json",
    "chars": 5181,
    "preview": "{\n  \"vgg16_caffe\": \"https://download.openmmlab.com/pretrain/third_party/vgg16_caffe-292e1171.pth\",\n  \"detectron/resnet50"
  },
  {
    "path": "lavis/common/annotator/uniformer/mmcv/ops/__init__.py",
    "chars": 4506,
    "preview": "# Copyright (c) OpenMMLab. All rights reserved.\nfrom .assign_score_withk import assign_score_withk\nfrom .ball_query impo"
  }
]

// ... and 1183 more files (download for full content)

About this extraction

This page is a plain-text extraction of the salesforce/LAVIS GitHub repository, formatted for AI agents and large language models (LLMs). The full extraction includes 1383 files (52.6 MB), approximately 8.5M tokens, and a symbol index with 4547 extracted functions, classes, methods, constants, and types. This page shows only a preview; the complete output is available as a .txt download.

Extracted by GitExtract — free GitHub repo to text converter for AI. Built by Nikandr Surkov.
