Repository: apple/ml-ferret
Branch: main
Commit: 3c9e5c93c2c3
Files: 123
Total size: 9.3 MB

Directory structure:
gitextract_uqnovqeg/

├── .gitignore
├── CODE_OF_CONDUCT.md
├── CONTRIBUTING.md
├── EVAL.md
├── LICENSE
├── README.md
├── experiments/
│   ├── ferret_13b_train.sh
│   └── ferret_7b_train.sh
├── ferret/
│   ├── __init__.py
│   ├── constants.py
│   ├── conversation.py
│   ├── eval/
│   │   ├── eval_flickr_entities.py
│   │   ├── eval_gpt_review_3newclass.py
│   │   ├── eval_lvis.py
│   │   ├── eval_pope.py
│   │   ├── eval_refexp.py
│   │   ├── ferret_gpt4_data/
│   │   │   ├── ground_conv/
│   │   │   │   ├── answer.jsonl
│   │   │   │   ├── context.jsonl
│   │   │   │   └── question.jsonl
│   │   │   ├── refer_caption/
│   │   │   │   ├── answer.jsonl
│   │   │   │   ├── context.jsonl
│   │   │   │   └── question.jsonl
│   │   │   ├── refer_reason/
│   │   │   │   ├── answer.jsonl
│   │   │   │   ├── context.jsonl
│   │   │   │   └── question.jsonl
│   │   │   └── rule.json
│   │   ├── gpt4_eval_script.sh
│   │   ├── model_flickr.py
│   │   ├── model_gpt4eval_3newclass.py
│   │   ├── model_lvis.py
│   │   ├── model_point_cls_single_image.py
│   │   ├── model_pope.py
│   │   ├── model_refcoco.py
│   │   └── summarize_gpt_review.py
│   ├── mm_utils.py
│   ├── model/
│   │   ├── __init__.py
│   │   ├── apply_delta.py
│   │   ├── builder.py
│   │   ├── consolidate.py
│   │   ├── ferret_arch.py
│   │   ├── language_model/
│   │   │   └── ferret_llama.py
│   │   ├── make_delta.py
│   │   ├── multimodal_encoder/
│   │   │   ├── builder.py
│   │   │   └── clip_encoder.py
│   │   └── utils.py
│   ├── serve/
│   │   ├── __init__.py
│   │   ├── controller.py
│   │   ├── dejavu/
│   │   │   └── .uuid
│   │   ├── gradio_css.py
│   │   ├── gradio_web_server.py
│   │   ├── model_worker.py
│   │   └── register_worker.py
│   ├── train/
│   │   ├── ferret_trainer.py
│   │   ├── llama_flash_attn_monkey_patch.py
│   │   ├── train.py
│   │   └── train_mem.py
│   └── utils.py
├── ferretui/
│   ├── README.md
│   ├── ferretui/
│   │   ├── __init__.py
│   │   ├── constants.py
│   │   ├── conversation.py
│   │   ├── eval/
│   │   │   ├── model_UI.py
│   │   │   ├── table/
│   │   │   │   ├── answer/
│   │   │   │   │   ├── answer_alpaca-13b.jsonl
│   │   │   │   │   ├── answer_bard.jsonl
│   │   │   │   │   ├── answer_gpt35.jsonl
│   │   │   │   │   ├── answer_llama-13b.jsonl
│   │   │   │   │   └── answer_vicuna-13b.jsonl
│   │   │   │   ├── caps_boxes_coco2014_val_80.jsonl
│   │   │   │   ├── model.jsonl
│   │   │   │   ├── prompt.jsonl
│   │   │   │   ├── question.jsonl
│   │   │   │   ├── results/
│   │   │   │   │   ├── test_sqa_llava_13b_v0.json
│   │   │   │   │   └── test_sqa_llava_lcs_558k_sqa_12e_vicuna_v1_3_13b.json
│   │   │   │   ├── review/
│   │   │   │   │   ├── review_alpaca-13b_vicuna-13b.jsonl
│   │   │   │   │   ├── review_bard_vicuna-13b.jsonl
│   │   │   │   │   ├── review_gpt35_vicuna-13b.jsonl
│   │   │   │   │   └── review_llama-13b_vicuna-13b.jsonl
│   │   │   │   ├── reviewer.jsonl
│   │   │   │   └── rule.json
│   │   │   └── webpage/
│   │   │       ├── index.html
│   │   │       ├── script.js
│   │   │       └── styles.css
│   │   ├── mm_utils.py
│   │   ├── model/
│   │   │   ├── __init__.py
│   │   │   ├── apply_delta.py
│   │   │   ├── builder.py
│   │   │   ├── consolidate.py
│   │   │   ├── ferret_arch.py
│   │   │   ├── language_model/
│   │   │   │   ├── ferret_gemma.py
│   │   │   │   ├── ferret_llama.py
│   │   │   │   └── ferret_mpt.py
│   │   │   ├── make_delta.py
│   │   │   ├── multimodal_encoder/
│   │   │   │   ├── builder.py
│   │   │   │   └── clip_encoder.py
│   │   │   ├── multimodal_projector/
│   │   │   │   └── builder.py
│   │   │   └── utils.py
│   │   ├── serve/
│   │   │   ├── __init__.py
│   │   │   ├── cli.py
│   │   │   ├── controller.py
│   │   │   ├── gradio_web_server.py
│   │   │   ├── model_worker.py
│   │   │   ├── register_worker.py
│   │   │   ├── sglang_worker.py
│   │   │   └── test_message.py
│   │   ├── train/
│   │   │   ├── ferret_trainer.py
│   │   │   ├── llama_flash_attn_monkey_patch.py
│   │   │   ├── llama_xformers_attn_monkey_patch.py
│   │   │   ├── train.py
│   │   │   ├── train_mem.py
│   │   │   └── train_xformers.py
│   │   └── utils.py
│   ├── playground/
│   │   └── sample_data/
│   │       ├── eval_data_example_0_box_in.json
│   │       ├── eval_data_example_1_no_box_in.json
│   │       └── train_data_example.json
│   ├── pyproject.toml
│   └── scripts/
│       ├── eval/
│       │   └── eval_UI.sh
│       ├── train/
│       │   └── train_UI.sh
│       ├── zero2.json
│       ├── zero3.json
│       └── zero3_offload.json
├── pyproject.toml
└── scripts/
    ├── extract_geosampler_and_mm_projector.py
    └── verify_equal.py

================================================
FILE CONTENTS
================================================

================================================
FILE: .gitignore
================================================
*.egg-info
*.pyc
build/

# compilation and distribution
__pycache__
_ext
*.so
dist/

# pytorch/python/numpy formats
*.pth
*.pkl
*.npy

# Editor temporaries
*.swn
*.swo
*.swp
*~

# Pycharm editor settings
.idea

# vscode editor settings
.vscode

# MacOS
.DS_Store

# Jupyter Notebook
.ipynb_checkpoints

# Customized
checkpoints/

================================================
FILE: CODE_OF_CONDUCT.md
================================================
# Code of Conduct

## Our Pledge

In the interest of fostering an open and welcoming environment, we as
contributors and maintainers pledge to making participation in our project and
our community a harassment-free experience for everyone, regardless of age, body
size, disability, ethnicity, sex characteristics, gender identity and expression,
level of experience, education, socio-economic status, nationality, personal
appearance, race, religion, or sexual identity and orientation.

## Our Standards

Examples of behavior that contributes to creating a positive environment
include:

* Using welcoming and inclusive language
* Being respectful of differing viewpoints and experiences
* Gracefully accepting constructive criticism
* Focusing on what is best for the community
* Showing empathy towards other community members

Examples of unacceptable behavior by participants include:

* The use of sexualized language or imagery and unwelcome sexual attention or
  advances
* Trolling, insulting/derogatory comments, and personal or political attacks
* Public or private harassment
* Publishing others' private information, such as a physical or electronic
  address, without explicit permission
* Other conduct which could reasonably be considered inappropriate in a
  professional setting

## Our Responsibilities

Project maintainers are responsible for clarifying the standards of acceptable
behavior and are expected to take appropriate and fair corrective action in
response to any instances of unacceptable behavior.

Project maintainers have the right and responsibility to remove, edit, or
reject comments, commits, code, wiki edits, issues, and other contributions
that are not aligned to this Code of Conduct, or to ban temporarily or
permanently any contributor for other behaviors that they deem inappropriate,
threatening, offensive, or harmful.

## Scope

This Code of Conduct applies within all project spaces, and it also applies when
an individual is representing the project or its community in public spaces.
Examples of representing a project or community include using an official
project e-mail address, posting via an official social media account, or acting
as an appointed representative at an online or offline event. Representation of
a project may be further defined and clarified by project maintainers.

## Enforcement

Instances of abusive, harassing, or otherwise unacceptable behavior may be
reported by contacting the open source team at [opensource-conduct@group.apple.com](mailto:opensource-conduct@group.apple.com). All
complaints will be reviewed and investigated and will result in a response that
is deemed necessary and appropriate to the circumstances. The project team is
obligated to maintain confidentiality with regard to the reporter of an incident.
Further details of specific enforcement policies may be posted separately.

Project maintainers who do not follow or enforce the Code of Conduct in good
faith may face temporary or permanent repercussions as determined by other
members of the project's leadership.

## Attribution

This Code of Conduct is adapted from the [Contributor Covenant](https://www.contributor-covenant.org), version 1.4,
available at [https://www.contributor-covenant.org/version/1/4/code-of-conduct.html](https://www.contributor-covenant.org/version/1/4/code-of-conduct.html)

================================================
FILE: CONTRIBUTING.md
================================================
# Contribution Guide

Thanks for your interest in contributing. This project was released to accompany a research paper for purposes of reproducibility, and beyond its publication there are limited plans for future development of the repository.

While we welcome new pull requests and issues, please note that our response may be limited. Forks and out-of-tree improvements are strongly encouraged.

## Before you get started

By submitting a pull request, you represent that you have the right to license your contribution to Apple and the community, and agree by submitting the patch that your contributions are licensed under the [LICENSE](LICENSE).

We ask that all community members read and observe our [Code of Conduct](CODE_OF_CONDUCT.md).

================================================
FILE: EVAL.md
================================================
# Evaluation
Every evaluation script documents its usage, with example invocations, in the first several lines of its code.

## Ferret-Bench
Please follow [gpt4_eval_script.sh](ferret/eval/gpt4_eval_script.sh) to run inference on Ferret-Bench data and rate the outputs with GPT-4. Note that the `openai` package must be installed and your OPENAI_KEY must be provided.
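
Before launching the GPT-4 rating step, the two prerequisites can be checked up front. The following is only a minimal sketch: the helper name `check_gpt4_eval_prereqs` is hypothetical, and it assumes the key is supplied through an environment variable named `OPENAI_KEY`; check the evaluation script itself for how the key is actually read.

```python
import importlib.util
import os


def check_gpt4_eval_prereqs(env=None, key_name="OPENAI_KEY"):
    """Return a list of problems; an empty list means both prerequisites are met.

    Hypothetical helper, not part of the repository. `key_name` is an
    assumption about how the key is passed to the evaluation script.
    """
    env = os.environ if env is None else env
    problems = []
    # find_spec returns None when the package cannot be imported.
    if importlib.util.find_spec("openai") is None:
        problems.append("the `openai` package is not installed")
    # Treat a missing or empty variable as "not provided".
    if not env.get(key_name):
        problems.append(f"the {key_name} environment variable is not set")
    return problems


if __name__ == "__main__":
    for problem in check_gpt4_eval_prereqs():
        print("Missing prerequisite:", problem)
```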

## LVIS-Referring Object Classification
Run `ferret/eval/model_lvis.py` following the usage in the file and then run `ferret/eval/eval_lvis.py`.

## RefCOCO/RefCOCO+/RefCOCOg
Run `ferret/eval/model_refcoco.py` following the usage in the file and then run `ferret/eval/eval_refexp.py`.

## Flickr
Run `ferret/eval/model_flickr.py` following the usage in the file and then run `ferret/eval/eval_flickr_entities.py`.

## POPE
Run `ferret/eval/model_pope.py` following the usage in the file and then run `ferret/eval/eval_pope.py`.

================================================
FILE: LICENSE
================================================
Copyright (C) 2023 Apple Inc. All Rights Reserved.

IMPORTANT:  This Apple software is supplied to you by Apple
Inc. ("Apple") in consideration of your agreement to the following
terms, and your use, installation, modification or redistribution of
this Apple software constitutes acceptance of these terms.  If you do
not agree with these terms, please do not use, install, modify or
redistribute this Apple software.

In consideration of your agreement to abide by the following terms, and
subject to these terms, Apple grants you a personal, non-exclusive
license, under Apple's copyrights in this original Apple software (the
"Apple Software"), to use, reproduce, modify and redistribute the Apple
Software, with or without modifications, in source and/or binary forms;
provided that if you redistribute the Apple Software in its entirety and
without modifications, you must retain this notice and the following
text and disclaimers in all such redistributions of the Apple Software.
Neither the name, trademarks, service marks or logos of Apple Inc. may
be used to endorse or promote products derived from the Apple Software
without specific prior written permission from Apple.  Except as
expressly stated in this notice, no other rights or licenses, express or
implied, are granted by Apple herein, including but not limited to any
patent rights that may be infringed by your derivative works or by other
works in which the Apple Software may be incorporated.

The Apple Software is provided by Apple on an "AS IS" basis.  APPLE
MAKES NO WARRANTIES, EXPRESS OR IMPLIED, INCLUDING WITHOUT LIMITATION
THE IMPLIED WARRANTIES OF NON-INFRINGEMENT, MERCHANTABILITY AND FITNESS
FOR A PARTICULAR PURPOSE, REGARDING THE APPLE SOFTWARE OR ITS USE AND
OPERATION ALONE OR IN COMBINATION WITH YOUR PRODUCTS.

IN NO EVENT SHALL APPLE BE LIABLE FOR ANY SPECIAL, INDIRECT, INCIDENTAL
OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF
SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS
INTERRUPTION) ARISING IN ANY WAY OUT OF THE USE, REPRODUCTION,
MODIFICATION AND/OR DISTRIBUTION OF THE APPLE SOFTWARE, HOWEVER CAUSED
AND WHETHER UNDER THEORY OF CONTRACT, TORT (INCLUDING NEGLIGENCE),
STRICT LIABILITY OR OTHERWISE, EVEN IF APPLE HAS BEEN ADVISED OF THE
POSSIBILITY OF SUCH DAMAGE.


================================================
FILE: README.md
================================================
# <img src="figs/ferret_icon.png" alt="Ferret icon" width="40" height="45"> Ferret: Refer and Ground Anything Anywhere at Any Granularity

*An End-to-End MLLM that Accepts Any-Form Referring and Grounds Anything in Response.* [[Paper](https://arxiv.org/abs/2310.07704)]

[Haoxuan You*](https://hxyou.github.io/), [Haotian Zhang*](https://scholar.google.com/citations?user=1vz0kKUAAAAJ&hl=en/), [Zhe Gan](https://zhegan27.github.io/), [Xianzhi Du](https://scholar.google.com/citations?user=l1hP40AAAAAJ&hl=en), [Bowen Zhang](https://zbwglory.github.io/), [Zirui Wang](https://www.cs.cmu.edu/~ziruiw/), [Liangliang Cao](http://llcao.net/), [Shih-Fu Chang](https://www.ee.columbia.edu/~sfchang/), [Yinfei Yang](https://sites.google.com/site/yinfeiyang/) 
[*: equal contribution]


## Release
- [10/08/2024] 🔥 We release [Ferret-UI](ferretui/), the first UI-centric MLLM that is capable of effectively executing **referring, grounding, and reasoning** tasks.
- [07/10/2024] 🔥 [Ferret-v2](https://arxiv.org/abs/2404.07973) is accepted to COLM 2024. 
- [02/15/2024] 🔥 Ferret is accepted to ICLR 2024 as a [Spotlight](https://iclr.cc/virtual/2024/poster/19537)!!! 
- [12/14/2023] 🔥 We release the Ferret [checkpoints(7B, 13B)](#checkpoints).
- [10/30/2023] 🔥 We release the code of **FERRET** model and [Ferret-Bench](ferret/eval/ferret_gpt4_data).

## Overview

<p align="center">
    <img src="figs/ferret_fig_diagram_v2.png" width="100%"> <br>
    Diagram of Ferret Model.
</p>

Key Contributions:
* Ferret Model - **Hybrid Region Representation + Spatial-aware Visual Sampler** enable fine-grained and open-vocabulary referring and grounding in MLLM.
* GRIT Dataset (~1.1M) - A **Large-scale, Hierarchical, Robust** ground-and-refer instruction tuning dataset.
* Ferret-Bench - A multimodal evaluation benchmark that jointly requires **Referring/Grounding, Semantics, Knowledge, and Reasoning**.


**Usage and License Notices**: The data and code are intended and licensed for research use only. They are also restricted to uses that follow the license agreements of LLaMA, Vicuna and GPT-4. The dataset is licensed under CC BY NC 4.0 (allowing only non-commercial use), and models trained using the dataset should not be used outside of research purposes.

## Contents
- [Install](#install)
- [Train](#train)
- [Evaluation](#evaluation)
- [Demo](#demo)

## Install

1. Clone this repository and navigate to the `ml-ferret` folder
```bash
git clone https://github.com/apple/ml-ferret
cd ml-ferret
```

2. Install Package
```Shell
conda create -n ferret python=3.10 -y
conda activate ferret
pip install --upgrade pip  # enable PEP 660 support
pip install -e .
pip install pycocotools
pip install protobuf==3.20.0
```

3. Install additional packages for training
```Shell
pip install ninja
pip install flash-attn --no-build-isolation
```


## Train

FERRET is trained on 8 A100 GPUs with 80GB memory. To train on fewer GPUs, reduce `per_device_train_batch_size` and increase `gradient_accumulation_steps` accordingly. Always keep the global batch size, `per_device_train_batch_size` x `gradient_accumulation_steps` x `num_gpus`, the same.

### Hyperparameters
We finetune with a set of hyperparameters similar to that of LLaVA (Vicuna).

| Model | Global batch size | Learning rate | Epochs | Max length | Weight decay |
| --- | ---: | ---: | ---: | ---: | ---: |
| FERRET-7B | 128 | 2e-5 | 3 | 2048 | 0 |
| FERRET-13B | 128 | 2e-5 | 3 | 2048 | 0 |
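
The batch-size rule above can be turned into a quick calculation. This is a minimal sketch, not code from the repository: the function name is hypothetical, and the numbers come from the table (global batch size 128) and the provided training scripts (8 GPUs, `per_device_train_batch_size` of 16, `gradient_accumulation_steps` of 1).

```python
def gradient_accumulation_steps(global_batch_size, per_device_batch_size, num_gpus):
    """Return the accumulation steps that keep the global batch size fixed.

    Hypothetical helper: global = per_device x accumulation_steps x num_gpus.
    """
    samples_per_step = per_device_batch_size * num_gpus
    if global_batch_size % samples_per_step != 0:
        raise ValueError(
            f"global batch size {global_batch_size} is not divisible by "
            f"{per_device_batch_size} x {num_gpus} = {samples_per_step}"
        )
    return global_batch_size // samples_per_step


if __name__ == "__main__":
    # Setting used by the provided scripts: 8 GPUs, 16 samples per device.
    print(gradient_accumulation_steps(128, 16, 8))  # 1
    # Half the GPUs: accumulate over twice as many steps.
    print(gradient_accumulation_steps(128, 16, 4))  # 2
    # Fewer GPUs and a smaller per-device batch to fit in memory.
    print(gradient_accumulation_steps(128, 8, 2))   # 8
```

Raising an error on a non-divisible combination is a design choice of this sketch; it avoids silently training with a different global batch size.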

### Prepare Vicuna checkpoint and LLaVA's projector

Before you start, prepare our base model Vicuna, which is an instruction-tuned chatbot. Please download its weights following the instructions [here](https://github.com/lm-sys/FastChat#model-weights). Vicuna v1.3 is used in FERRET.

Then download LLaVA's first-stage pre-trained projector weight ([7B](https://huggingface.co/liuhaotian/llava-336px-pretrain-vicuna-7b-v1.3), [13B](https://huggingface.co/liuhaotian/llava-336px-pretrain-vicuna-13b-v1.3)).


### FERRET Training

Training scripts are provided for both model sizes ([7B](experiments/ferret_7b_train.sh), [13B](experiments/ferret_13b_train.sh)).


## Evaluation

Please see this [doc](EVAL.md) for the details.

## Checkpoints
We extracted the `delta` between our pre-trained model and Vicuna. First, download the Vicuna weights following the [previous instruction](#prepare-vicuna-checkpoint-and-llavas-projector). Then download our prepared weight offsets ([7B](https://docs-assets.developer.apple.com/ml-research/models/ferret/ferret-7b/ferret-7b-delta.zip), [13B](https://docs-assets.developer.apple.com/ml-research/models/ferret/ferret-13b/ferret-13b-delta.zip)) using `wget` or `curl`, and unzip them. Finally, apply the offsets to the Vicuna weights by running the following script:
```Shell
# 7B
python3 -m ferret.model.apply_delta \
    --base ./model/vicuna-7b-v1-3 \
    --target ./model/ferret-7b-v1-3 \
    --delta path/to/ferret-7b-delta
# 13B
python3 -m ferret.model.apply_delta \
    --base ./model/vicuna-13b-v1-3 \
    --target ./model/ferret-13b-v1-3 \
    --delta path/to/ferret-13b-delta
```
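
Conceptually, a weight delta is the difference between the released model and its base, and applying it adds that difference back. The sketch below illustrates the idea on plain Python lists. It is not the implementation in `ferret.model.apply_delta`; the function names, the element-wise subtraction, and the handling of parameters that exist only in the target model are all assumptions made for illustration.

```python
def make_delta(base, target):
    """Compute per-parameter offsets: delta = target - base (illustrative only)."""
    delta = {}
    for name, target_values in target.items():
        if name in base:
            delta[name] = [t - b for t, b in zip(target_values, base[name])]
        else:
            # Assumption: parameters absent from the base are stored unchanged.
            delta[name] = list(target_values)
    return delta


def apply_delta(base, delta):
    """Rebuild the target weights: target = base + delta (illustrative only)."""
    target = {}
    for name, delta_values in delta.items():
        if name in base:
            target[name] = [b + d for b, d in zip(base[name], delta_values)]
        else:
            # Assumption: nothing to add back for parameters the base lacks.
            target[name] = list(delta_values)
    return target


if __name__ == "__main__":
    base = {"layer.weight": [1.0, 2.0]}
    target = {"layer.weight": [1.5, 1.0], "new_module.weight": [3.0]}
    delta = make_delta(base, target)
    print(delta)
    print(apply_delta(base, delta) == target)
```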

**Notices**: Apple's rights in the attached weight differentials are hereby licensed under the CC-BY-NC license. Apple makes no representations with regards to LLaMa or any other third party software, which are subject to their own terms.

Please refer to the next section for how to set up a local demo with the pre-trained weights.

## Demo

To run our demo, you need to train FERRET and use the checkpoints locally. The demo uses a Gradio web UI. Please run the following commands one by one.

#### Launch a controller
```Shell
python -m ferret.serve.controller --host 0.0.0.0 --port 10000
```

#### Launch a gradio web server.
```Shell
python -m ferret.serve.gradio_web_server --controller http://localhost:10000 --model-list-mode reload --add_region_feature
```

#### Launch a model worker

This is the worker that loads the checkpoint and runs inference on the GPU. Each worker is responsible for a single model, specified with `--model-path`.

```Shell
CUDA_VISIBLE_DEVICES=0 python -m ferret.serve.model_worker --host 0.0.0.0 --controller http://localhost:10000 --port 40000 --worker http://localhost:40000 --model-path ./checkpoints/FERRET-13B-v0 --add_region_feature
```
Wait until the process finishes loading the model and you see "Uvicorn running on ...".  Now, refresh your Gradio web UI, and you will see the model you just launched in the model list.


<p align="center">
    <img src="figs/ferret_demo.png" width="105%"> <br>
    Example of Ferret Interactive Demo.
</p>


## Citation

If you find Ferret useful, please cite using this BibTeX:

```bibtex
@article{you2023ferret,
  title={Ferret: Refer and Ground Anything Anywhere at Any Granularity},
  author={You, Haoxuan and Zhang, Haotian and Gan, Zhe and Du, Xianzhi and Zhang, Bowen and Wang, Zirui and Cao, Liangliang and Chang, Shih-Fu and Yang, Yinfei},
  journal={arXiv preprint arXiv:2310.07704},
  year={2023}
}
```

## Acknowledgement

- [LLaVA](https://github.com/haotian-liu/LLaVA): the codebase we built upon. 
- [Vicuna](https://github.com/lm-sys/FastChat): the LLM codebase.


================================================
FILE: experiments/ferret_13b_train.sh
================================================
#!/usr/bin/env bash
set -xe

mkdir -p checkpoints

echo "Start Fine-Tuning"
# =================== Training ======================
data_path=(
            'dataset/git_instruction.json' 
            'dataset/vg_objects.json'  
            'dataset/vg_relations.json' 
            'dataset/vg_regions.json' 
            'dataset/grounded_llava_boxes_detail.json' 
            'dataset/grounded_llava_boxes_complex_reasoning.json' 
            'dataset/grounded_llava_boxes_conversation.json' 
            'dataset/refexp_all.json' 
            'dataset/flickr.json' 
            'dataset/objects365.json' 
            )
image_folder=(
            'dataset/coco2014/train2014' 
            'dataset/vg/images' 
            'dataset/vg/images' 
            'dataset/vg/images' 
            'dataset/coco2014/train2014' 
            'dataset/coco2014/train2014' 
            'dataset/coco2014/train2014' 
            'data/refcoco/train2014' 
            'data/flickr30k/flickr30k_images_split/train' 
            'data/objects365_v1/train' 
            )
data_multiple=(
            3 
            1 
            0.2 
            0.2 
            1 
            1 
            1 
            1 
            1 
            1 
            )

# convert array to string
data_path="${data_path[@]}"
image_folder="${image_folder[@]}"
data_multiple="${data_multiple[@]}"

################## VICUNA ##################
PROMPT_VERSION=v1
MODEL_VERSION="vicuna-13b-v1-3"
################## VICUNA ##################

torchrun --nnodes=1 --nproc_per_node=8 --master_port=25001 \
    ferret/train/train_mem.py \
    --lora_enable False \
    --model_name_or_path ./model/$MODEL_VERSION \
    --version $PROMPT_VERSION \
    --data_path $data_path \
    --image_folder $image_folder \
    --data_multiple $data_multiple \
    --vision_tower openai/clip-vit-large-patch14-336 \
    --pretrain_mm_mlp_adapter ./model/llava-336px-pretrain-$MODEL_VERSION/mm_projector.bin \
    --mm_vision_select_layer -2 \
    --mm_use_im_start_end False \
    --mm_use_im_patch_token False \
    --bf16 True \
    --output_dir ./checkpoints/ferret_13b \
    --num_train_epochs 3 \
    --per_device_train_batch_size 16 \
    --per_device_eval_batch_size 4 \
    --gradient_accumulation_steps 1 \
    --evaluation_strategy "no" \
    --save_strategy "steps" \
    --save_steps 1500 \
    --save_total_limit 3 \
    --learning_rate 2e-5 \
    --weight_decay 0. \
    --warmup_ratio 0.03 \
    --lr_scheduler_type "cosine" \
    --logging_steps 1 \
    --tf32 True \
    --fsdp "full_shard auto_wrap" \
    --fsdp_transformer_layer_cls_to_wrap 'LlamaDecoderLayer' \
    --model_max_length 2048 \
    --gradient_checkpointing True \
    --dataloader_num_workers 8 \
    --lazy_preprocess True \
    --report_to tensorboard \
    --point_input_sample 'segment_mask|center' \
    --add_region_feature True \
    --region_geo_sampler True \
    --sampler_pooler_mode 'max' \
    --refer_previous_point False \
    --resized_image_h 336 \
    --resized_image_w 336 \
    --save_vision_tower True



================================================
FILE: experiments/ferret_7b_train.sh
================================================
#!/usr/bin/env bash
set -xe

mkdir -p checkpoints

# =================== Training ======================
data_path=(
            'dataset/git_instruction.json' 
            'dataset/vg_objects.json'  
            'dataset/vg_relations.json' 
            'dataset/vg_regions.json' 
            'dataset/grounded_llava_boxes_detail.json' 
            'dataset/grounded_llava_boxes_complex_reasoning.json' 
            'dataset/grounded_llava_boxes_conversation.json' 
            'dataset/refexp_all.json' 
            'dataset/flickr.json' 
            'dataset/objects365.json' 
            )
image_folder=(
            'dataset/coco2014/train2014' 
            'dataset/vg/images' 
            'dataset/vg/images' 
            'dataset/vg/images' 
            'dataset/coco2014/train2014' 
            'dataset/coco2014/train2014' 
            'dataset/coco2014/train2014' 
            'data/refcoco/train2014' 
            'data/flickr30k/flickr30k_images_split/train' 
            'data/objects365_v1/train' 
            )
data_multiple=(
            3 
            1 
            0.2 
            0.2 
            1 
            1 
            1 
            1 
            1 
            1 
            )

# convert array to string
data_path="${data_path[@]}"
image_folder="${image_folder[@]}"
data_multiple="${data_multiple[@]}"

################## VICUNA ##################
PROMPT_VERSION=v1
MODEL_VERSION="vicuna-7b-v1-3"
################## VICUNA ##################

torchrun --nnodes=1 --nproc_per_node=8 --master_port=25001 \
    ferret/train/train_mem.py \
    --lora_enable False \
    --model_name_or_path ./model/$MODEL_VERSION \
    --version $PROMPT_VERSION \
    --data_path $data_path \
    --image_folder $image_folder \
    --data_multiple $data_multiple \
    --vision_tower openai/clip-vit-large-patch14-336 \
    --pretrain_mm_mlp_adapter ./model/llava-336px-pretrain-$MODEL_VERSION/mm_projector.bin \
    --mm_vision_select_layer -2 \
    --mm_use_im_start_end False \
    --mm_use_im_patch_token False \
    --bf16 True \
    --output_dir ./checkpoints/ferret_7b \
    --num_train_epochs 3 \
    --per_device_train_batch_size 16 \
    --per_device_eval_batch_size 4 \
    --gradient_accumulation_steps 1 \
    --evaluation_strategy "no" \
    --save_strategy "steps" \
    --save_steps 1500 \
    --save_total_limit 3 \
    --learning_rate 2e-5 \
    --weight_decay 0. \
    --warmup_ratio 0.03 \
    --lr_scheduler_type "cosine" \
    --logging_steps 1 \
    --tf32 True \
    --fsdp "full_shard auto_wrap" \
    --fsdp_transformer_layer_cls_to_wrap 'LlamaDecoderLayer' \
    --model_max_length 2048 \
    --gradient_checkpointing True \
    --dataloader_num_workers 8 \
    --lazy_preprocess True \
    --report_to tensorboard \
    --point_input_sample 'segment_mask|center' \
    --add_region_feature True \
    --region_geo_sampler True \
    --sampler_pooler_mode 'max' \
    --refer_previous_point False \
    --resized_image_h 336 \
    --resized_image_w 336 \
    --save_vision_tower True



================================================
FILE: ferret/__init__.py
================================================
from .model import FERRETLlamaForCausalLM


================================================
FILE: ferret/constants.py
================================================
CONTROLLER_HEART_BEAT_EXPIRATION = 30
WORKER_HEART_BEAT_INTERVAL = 15

LOGDIR = "."

# Model Constants
IGNORE_INDEX = -100
IMAGE_TOKEN_INDEX = -200
DEFAULT_IMAGE_TOKEN = "<image>"
DEFAULT_IMAGE_PATCH_TOKEN = "<im_patch>"
DEFAULT_IM_START_TOKEN = "<im_start>"
DEFAULT_IM_END_TOKEN = "<im_end>"


================================================
FILE: ferret/conversation.py
================================================
import dataclasses
from enum import auto, Enum
from typing import List, Tuple

VOCAB_IMAGE_W = 1000  # 224
VOCAB_IMAGE_H = 1000  # 224

class SeparatorStyle(Enum):
    """Different separator style."""
    SINGLE = auto()
    TWO = auto()
    MPT = auto()
    PLAIN = auto()
    LLAMA_2 = auto()


@dataclasses.dataclass
class Conversation:
    """A class that keeps all conversation history."""
    system: str
    roles: List[str]
    messages: List[List[str]]
    offset: int
    sep_style: SeparatorStyle = SeparatorStyle.SINGLE
    sep: str = "###"
    sep2: str = None
    version: str = "Unknown"

    skip_next: bool = False
    first_round: bool = True


    def get_prompt(self):
        messages = self.messages
        if len(messages) > 0 and type(messages[0][1]) is tuple:
            messages = self.messages.copy()
            init_role, init_msg = messages[0].copy()
            init_msg = init_msg[0].replace("<image>", "").strip()
            if 'mmtag' in self.version:
                messages[0] = (init_role, init_msg)
                messages.insert(0, (self.roles[0], "<Image><image></Image>"))
                messages.insert(1, (self.roles[1], "Received."))
            else:
                messages[0] = (init_role, "<image>\n" + init_msg)

        if self.sep_style == SeparatorStyle.SINGLE:
            ret = self.system + self.sep
            for role, message in messages:
                if message:
                    if type(message) is tuple:
                        message, _, _ = message
                    ret += role + ": " + message + self.sep
                else:
                    ret += role + ":"
        elif self.sep_style == SeparatorStyle.TWO:
            seps = [self.sep, self.sep2]
            ret = self.system + seps[0]
            for i, (role, message) in enumerate(messages):
                if message:
                    if type(message) is tuple:
                        message, _, _ = message
                    ret += role + ": " + message + seps[i % 2]
                else:
                    ret += role + ":"
        elif self.sep_style == SeparatorStyle.MPT:
            ret = self.system + self.sep
            for role, message in messages:
                if message:
                    if type(message) is tuple:
                        message, _, _ = message
                    ret += role + message + self.sep
                else:
                    ret += role
        elif self.sep_style == SeparatorStyle.LLAMA_2:
            wrap_sys = lambda msg: f"<<SYS>>\n{msg}\n<</SYS>>\n\n"
            wrap_inst = lambda msg: f"[INST] {msg} [/INST]"
            ret = ""

            for i, (role, message) in enumerate(messages):
                if i == 0:
                    assert message, "first message should not be none"
                    assert role == self.roles[0], "first message should come from user"
                if message:
                    if type(message) is tuple:
                        message, _, _ = message
                    if i == 0: message = wrap_sys(self.system) + message
                    if i % 2 == 0:
                        message = wrap_inst(message)
                        ret += self.sep + message
                    else:
                        ret += " " + message + " " + self.sep2
                else:
                    ret += ""
            ret = ret.lstrip(self.sep)
        elif self.sep_style == SeparatorStyle.PLAIN:
            seps = [self.sep, self.sep2]
            ret = self.system
            for i, (role, message) in enumerate(messages):
                if message:
                    if type(message) is tuple:
                        message, _, _ = message
                    ret += message + seps[i % 2]
                else:
                    ret += ""
        else:
            raise ValueError(f"Invalid style: {self.sep_style}")

        return ret

    def append_message(self, role, message):
        self.messages.append([role, message])

    def get_images(self, return_pil=False):
        images = []
        for i, (role, msg) in enumerate(self.messages[self.offset:]):
            if i % 2 == 0:
                if type(msg) is tuple:
                    import base64
                    from io import BytesIO
                    from PIL import Image
                    msg, image, image_process_mode = msg
                    if image_process_mode == "Pad":
                        def expand2square(pil_img, background_color=(122, 116, 104)):
                            width, height = pil_img.size
                            if width == height:
                                return pil_img
                            elif width > height:
                                result = Image.new(pil_img.mode, (width, width), background_color)
                                result.paste(pil_img, (0, (width - height) // 2))
                                return result
                            else:
                                result = Image.new(pil_img.mode, (height, height), background_color)
                                result.paste(pil_img, ((height - width) // 2, 0))
                                return result
                        image = expand2square(image)
                    elif image_process_mode == "Crop":
                        pass
                    elif image_process_mode == "Raw+Processor":
                        pass
                    elif image_process_mode == "Resize":
                        image = image.resize((336, 336))
                    else:
                        raise ValueError(f"Invalid image_process_mode: {image_process_mode}")

                    if image_process_mode != "Raw+Processor":
                        max_hw, min_hw = max(image.size), min(image.size)
                        aspect_ratio = max_hw / min_hw
                        max_len, min_len = 800, 400
                        shortest_edge = int(min(max_len / aspect_ratio, min_len, min_hw))
                        longest_edge = int(shortest_edge * aspect_ratio)
                        W, H = image.size
                        if H > W:
                            H, W = longest_edge, shortest_edge
                        else:
                            H, W = shortest_edge, longest_edge
                        image = image.resize((W, H))
                    print('Input Image Size:{}'.format(image.size))

                    if return_pil:
                        images.append(image)
                    else:
                        buffered = BytesIO()
                        image.save(buffered, format="PNG")
                        img_b64_str = base64.b64encode(buffered.getvalue()).decode()
                        images.append(img_b64_str)
        return images

    def to_gradio_chatbot(self):
        ret = []
        for i, (role, msg) in enumerate(self.messages[self.offset:]):
            if i % 2 == 0:
                if type(msg) is tuple:
                    import base64
                    from io import BytesIO
                    msg, image, image_process_mode = msg
                    if image_process_mode != "Raw+Processor":
                        max_hw, min_hw = max(image.size), min(image.size)
                        aspect_ratio = max_hw / min_hw
                        max_len, min_len = 800, 400
                        shortest_edge = int(min(max_len / aspect_ratio, min_len, min_hw))
                        longest_edge = int(shortest_edge * aspect_ratio)
                        W, H = image.size
                        if H > W:
                            H, W = longest_edge, shortest_edge
                        else:
                            H, W = shortest_edge, longest_edge
                        image = image.resize((W, H))
                    buffered = BytesIO()
                    image.save(buffered, format="JPEG")
                    img_b64_str = base64.b64encode(buffered.getvalue()).decode()
                    img_str = f'<img src="data:image/jpeg;base64,{img_b64_str}" alt="user upload image" />'
                    ret.append([img_str, None])
                    msg = msg.replace('<image>', '').strip()
                    if len(msg) > 0:
                        ret.append([msg, None])
                else:
                    ret.append([msg, None])
            else:
                ret[-1][-1] = msg
        return ret

    def copy(self):
        return Conversation(
            system=self.system,
            roles=self.roles,
            messages=[[x, y] for x, y in self.messages],
            offset=self.offset,
            sep_style=self.sep_style,
            sep=self.sep,
            sep2=self.sep2,
            version=self.version)

    def dict(self):
        if len(self.get_images()) > 0:
            return {
                "system": self.system,
                "roles": self.roles,
                "messages": [[x, y[0] if type(y) is tuple else y] for x, y in self.messages],
                "offset": self.offset,
                "sep": self.sep,
                "sep2": self.sep2,
            }
        return {
            "system": self.system,
            "roles": self.roles,
            "messages": self.messages,
            "offset": self.offset,
            "sep": self.sep,
            "sep2": self.sep2,
        }



ferret_conv_vicuna_v1_original_system = Conversation(
    system="A chat between a curious human and an artificial intelligence assistant. "
           "Assistant is able to understand the visual content that the user provides, and assist the user with a variety of tasks using natural language. "
           "In images, points are represented by coordinates [x, y]. The top-left corner is [0, 0]. The bottom-right corner is [width-1, height-1]. "
           "Increasing x moves right across the image while increasing y moves down. "
           "A bounding box is marked by [x1, y1, x2, y2] with the top-left and bottom-right points being [x1, y1] and [x2, y2] respectively. "
           f"The image size is assumed to be ({VOCAB_IMAGE_W}, {VOCAB_IMAGE_H}), i.e., width={VOCAB_IMAGE_W}, height={VOCAB_IMAGE_H}. "
           "Follow the instructions carefully. ",
    roles=("USER", "ASSISTANT"),
    version="v1",
    messages=(),
    offset=0,
    sep_style=SeparatorStyle.TWO,
    sep=" ",
    sep2="</s>",
)

ferret_conv_vicuna_v1 = Conversation(
    system="A chat between a human and an AI that understands visuals. "
           "In images, [x, y] denotes points: top-left [0, 0], bottom-right [width-1, height-1]. "
           "Increasing x moves right; y moves down. "
           f"Bounding box: [x1, y1, x2, y2]. Image size: {VOCAB_IMAGE_W}x{VOCAB_IMAGE_H}. "
           "Follow instructions. ",
    roles=("USER", "ASSISTANT"),
    version="v1",
    messages=(),
    offset=0,
    sep_style=SeparatorStyle.TWO,
    sep=" ",
    sep2="</s>",
)


default_conversation = ferret_conv_vicuna_v1
conv_templates = {
    "v1": ferret_conv_vicuna_v1,
    "ferret_v1": ferret_conv_vicuna_v1,
}


if __name__ == "__main__":
    print(default_conversation.get_prompt())


================================================
FILE: ferret/eval/eval_flickr_entities.py
================================================
"""
Usage:

python ferret/eval/eval_flickr_entities.py \
    --prediction_file result_checkpoint-final/flickr_result/final_flickr_mergedGT_test \
    --annotation_file data/annotations/final_flickr_mergedGT_test.json \
    --flickr_entities_path data/flickr30k

"""


import xml.etree.ElementTree as ET
from collections import defaultdict
from pathlib import Path
from typing import Any, Dict, List, Optional, Sequence, Tuple, Union

import numpy as np
from prettytable import PrettyTable
from tqdm import tqdm

import json
import os
import re

VOCAB_IMAGE_W = 1000
VOCAB_IMAGE_H = 1000


def resize_bbox(box, image_w=None, image_h=None):
    ratio_w = image_w * 1.0 / VOCAB_IMAGE_W
    ratio_h = image_h * 1.0 / VOCAB_IMAGE_H

    new_box = [int(box[0] * ratio_w), int(box[1] * ratio_h),
               int(box[2] * ratio_w), int(box[3] * ratio_h)]
    return new_box


def decode_bbox_from_caption(text, img_w, img_h, verbose=False):
    entities = []
    boxes = []
    
    start = 0
    in_brackets = False
    entity = ""
    box = ""
    
    for i, char in enumerate(text):
        if char == '[':
            in_brackets = True
            entity = text[start:i].strip()
            start = i + 1
        elif char == ']':
            in_brackets = False
            box = text[start:i].strip()
            start = i + 1
            
            # Convert box string to list of integers
            box_list = list(map(int, box.split(',')))
            resized_box_list = resize_bbox(box_list, img_w, img_h)
            entities.append(entity)
            boxes.append(resized_box_list)
            
            # Skip until the next entity (ignoring periods or other delimiters)
            while start < len(text) and text[start] not in ['.', ',', ';', '!', '?']:
                start += 1
            start += 1  # Skip the delimiter
        
    return entities, boxes


def are_phrases_similar(phrase1, phrase2):
    # Step 1: Convert to lower case
    phrase1 = phrase1.lower()
    phrase2 = phrase2.lower()
    
    # Step 2: Standardize spacing around punctuation
    phrase1 = re.sub(r'\s*([\'",.;!?|:])\s*', r'\1 ', phrase1).strip()
    phrase2 = re.sub(r'\s*([\'",.;!?|:])\s*', r'\1 ', phrase2).strip()
    
    # Step 3: Remove all punctuation
    phrase1 = re.sub(r'[^\w\s]', '', phrase1)
    phrase2 = re.sub(r'[^\w\s]', '', phrase2)
    
    # Step 4: Remove extra white spaces
    phrase1 = ' '.join(phrase1.split())
    phrase2 = ' '.join(phrase2.split())
    
    return phrase1 == phrase2


def get_sentence_data(filename) -> List[Dict[str, Any]]:
    """
    Parses a sentence file from the Flickr30K Entities dataset

    input:
      filename - full file path to the sentence file to parse

    output:
      a list of dictionaries for each sentence with the following fields:
          sentence - the original sentence
          phrases - a list of dictionaries for each phrase with the
                    following fields:
                      phrase - the text of the annotated phrase
                      first_word_index - the position of the first word of
                                         the phrase in the sentence
                      phrase_id - an identifier for this phrase
                      phrase_type - a list of the coarse categories this
                                    phrase belongs to

    """
    with open(filename, "r") as f:
        sentences = f.read().split("\n")

    annotations = []
    for sentence in sentences:
        if not sentence:
            continue

        first_word = []
        phrases = []
        phrase_id = []
        phrase_type = []
        words = []
        current_phrase = []
        add_to_phrase = False
        for token in sentence.split():
            if add_to_phrase:
                if token[-1] == "]":
                    add_to_phrase = False
                    token = token[:-1]
                    current_phrase.append(token)
                    phrases.append(" ".join(current_phrase))
                    current_phrase = []
                else:
                    current_phrase.append(token)

                words.append(token)
            else:
                if token[0] == "[":
                    add_to_phrase = True
                    first_word.append(len(words))
                    parts = token.split("/")
                    phrase_id.append(parts[1][3:])
                    phrase_type.append(parts[2:])
                else:
                    words.append(token)

        sentence_data = {"sentence": " ".join(words), "phrases": []}
        for index, phrase, p_id, p_type in zip(first_word, phrases, phrase_id, phrase_type):
            sentence_data["phrases"].append(
                {"first_word_index": index, "phrase": phrase, "phrase_id": p_id, "phrase_type": p_type}
            )

        annotations.append(sentence_data)

    return annotations


def get_annotations(filename) -> Dict[str, Union[int, List[str], Dict[str, List[List[int]]]]]:
    """
    Parses the xml files in the Flickr30K Entities dataset

    input:
      filename - full file path to the annotations file to parse

    output:
      dictionary with the following fields:
          scene - list of identifiers which were annotated as
                  pertaining to the whole scene
          nobox - list of identifiers which were annotated as
                  not being visible in the image
          boxes - a dictionary where the fields are identifiers
                  and the values are its list of boxes in the
                  [xmin ymin xmax ymax] format
          height - int representing the height of the image
          width - int representing the width of the image
          depth - int representing the depth of the image
    """
    tree = ET.parse(filename)
    root = tree.getroot()
    size_container = root.findall("size")[0]
    anno_info: Dict[str, Union[int, List[str], Dict[str, List[List[int]]]]] = {}
    all_boxes: Dict[str, List[List[int]]] = {}
    all_noboxes: List[str] = []
    all_scenes: List[str] = []
    for size_element in size_container:
        assert size_element.text
        anno_info[size_element.tag] = int(size_element.text)

    for object_container in root.findall("object"):
        for names in object_container.findall("name"):
            box_id = names.text
            assert box_id
            box_container = object_container.findall("bndbox")
            if len(box_container) > 0:
                if box_id not in all_boxes:
                    all_boxes[box_id] = []
                xmin = int(box_container[0].findall("xmin")[0].text)
                ymin = int(box_container[0].findall("ymin")[0].text)
                xmax = int(box_container[0].findall("xmax")[0].text)
                ymax = int(box_container[0].findall("ymax")[0].text)
                all_boxes[box_id].append([xmin, ymin, xmax, ymax])
            else:
                nobndbox = int(object_container.findall("nobndbox")[0].text)
                if nobndbox > 0:
                    all_noboxes.append(box_id)

                scene = int(object_container.findall("scene")[0].text)
                if scene > 0:
                    all_scenes.append(box_id)
    anno_info["boxes"] = all_boxes
    anno_info["nobox"] = all_noboxes
    anno_info["scene"] = all_scenes

    return anno_info


#### END of import from flickr30k_entities
#### Bounding box utilities imported from torchvision and converted to numpy
def box_area(boxes: np.ndarray) -> np.ndarray:
    """
    Computes the area of a set of bounding boxes, which are specified by their
    (x1, y1, x2, y2) coordinates.

    Args:
        boxes (np.ndarray[N, 4]): boxes for which the area will be computed. They
            are expected to be in (x1, y1, x2, y2) format with
            ``0 <= x1 < x2`` and ``0 <= y1 < y2``.

    Returns:
        area (np.ndarray[N]): area for each box
    """
    assert boxes.ndim == 2 and boxes.shape[-1] == 4
    return (boxes[:, 2] - boxes[:, 0]) * (boxes[:, 3] - boxes[:, 1])


# implementation from https://github.com/kuangliu/torchcv/blob/master/torchcv/utils/box.py
# with slight modifications
def _box_inter_union(boxes1: np.array, boxes2: np.array) -> Tuple[np.array, np.array]:
    area1 = box_area(boxes1)
    area2 = box_area(boxes2)

    lt = np.maximum(boxes1[:, None, :2], boxes2[:, :2])  # [N,M,2]
    rb = np.minimum(boxes1[:, None, 2:], boxes2[:, 2:])  # [N,M,2]

    wh = (rb - lt).clip(min=0)  # [N,M,2]
    inter = wh[:, :, 0] * wh[:, :, 1]  # [N,M]

    union = area1[:, None] + area2 - inter

    return inter, union


def box_iou(boxes1: np.array, boxes2: np.array) -> np.array:
    """
    Return intersection-over-union (Jaccard index) of boxes.

    Both sets of boxes are expected to be in ``(x1, y1, x2, y2)`` format with
    ``0 <= x1 < x2`` and ``0 <= y1 < y2``.

    Args:
        boxes1 (Tensor[N, 4])
        boxes2 (Tensor[M, 4])

    Returns:
        iou (Tensor[N, M]): the NxM matrix containing the pairwise IoU values for every element in boxes1 and boxes2
    """
    inter, union = _box_inter_union(boxes1, boxes2)
    iou = inter / union
    return iou


#### End of import of box utilities

def _merge_boxes(boxes: List[List[int]]) -> List[List[int]]:
    """
    Return the boxes corresponding to the smallest enclosing box containing all the provided boxes
    The boxes are expected in [x1, y1, x2, y2] format
    """
    if len(boxes) == 1:
        return boxes

    np_boxes = np.asarray(boxes)

    return [[np_boxes[:, 0].min(), np_boxes[:, 1].min(), np_boxes[:, 2].max(), np_boxes[:, 3].max()]]

class RecallTracker:
    """ Utility class to track recall@k for various k, split by categories"""

    def __init__(self, topk: Sequence[int]):
        """
        Parameters:
           - topk : tuple of ints corresponding to the recalls being tracked (eg, recall@1, recall@10, ...)
        """

        self.total_byk_bycat: Dict[int, Dict[str, int]] = {k: defaultdict(int) for k in topk}
        self.positives_byk_bycat: Dict[int, Dict[str, int]] = {k: defaultdict(int) for k in topk}

    def add_positive(self, k: int, category: str):
        """Log a positive hit @k for given category"""
        if k not in self.total_byk_bycat:
            raise RuntimeError(f"{k} is not a valid recall threshold")
        self.total_byk_bycat[k][category] += 1
        self.positives_byk_bycat[k][category] += 1

    def add_negative(self, k: int, category: str):
        """Log a negative hit @k for given category"""
        if k not in self.total_byk_bycat:
            raise RuntimeError(f"{k} is not a valid recall threshold")
        self.total_byk_bycat[k][category] += 1

    def report(self) -> Dict[int, Dict[str, float]]:
        """Return a condensed report of the results as a dict of dict.
        report[k][cat] is the recall@k for the given category
        """
        report: Dict[int, Dict[str, float]] = {}
        for k in self.total_byk_bycat:
            assert k in self.positives_byk_bycat
            report[k] = {
                cat: self.positives_byk_bycat[k][cat] / self.total_byk_bycat[k][cat] for cat in self.total_byk_bycat[k]
            }
        return report


class Flickr30kEntitiesRecallEvaluator:
    def __init__(
        self,
        flickr_path: str,
        subset: str = "test",
        topk: Sequence[int] = (1, 5, 10, -1),
        iou_thresh: float = 0.5,
        merge_boxes: bool = False,
        verbose: bool = True,
    ):

        assert subset in ["train", "test", "val"], f"Wrong flickr subset {subset}"

        self.topk = topk
        self.iou_thresh = iou_thresh

        flickr_path = Path(flickr_path)

        # We load the image ids corresponding to the current subset
        with open(flickr_path / f"{subset}.txt") as file_d:
            self.img_ids = [line.strip() for line in file_d]

        if verbose:
            print(f"Flickr subset contains {len(self.img_ids)} images")

        # Read the box annotations for all the images
        self.imgid2boxes: Dict[str, Dict[str, List[List[int]]]] = {}

        if verbose:
            print("Loading annotations...")

        for img_id in self.img_ids:
            anno_info = get_annotations(flickr_path / "Annotations" / f"{img_id}.xml")["boxes"]
            if merge_boxes:
                merged = {}
                for phrase_id, boxes in anno_info.items():
                    merged[phrase_id] = _merge_boxes(boxes)
                anno_info = merged
            self.imgid2boxes[img_id] = anno_info

        # Read the sentences annotations
        self.imgid2sentences: Dict[str, List[List[Optional[Dict]]]] = {}

        if verbose:
            print("Loading sentences...")

        self.all_ids: List[str] = []
        tot_phrases = 0
        for img_id in self.img_ids:
            sentence_info = get_sentence_data(flickr_path / "Sentences" / f"{img_id}.txt")
            self.imgid2sentences[img_id] = [None for _ in range(len(sentence_info))]

            # Some phrases don't have boxes, we filter them.
            for sent_id, sentence in enumerate(sentence_info):
                phrases = [phrase for phrase in sentence["phrases"] if phrase["phrase_id"] in self.imgid2boxes[img_id]]
                if len(phrases) > 0:
                    self.imgid2sentences[img_id][sent_id] = phrases
                tot_phrases += len(phrases)

            self.all_ids += [
                f"{img_id}_{k}" for k in range(len(sentence_info)) if self.imgid2sentences[img_id][k] is not None
            ]

        if verbose:
            print(f"There are {tot_phrases} phrases in {len(self.all_ids)} sentences to evaluate")

    def evaluate(self, predictions: List[Dict]):
        evaluated_ids = set()

        recall_tracker = RecallTracker(self.topk)

        for pred in predictions:
            cur_id = f"{pred['image_id']}_{pred['sentence_id']}"
            if cur_id in evaluated_ids:
                print(
                    "Warning, multiple predictions found for sentence "
                    f"{pred['sentence_id']} in image {pred['image_id']}"
                )
                continue

            # Skip the sentences with no valid phrase
            if cur_id not in self.all_ids:
                if len(pred["boxes"]) != 0:
                    print(
                        f"Warning, in image {pred['image_id']} we were not expecting predictions "
                        f"for sentence {pred['sentence_id']}. Ignoring them."
                    )
                continue

            evaluated_ids.add(cur_id)

            pred_boxes = pred["boxes"]
            if str(pred["image_id"]) not in self.imgid2sentences:
                raise RuntimeError(f"Unknown image id {pred['image_id']}")
            if not 0 <= int(pred["sentence_id"]) < len(self.imgid2sentences[str(pred["image_id"])]):
                raise RuntimeError(f"Unknown sentence id {pred['sentence_id']}" f" in image {pred['image_id']}")
            target_sentence = self.imgid2sentences[str(pred["image_id"])][int(pred["sentence_id"])]

            phrases = self.imgid2sentences[str(pred["image_id"])][int(pred["sentence_id"])]
            if len(pred_boxes) != len(phrases):
                raise RuntimeError(
                    f"Error, got {len(pred_boxes)} predictions, expected {len(phrases)} "
                    f"for sentence {pred['sentence_id']} in image {pred['image_id']}"
                )

            for cur_boxes, phrase in zip(pred_boxes, phrases):
                target_boxes = self.imgid2boxes[str(pred["image_id"])][phrase["phrase_id"]]

                ious = box_iou(np.asarray(cur_boxes), np.asarray(target_boxes))
                for k in self.topk:
                    maxi = 0
                    if k == -1:
                        maxi = ious.max()
                    else:
                        assert k > 0
                        maxi = ious[:k].max()
                    if maxi >= self.iou_thresh:
                        recall_tracker.add_positive(k, "all")
                        for phrase_type in phrase["phrase_type"]:
                            recall_tracker.add_positive(k, phrase_type)
                    else:
                        recall_tracker.add_negative(k, "all")
                        for phrase_type in phrase["phrase_type"]:
                            recall_tracker.add_negative(k, phrase_type)

        if len(evaluated_ids) != len(self.all_ids):
            print("ERROR, the number of evaluated sentences doesn't match. Missing predictions:")
            un_processed = set(self.all_ids) - evaluated_ids
            for missing in un_processed:
                img_id, sent_id = missing.split("_")
                print(f"\t sentence {sent_id} in image {img_id}")
            raise RuntimeError("Missing predictions")

        return recall_tracker.report()


class Flickr30kEntitiesRecallEvaluatorFromJsonl(Flickr30kEntitiesRecallEvaluator):
    def evaluate(self, 
                 annotation_file: str,
                 prediction_file: str,
                 verbose: bool = False,
                ):
        recall_tracker = RecallTracker(self.topk)
        
        # Use a context manager so the annotation file handle is closed after loading.
        with open(annotation_file, 'r', encoding='utf-8') as f:
            gt_json = json.load(f)

        # get the predictions
        if os.path.isfile(prediction_file):
            predictions = [json.loads(line) for line in open(prediction_file)]
        elif os.path.isdir(prediction_file):
            predictions = [json.loads(line) for pred_file in sorted(os.listdir(prediction_file)) for line in open(os.path.join(prediction_file, pred_file))]
        else:
            raise NotImplementedError('Not supported file format.')
        
        predict_index = 0
        
        valid_cnt = 0
        for item in tqdm(gt_json['images']):
            file_name = item["file_name"]
            caption = item["caption"]
            img_height = float(item['height'])
            img_width = float(item['width'])
            positive_item_pos = item['tokens_positive_eval']
            
            # to verify 
            phrases_from_self = self.imgid2sentences[str(item['original_img_id'])][int(item['sentence_id'])]
            for pos in positive_item_pos:
                # pdb.set_trace()
                if predict_index == len(predictions):
                    break
                
                pos_start, pos_end = pos[0]
                phrase = caption[pos_start:pos_end]
                phrase_from_self = [p for p in phrases_from_self if p['phrase'] == phrase]
                if len(phrase_from_self) == 0:
                    raise ValueError(f"Can't find the corresponding gt phrase in both files: {phrase} vs. {phrases_from_self}")
                else:
                    phrase_from_self = phrase_from_self[0]
                
                # get the prediction from text line
                try:
                    prediction = predictions[predict_index]["text"]
                except IndexError:
                    print("Raised IndexError.")
                    print(f"prediction index / length: {predict_index} / {len(predictions)}")
                    import sys
                    # Exit with a non-zero status so callers can detect the failure.
                    sys.exit(1)
                try:
                    entities, boxes = decode_bbox_from_caption(prediction, img_width, img_height, verbose=verbose)
                    assert len(entities) == len(boxes)
                except ValueError as e:
                    entities, boxes = [], []

                predict_boxes = []

                for (entity, box) in zip(entities, boxes):
                    if not are_phrases_similar(entity, phrase): # get the matched noun phrase
                        # print(f"{entity} | {phrase}")
                        continue
                    else:
                        predict_boxes.append(box)

                if len(predict_boxes) == 0:
                    print(f"Can't find valid bbox for the given phrase ({phrase}) in caption ({caption}), \n{prediction}")
                    print("Falling back to a 0-area box to compute recall")
                    predict_boxes = [[0., 0., 0., 0.]]
                    # exit(0)
                
                # evaluate
                target_boxes = self.imgid2boxes[str(item['original_img_id'])][phrase_from_self["phrase_id"]]
                ious = box_iou(np.asarray(predict_boxes), np.asarray(target_boxes))
                for k in self.topk:
                    maxi = 0
                    if k == -1:
                        maxi = ious.max()
                    else:
                        assert k > 0
                        maxi = ious[:k].max()
                    if maxi >= self.iou_thresh:
                        recall_tracker.add_positive(k, "all")
                        for phrase_type in phrase_from_self["phrase_type"]:
                            recall_tracker.add_positive(k, phrase_type)
                    else:
                        recall_tracker.add_negative(k, "all")
                        for phrase_type in phrase_from_self["phrase_type"]:
                            recall_tracker.add_negative(k, phrase_type)
                            
                # pdb.set_trace()
                valid_cnt += 1
            predict_index += 1
  
        print(f"Valid prediction {valid_cnt}/{len(predictions)}")     
        self.results = recall_tracker.report()
        return self.results
    
    def summarize(self):
        table = PrettyTable()
        all_cat = sorted(list(self.results.values())[0].keys())
        table.field_names = ["Recall@k"] + all_cat

        score = {}
        for k, v in self.results.items():
            cur_results = [v[cat] for cat in all_cat]
            header = "Upper_bound" if k == -1 else f"Recall@{k}"

            for cat in all_cat:
                score[f"{header}_{cat}"] = v[cat]
            table.add_row([header] + cur_results)

        print(table)
        return score


if __name__ == '__main__':
    import argparse

    parser = argparse.ArgumentParser()
    parser.add_argument('--prediction_file', help='prediction_file')
    parser.add_argument('--annotation_file', default='/path/to/final_flickr_mergedGT_test.json', help='annotation_file')
    parser.add_argument('--flickr_entities_path', default='/path/to/flickr30k_entities', help='flickr entities')
    
    args = parser.parse_args()

    if os.path.isfile(args.prediction_file):
        predictions = [json.loads(line) for line in open(args.prediction_file)]
    elif os.path.isdir(args.prediction_file):
        predictions = []
    
    if '_test.json' in args.annotation_file:
        subset = "test"
    else:
        subset = "val"
    
    evaluator = Flickr30kEntitiesRecallEvaluatorFromJsonl(
        flickr_path = args.flickr_entities_path,
        subset = subset,
        topk = (1, 5, 10, -1),
        iou_thresh = 0.5,
        merge_boxes = True,
        verbose = True,
    )
    
    evaluator.evaluate(args.annotation_file, args.prediction_file, verbose=False)
    score = evaluator.summarize()
    
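    # --prediction_file may be a single file or a directory of shards;
    # write metric.json inside the directory, or next to the file.
    if os.path.isdir(args.prediction_file):
        output_dir = args.prediction_file
    else:
        output_dir = os.path.dirname(args.prediction_file)
    with open(os.path.join(output_dir, "metric.json"), "w") as f:
        json.dump(score, f, indent=2)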

================================================
FILE: ferret/eval/eval_gpt_review_3newclass.py
================================================
import argparse
import json
import os

import openai
import time
import re
import pdb
from tqdm import tqdm

NUM_SECONDS_TO_SLEEP = 0.5
VOCAB_IMAGE_W = 1000
VOCAB_IMAGE_H = 1000

def get_eval(content: str, max_tokens: int):
    while True:
        try:
            response = openai.ChatCompletion.create(
                model='gpt-4-0314',
                messages=[{
                    'role': 'system',
                    'content': 'You are a helpful and precise assistant for checking the quality of the answer.'
                }, {
                    'role': 'user',
                    'content': content,
                }],
                temperature=0.2,  # TODO: figure out which temperature is best for evaluation
                max_tokens=max_tokens,
            )
            break
        except openai.error.RateLimitError:
            pass
        except Exception as e:
            print(e)
        time.sleep(NUM_SECONDS_TO_SLEEP)

    return response['choices'][0]['message']['content']

def postprocess_answer(answer, category):
    if category == 'refer_desc' or category == 'refer_reason':
        pattern = r'\[.*?\]'
        matches = re.findall(pattern, answer)
        for match in matches:
            answer = answer.replace(' '+match, '')
    elif category == 'ground_conv':
        pattern = r'\[.*?\]'
        matches = re.findall(pattern, answer)
        for match in matches:
            coor_cur = match.replace('[', '')
            coor_cur = coor_cur.replace(']', '')
            coor_cur = coor_cur.split(',')
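            try:
                coor_cur = [float(i.strip()) for i in coor_cur]
                if len(coor_cur) != 4:
                    raise ValueError(f'expected 4 coordinates, got {len(coor_cur)}')
            except ValueError:
                # Not a 4-number box: drop the bracketed span and skip the conversion below.
                print('Found an exception when parsing coordinates')
                answer = answer.replace(match, '')
                continue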
            converted_box_coor = [coor_cur[0]/VOCAB_IMAGE_W, coor_cur[1]/VOCAB_IMAGE_H, coor_cur[2]/VOCAB_IMAGE_W, coor_cur[3]/VOCAB_IMAGE_H]
            answer = answer.replace(match, f'[{converted_box_coor[0]:.3f}, {converted_box_coor[1]:.3f}, {converted_box_coor[2]:.3f}, {converted_box_coor[3]:.3f}]')

    return answer


def parse_score(review):
    try:
        score_pair = review.split('\n')[0]
        score_pair = score_pair.replace(',', ' ')
        # split() with no argument collapses repeated whitespace, so "8, 9" parses as two scores.
        sp = score_pair.split()
        if len(sp) == 2:
            return [float(sp[0]), float(sp[1])]
        else:
            print('error', review)
            return [-1, -1]
    except Exception as e:
        print(e)
        print('error', review)
        return [-1, -1]


if __name__ == '__main__':
    parser = argparse.ArgumentParser(description='ChatGPT-based QA evaluation.')
    parser.add_argument('-q', '--question')
    parser.add_argument('-c', '--context')
    parser.add_argument('-a', '--answer-list', nargs='+', default=[])
    parser.add_argument('-r', '--rule')
    parser.add_argument('-o', '--output')
    parser.add_argument('--max-tokens', type=int, default=1024, help='maximum number of tokens produced in the output')
    args = parser.parse_args()

    f_q = open(os.path.expanduser(args.question))
    f_ans1 = open(os.path.expanduser(args.answer_list[0]))
    f_ans2 = open(os.path.expanduser(args.answer_list[1]))
    rule_dict = json.load(open(os.path.expanduser(args.rule), 'r'))

    if os.path.isfile(os.path.expanduser(args.output)):
        cur_reviews = [json.loads(line) for line in open(os.path.expanduser(args.output))]
    else:
        cur_reviews = []

    review_file = open(os.path.expanduser(args.output), 'a')

    context_list = [json.loads(line) for line in open(os.path.expanduser(args.context))]
    image_to_context = {context['image']: context for context in context_list}

    handles = []
    idx = 0
    for ques_js, ans1_js, ans2_js in tqdm(zip(f_q, f_ans1, f_ans2)):
        ques = json.loads(ques_js)
        ans1 = json.loads(ans1_js)
        ans2 = json.loads(ans2_js)

        inst = image_to_context[ques['image']]
        # cap_str = '\n'.join(inst['captions'])
        # box_str = '\n'.join([f'{instance["category"]}: {instance["bbox"]}' for instance in inst['instances']])

        category = ques['category']
        if category in rule_dict:
            rule = rule_dict[category]
        else:
            assert False, f"Visual QA category not found in rule file: {category}."
        
        # Assume ans2 is the predicted one.
        processed_answer = postprocess_answer(ans2['text'], category)
        # pdb.set_trace()
        ans2['text'] = processed_answer
        # if category == 'refer_desc':
            
        prompt = rule['prompt']
        role = rule['role']
        content = (f'[Context]\n{inst["text"]}\n\n'
                   f'[Question]\n{ques["text"]}\n\n'
                   f'[{role} 1]\n{ans1["text"]}\n\n[End of {role} 1]\n\n'
                   f'[{role} 2]\n{ans2["text"]}\n\n[End of {role} 2]\n\n'
                   f'[System]\n{prompt}\n\n')
        # content = (f'[Context]\n{cap_str}\n\n{box_str}\n\n'
        #            f'[Question]\n{ques["text"]}\n\n'
        #            f'[{role} 1]\n{ans1["text"]}\n\n[End of {role} 1]\n\n'
        #            f'[{role} 2]\n{ans2["text"]}\n\n[End of {role} 2]\n\n'
        #            f'[System]\n{prompt}\n\n')
        cur_js = {
            'id': idx+1,
            'question_id': ques['question_id'],
            'answer1_id': ans1.get('answer_id', ans1['question_id']),
            'answer2_id': ans2.get('answer_id', ans2['question_id']),
            'category': category
        }
        if idx >= len(cur_reviews):
            review = get_eval(content, args.max_tokens)
            scores = parse_score(review)
            cur_js['content'] = review
            cur_js['tuple'] = scores
            cur_js['answer1'] = ans1["text"]
            cur_js['answer2'] = ans2["text"]
            review_file.write(json.dumps(cur_js) + '\n')
            review_file.flush()
        else:
            print(f'Skipping {idx} as we already have it.')
        idx += 1
        print(idx)
    review_file.close()


================================================
FILE: ferret/eval/eval_lvis.py
================================================
"""
Usage:
- Eval Prediction:
python ferret/eval/eval_lvis.py --pred_file=[your generated result by running ferret/eval/model_lvis.py]

"""
import argparse
import json
import os
import re
import random
from PIL import Image
import matplotlib.pyplot as plt
import matplotlib.patches as patches
import textwrap
from tqdm import tqdm
import pdb

def get_args():
    parser = argparse.ArgumentParser()
    parser.add_argument('--pred_file', type=str, default='/Users/youhaoxuan/research_misc/lvis_result/llava_answer_debug.jsonl')
    return parser.parse_args()

def remove_not_phrases_v2(text):
    # Pattern covers the start of a phrase up to and including 'not' and any following characters until a comma or period
    pattern = r"\s+not[^,.]*[,.]"
    text = re.sub(pattern, "", text)
    pattern = r"\s+no[^,.]*[,.]"
    text = re.sub(pattern, "", text)
    return text 

if __name__ == "__main__":
    args = get_args()
    # Fix the random seed
    random.seed(42)
    if os.path.isfile(args.pred_file):
        predictions = [json.loads(line) for line in open(args.pred_file)]
    elif os.path.isdir(args.pred_file):
        predictions = [json.loads(line) for pred_file in os.listdir(args.pred_file) for line in open(os.path.join(args.pred_file, pred_file))]
    else:
        raise NotImplementedError('Not supported file format.')

    total_correct = 0
    for i in tqdm(predictions):
        # Process name and synonyms
        i['name'] = i['name'].replace('_', ' ').strip()
        new_synonyms = []
        for jj in i['synonyms']:
            if '(' in jj:
                assert ')' in jj
                split_list = jj.split('(')
                assert len(split_list) == 2
                new_synonyms.append(split_list[0].replace('_', ' ').strip())
                new_synonyms.append(split_list[1].replace('_', ' ').replace(')', '').strip())
            else:
                new_synonyms.append(jj.replace('_', ' ').strip())
        i['synonyms'] = new_synonyms

        # Match Result
        processed_text = remove_not_phrases_v2(i['text'])
        # pdb.set_trace()
        if i['name'] in processed_text or any(syn_i in processed_text for syn_i in i['synonyms']):
            total_correct += 1
        else:
            pass

    acc = total_correct / len(predictions)
    print(f'Acc:{acc*100:.3f}%')
    # pdb.set_trace()


================================================
FILE: ferret/eval/eval_pope.py
================================================
"""
Usage:

python ferret/eval/eval_pope.py \
    --prediction_file final_result/ferret_13b_checkpoint-final/pope_result/coco_pope_adversarial \
    --annotation_file data/pope/coco_pope_adversarial.json

python ferret/eval/eval_pope.py \
    --prediction_file final_result/ferret_13b_checkpoint-final/pope_result/coco_pope_popular \
    --annotation_file data/pope/coco_pope_popular.json

python ferret/eval/eval_pope.py \
    --prediction_file final_result/ferret_13b_checkpoint-final/pope_result/coco_pope_random \
    --annotation_file data/pope/coco_pope_random.json

"""
import os
import json

def evaluate_pope(prediction_file, annotation_file):
    # get the predictions
    if os.path.isfile(prediction_file):
        answers = [json.loads(line) for line in open(prediction_file)]
    elif os.path.isdir(prediction_file):
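        # Skip metric.json, which a previous run of this script writes into the same directory.
        answers = [json.loads(line) for pred_file in sorted(os.listdir(prediction_file)) if pred_file != 'metric.json' for line in open(os.path.join(prediction_file, pred_file))]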
    else:
        raise NotImplementedError('Not supported file format.')

    label_list = [json.loads(q)['label'] for q in open(annotation_file, 'r')]

    for answer in answers:
        text = answer['answer']

        # Only keep the first sentence
        if text.find('.') != -1:
            text = text.split('.')[0]

        text = text.replace(',', '')
        words = text.split(' ')
        if 'No' in words or 'not' in words or 'no' in words:
            answer['answer'] = 'no'
        else:
            answer['answer'] = 'yes'

    for i in range(len(label_list)):
        if label_list[i] == 'no':
            label_list[i] = 0
        else:
            label_list[i] = 1

    pred_list = []
    for answer in answers:
        if answer['answer'] == 'no':
            pred_list.append(0)
        else:
            pred_list.append(1)

    pos = 1
    neg = 0
    yes_ratio = pred_list.count(1) / len(pred_list)

    TP, TN, FP, FN = 0, 0, 0, 0
    for pred, label in zip(pred_list, label_list):
        if pred == pos and label == pos:
            TP += 1
        elif pred == pos and label == neg:
            FP += 1
        elif pred == neg and label == neg:
            TN += 1
        elif pred == neg and label == pos:
            FN += 1

    print('TP\tFP\tTN\tFN\t')
    print('{}\t{}\t{}\t{}'.format(TP, FP, TN, FN))

    precision = float(TP) / float(TP + FP)
    recall = float(TP) / float(TP + FN)
    f1 = 2*precision*recall / (precision + recall)
    acc = (TP + TN) / (TP + TN + FP + FN)
    print('Accuracy: {}'.format(acc))
    print('Precision: {}'.format(precision))
    print('Recall: {}'.format(recall))
    print('F1 score: {}'.format(f1))
    print('Yes ratio: {}'.format(yes_ratio))

    score = {"Accuracy": acc, 
             "Precision": precision,
             "Recall": recall,
             "F1 score": f1,
             "Yes ratio": yes_ratio,
             }

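    # Use the function argument rather than the global `args`, and fall back to the
    # parent directory when prediction_file is a single file.
    output_dir = prediction_file if os.path.isdir(prediction_file) else os.path.dirname(prediction_file)
    with open(os.path.join(output_dir, 'metric.json'), "w") as f: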
        json.dump(score, f, indent=2)


if __name__ == '__main__':
    import argparse

    parser = argparse.ArgumentParser()
    parser.add_argument('--prediction_file', help='prediction_file')
    parser.add_argument('--annotation_file', default='/path/to/json_annotations', help='annotation_file')
    
    args = parser.parse_args()
    evaluate_pope(args.prediction_file, args.annotation_file)

================================================
FILE: ferret/eval/eval_refexp.py
================================================
"""
Usage:

python ferret/eval/eval_refexp.py \
    --prediction_file final_result/ferret_13b_checkpoint-final/refexp_result/finetune_refcocog_test \
    --annotation_file data/annotations/finetune_refcocog_test.json

"""
import os
import copy
from collections import defaultdict
from pathlib import Path
from tqdm import tqdm

import torch
import torch.utils.data
from typing import Any, Dict, List, Optional, Sequence, Tuple, Union
from prettytable import PrettyTable

import re
import json

from misc.refcoco.box_ops import generalized_box_iou, box_iou

VOCAB_IMAGE_W = 1000
VOCAB_IMAGE_H = 1000


def resize_bbox(box, image_w=None, image_h=None):
    ratio_w = image_w * 1.0 / VOCAB_IMAGE_W
    ratio_h = image_h * 1.0 / VOCAB_IMAGE_H

    new_box = [int(box[0] * ratio_w), int(box[1] * ratio_h), \
               int(box[2] * ratio_w), int(box[3] * ratio_h)]
    return new_box


def decode_bbox_from_caption(text, img_w, img_h, verbose=False):
    entities = []
    boxes = []
    
    start = 0
    in_brackets = False
    entity = ""
    box = ""
    
    for i, char in enumerate(text):
        if char == '[':
            in_brackets = True
            entity = text[start:i].strip()
            start = i + 1
        elif char == ']':
            in_brackets = False
            box = text[start:i].strip()
            start = i + 1
            
            # Convert box string to list of integers
            box_list = list(map(int, box.split(',')))
            resized_box_list = resize_bbox(box_list, img_w, img_h)
            entities.append(entity)
            boxes.append(resized_box_list)
            
            # Skip until the next entity (ignoring periods or other delimiters)
            while start < len(text) and text[start] not in ['.', ',', ';', '!', '?']:
                start += 1
            start += 1  # Skip the delimiter
        
    return entities, boxes


def are_phrases_similar(phrase1, phrase2):
    # Step 1: Convert to lower case
    phrase1 = phrase1.lower()
    phrase2 = phrase2.lower()
    
    # Step 2: Standardize spacing around punctuation
    phrase1 = re.sub(r'\s*([\'",.;!?|:])\s*', r'\1 ', phrase1).strip()
    phrase2 = re.sub(r'\s*([\'",.;!?|:])\s*', r'\1 ', phrase2).strip()
    
    # Step 3: Remove all punctuation
    phrase1 = re.sub(r'[^\w\s]', '', phrase1)
    phrase2 = re.sub(r'[^\w\s]', '', phrase2)
    
    # Step 4: Remove extra white spaces
    phrase1 = ' '.join(phrase1.split())
    phrase2 = ' '.join(phrase2.split())
    
    return phrase1 == phrase2


class RefExpEvaluatorFromJsonl(object):
    def __init__(self, refexp_gt_path, k=(1, -1), thresh_iou=0.5):
        assert isinstance(k, (list, tuple))
        with open(refexp_gt_path, 'r') as f:
            self.refexp_gt = json.load(f)
        self.img_ids = [item['id'] for item in self.refexp_gt['images']]
        print(f"Load {len(self.img_ids)} images")
        print(f"Load {len(self.refexp_gt['annotations'])} annotations")
        self.k = k
        self.thresh_iou = thresh_iou

    def summarize(self,
                  prediction_file: str,
                  verbose: bool = False,):
        
        # get the predictions
        if os.path.isfile(prediction_file):
            predictions = [json.loads(line) for line in open(prediction_file)]
        elif os.path.isdir(prediction_file):
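            # Skip metric.json, which a previous run of this script writes into the same directory.
            predictions = [json.loads(line) for pred_file in os.listdir(prediction_file) if pred_file != 'metric.json' for line in open(os.path.join(prediction_file, pred_file))]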
        else:
            raise NotImplementedError('Not supported file format.')
        
        # sort the predictions based on 'image_id'
        predictions = sorted(predictions, key=lambda x: x['image_id'])

        predict_index = 0
        
        dataset2score = {
            "refcoco": {k: 0.0 for k in self.k},
            "refcoco+": {k: 0.0 for k in self.k},
            "refcocog": {k: 0.0 for k in self.k},
        }
        dataset2count = {"refcoco": 0.0, "refcoco+": 0.0, "refcocog": 0.0}
        for item_img, item_ann in tqdm(zip(self.refexp_gt['images'], self.refexp_gt['annotations'])):

            # quit when evaluating all predictions
            if predict_index == len(predictions):
                break
                
            if item_img['id'] != item_ann['image_id']:
                raise ValueError(f"Ann\n{item_ann} \nis not matched\n {item_img}")
            
            dataset_name = item_img['dataset_name']
            img_height = item_img['height']
            img_width = item_img['width']
            caption = item_img['caption']
            target_bbox = item_ann["bbox"]
            converted_bbox = [
                target_bbox[0],
                target_bbox[1],
                target_bbox[2] + target_bbox[0],
                target_bbox[3] + target_bbox[1],
            ]
            target_bbox = torch.as_tensor(converted_bbox).view(-1, 4)

            prediction = predictions[predict_index]["text"]
            try:
                entities, boxes = decode_bbox_from_caption(prediction, img_width, img_height, verbose=verbose)
            except ValueError as e:
                entities, boxes = [], []

            predict_boxes = []
            for (entity, box) in zip(entities, boxes):
                if not are_phrases_similar(entity, caption):
                    if len(box) > 0:
                        predict_boxes.append(box)
                else:
                    predict_boxes.append(box)
            
            if len(predict_boxes) == 0:
                print(f"Can't find valid bbox for the given phrase {caption}, \n{entities, boxes}")
                print("We set a 0-area box to calculate result")
                predict_boxes = [[0., 0., 0., 0.]]

            predict_boxes = torch.as_tensor(predict_boxes).view(-1, 4).to(dtype=torch.float32)
            
            iou, _ = box_iou(predict_boxes, target_bbox)
            mean_iou, _ = box_iou(predict_boxes.mean(0).view(-1, 4), target_bbox)
            for k in self.k:
                if k == 'upper bound':
                    if max(iou) >= self.thresh_iou:
                        dataset2score[dataset_name][k] += 1.0
                elif k == 'mean':
                    if max(mean_iou) >= self.thresh_iou:
                        dataset2score[dataset_name][k] += 1.0
                else:
                    if max(iou[0, :k]) >= self.thresh_iou:
                        dataset2score[dataset_name][k] += 1.0

            dataset2count[dataset_name] += 1.0
            predict_index += 1

        for key, value in dataset2score.items():
            for k in self.k:
                try:
                    value[k] /= dataset2count[key]
                except ZeroDivisionError:
                    # No samples were evaluated for this dataset; keep the score at 0.0.
                    pass
                
        results = {}
        for key, value in dataset2score.items():
            # Keep the scores in the order of self.k so they line up with the labels printed below.
            results[key] = [value[k] for k in self.k]
            print(f" Dataset: {key} - Precision @ 1, mean, all: {results[key]} \n")
        
        return results


if __name__ == '__main__':
    import argparse

    parser = argparse.ArgumentParser()
    parser.add_argument('--prediction_file', help='prediction_file')
    parser.add_argument('--annotation_file', default='/path/to/json_annotations', help='annotation_file')
    
    args = parser.parse_args()
    
    evaluator = RefExpEvaluatorFromJsonl(
        refexp_gt_path=args.annotation_file, 
        k=(1, 'mean', 'upper bound'), 
        thresh_iou=0.5,
    )
    
    results = evaluator.summarize(args.prediction_file, verbose=False)
    
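    # Write next to the predictions; fall back to the parent directory for a single file.
    output_dir = args.prediction_file if os.path.isdir(args.prediction_file) else os.path.dirname(args.prediction_file)
    with open(os.path.join(output_dir, "metric.json"), "w") as f: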
        json.dump(results, f, indent=2)

================================================
FILE: ferret/eval/ferret_gpt4_data/ground_conv/answer.jsonl
================================================
{"question_id": 0, "image": "000000125472.jpg", "category": "ground_conv", "text": "The man [0.201, 0.002, 0.940, 0.758] is performing a trick on a skateboard [0.012, 0.746, 0.664, 0.886]. He appears to be in mid-air, indicating that he is probably performing a jump or some sort of aerial trick."}
{"question_id": 2, "image": "000000361551.jpg", "category": "ground_conv", "text": "There are people [0.005, 0.562, 0.616, 0.824] standing in the service area [0.003, 0.416, 0.995, 0.996] of the airport, which appears to be a runway. A man [0.659, 0.572, 0.920, 0.844] is walking down the stairs [0.352, 0.676, 1.000, 0.994] of an unloading ramp. Another man [0.008, 0.554, 0.139, 0.800] is holding his baggage [0.107, 0.662, 0.179, 0.750] and bending over. There's also a woman [0.917, 0.610, 1.000, 0.724] dressed in a sleeveless black top [0.960, 0.626, 1.000, 0.668] with a black and white headband [0.923, 0.628, 0.952, 0.646]. There are several pieces of luggage [0.368, 0.702, 0.456, 0.780] [0.907, 0.706, 0.973, 0.786] on the ground [0.000, 0.282, 1.000, 0.976]."}
{"question_id": 3, "image": "000000184400.jpg", "category": "ground_conv", "text": "There is a train [0.002, 0.408, 1.000, 0.683] on the bridge [0.002, 0.608, 0.988, 0.877]."}
{"question_id": 4, "image": "000000276018.jpg", "category": "ground_conv", "text": "In the image, there are several children each holding a stuffed animal. A boy [0.102, 0.388, 0.498, 1.000] is holding a stuffed dog [0.391, 0.506, 0.622, 0.714], and he is wearing a black jacket [0.077, 0.520, 0.495, 0.910]. Another boy [0.717, 0.188, 1.000, 0.864] is holding a white stuffed animal [0.695, 0.356, 0.868, 0.580] wearing a red jacket. There is also a girl [0.372, 0.196, 0.603, 0.922] holding a grey stuffed dog [0.286, 0.298, 0.517, 0.422]. Another child [0.342, 0.390, 0.622, 1.000] is seen holding up a white stuffed animal [0.286, 0.298, 0.517, 0.422]. Furthermore, there is a baby [0.385, 0.034, 0.643, 0.434] being held by a lady [0.286, 0.000, 0.683, 0.560]."}
{"question_id": 5, "image": "000000356424.jpg", "category": "ground_conv", "text": "The man [0.075, 0.102, 0.704, 0.716] is sitting at a table [0.000, 0.592, 0.997, 1.000] and looking at a plate of food [0.416, 0.726, 0.856, 0.904]. There is a glass [0.275, 0.716, 0.501, 0.998] and a bottle [0.048, 0.712, 0.195, 1.002] on the table in front of him."}
{"question_id": 6, "image": "000000458755.jpg", "category": "ground_conv", "text": "The girl [0.112, 0.091, 0.868, 0.992] is petting a sheep [0.000, 0.003, 0.704, 0.320]. The hand [0.418, 0.373, 0.548, 0.592] of the girl is on the sheep."}
{"question_id": 7, "image": "000000069138.jpg", "category": "ground_conv", "text": "The main features of the building [0.000, 0.000, 1.000, 0.466] include a door [0.110, 0.370, 0.266, 0.518] with a picture [0.155, 0.378, 0.259, 0.442] on it. There's a stop light [0.366, 0.236, 0.638, 0.394] with a sad face [0.383, 0.244, 0.614, 0.384] on it nearby the building. In front of the building, there is a road [0.000, 0.490, 1.000, 1.000] with arrows [0.000, 0.616, 0.214, 0.644] painted on it. Next to the road, there is a sidewalk [0.666, 0.572, 0.993, 0.618]."}
{"question_id": 8, "image": "000000003156.jpg", "category": "ground_conv", "text": "The man [0.000, 0.024, 0.835, 1.002] in this image appears to be installing or fixing a toilet [0.564, 0.574, 1.000, 0.974]. He is wearing gloves for the task and is kneeling next to the toilet."}
{"question_id": 9, "image": "000000131138.jpg", "category": "ground_conv", "text": "On the desk, there are several items. There is a computer mouse [0.414, 0.753, 0.470, 0.811], a keyboard [0.415, 0.620, 0.650, 0.783], and a cup [0.350, 0.783, 0.417, 0.906]. A plate [0.183, 0.799, 0.326, 0.896] with a fork [0.203, 0.794, 0.270, 0.857] on it is also on the desk."}
{"question_id": 10, "image": "000000259097.jpg", "category": "ground_conv", "text": "The man [0.390, 0.432, 0.466, 0.793] is jumping in the grass [0.000, 0.610, 0.998, 0.997] to catch a Frisbee [0.400, 0.354, 0.446, 0.381] in the air. He is extending his arm [0.416, 0.459, 0.432, 0.565] to reach the Frisbee. He is wearing a shirt [0.402, 0.468, 0.458, 0.649] and pants [0.390, 0.658, 0.424, 0.763]. His shadow [0.492, 0.724, 0.622, 0.994] can be seen in the grass."}
{"question_id": 11, "image": "000000377882.jpg", "category": "ground_conv", "text": "The image depicts a small harbor scene where multiple boats [0.000, 0.461, 0.354, 0.579] [0.348, 0.501, 0.874, 0.789] [0.302, 0.461, 0.684, 0.611] are docked on land next to a body of water [0.000, 0.259, 1.000, 0.469]. There is a chain-link fence [0.002, 0.176, 0.996, 0.995] enclosing the boats. There are also surfboards [0.830, 0.448, 0.996, 0.552] [0.420, 0.384, 0.502, 0.411] [0.910, 0.768, 0.998, 0.877] [0.430, 0.344, 0.508, 0.371] [0.830, 0.565, 1.000, 0.712] [0.322, 0.307, 0.450, 0.341] [0.766, 0.251, 0.998, 0.368] [0.764, 0.704, 0.998, 0.829] and a rack [0.754, 0.168, 1.000, 0.901] in the scene. In the background, you can see a skyline of buildings [0.692, 0.195, 0.718, 0.248] [0.888, 0.173, 0.922, 0.227] [0.582, 0.211, 0.610, 0.256] [0.180, 0.259, 0.202, 0.293] [0.466, 0.208, 0.518, 0.272] under the blue sky [0.000, 0.000, 0.998, 0.317]."}
{"question_id": 12, "image": "000000484415.jpg", "category": "ground_conv", "text": "The man [0.000, 0.133, 0.600, 0.992] is interacting with a toilet [0.016, 0.042, 0.719, 0.996]. He is reaching out his hand [0.281, 0.125, 0.603, 0.562] to flush the toilet using the flusher [0.534, 0.092, 0.628, 0.300] located on top of the toilet tank [0.019, 0.021, 0.706, 0.579]."}
{"question_id": 13, "image": "000000184384.jpg", "category": "ground_conv", "text": "There is a blueberry cake [0.238, 0.093, 0.786, 0.787] topped with butter [0.454, 0.024, 0.638, 0.288] placed on a plate [0.166, 0.453, 1.000, 1.000] which is on the table [0.002, 0.365, 0.998, 0.997]. On the same table, there is another plate [0.628, 0.120, 0.998, 0.389] containing a mix of food [0.632, 0.123, 0.996, 0.336] including an egg [0.636, 0.125, 0.880, 0.267] and a sausage [0.766, 0.248, 0.984, 0.333]. There is also a cup [0.002, 0.000, 0.202, 0.667] of water [0.000, 0.000, 0.202, 0.667] on the table."}
{"question_id": 14, "image": "000000341058.jpg", "category": "ground_conv", "text": "The items placed on the table are a pair of napkins [0.541, 0.818, 0.601, 0.858], a pepper shaker [0.594, 0.822, 0.619, 0.854], and a salt shaker [0.612, 0.824, 0.637, 0.854]."}
{"question_id": 15, "image": "000000349184.jpg", "category": "ground_conv", "text": "The image shows a woman [0.009, 0.194, 0.497, 0.888] sitting on a wooden bench [0.000, 0.324, 0.731, 0.994] in a park [0.000, 0.002, 0.997, 1.000] during daytime. The park appears to have a lot of trees [0.554, 0.000, 0.997, 0.376] and there are people [0.386, 0.438, 0.449, 0.504] walking in front of the woman. The woman's purse [0.458, 0.488, 0.605, 0.694] is also on the bench next to her. The park seems to be enclosed by a fence [0.719, 0.310, 0.997, 0.372] and there is a building [0.090, 0.000, 0.686, 0.094] behind the trees."}
{"question_id": 16, "image": "000000516143.jpg", "category": "ground_conv", "text": "The image features a green and white bus [0.100, 0.376, 0.866, 0.805] that is driving down a city street."}
{"question_id": 17, "image": "000000159311.jpg", "category": "ground_conv", "text": "There are two zebras in the image. The first zebra [0.000, 0.000, 0.622, 0.790] and the second zebra [0.002, 0.000, 0.684, 0.682]."}
{"question_id": 18, "image": "000000553990.jpg", "category": "ground_conv", "text": "The person riding the horse is wearing boots [0.328, 0.339, 0.416, 0.492] and a shirt [0.388, 0.150, 0.508, 0.279]. They are also wearing a helmet [0.484, 0.096, 0.560, 0.162]. The person [0.320, 0.078, 0.552, 0.502] is riding the horse."}
{"question_id": 19, "image": "000000273493.jpg", "category": "ground_conv", "text": "The two men [0.144, 0.360, 0.246, 0.736] [0.730, 0.474, 0.780, 0.613] are playing a game of tennis [0.012, 0.384, 0.984, 0.934]. The first man is wearing a white shirt [0.164, 0.411, 0.222, 0.547], gray shorts [0.162, 0.535, 0.220, 0.628], and black sneakers [0.180, 0.709, 0.216, 0.739]. The second man is wearing white clothing [0.734, 0.492, 0.778, 0.601] and white sneakers [0.762, 0.598, 0.776, 0.613]. They are on a tennis court [0.000, 0.372, 0.988, 0.979] and are currently hitting a ball [0.640, 0.399, 0.648, 0.411] with their tennis rackets [0.214, 0.574, 0.238, 0.619] [0.768, 0.526, 0.808, 0.556]."}
{"question_id": 20, "image": "000000452122.jpg", "category": "ground_conv", "text": "The airplane [0.112, 0.300, 0.858, 0.640] is in the air, currently in flight. This can be inferred from the position of the sky [0.000, 0.000, 0.998, 1.000]. However, the airplane's landing gear [0.450, 0.592, 0.600, 0.643] is lowered, which indicates the airplane is in the process of landing or taking off."}
{"question_id": 21, "image": "000000134722.jpg", "category": "ground_conv", "text": "This image is set at a train station. There's a train [0.290, 0.400, 0.998, 0.784] moving on the railway tracks [0.000, 0.752, 0.520, 0.944]. The train station has a platform with an awning [0.000, 0.299, 0.132, 0.403] and a bench under a canopy [0.000, 0.301, 0.136, 0.400]. Near the platform, there is a building [0.000, 0.299, 0.204, 0.659] with a red brick wall [0.000, 0.392, 0.206, 0.611]. The station is surrounded by trees [0.208, 0.253, 0.322, 0.653] and the sky [0.000, 0.000, 0.998, 0.560] above shows some clouds [0.374, 0.067, 0.920, 0.312]."}
{"question_id": 22, "image": "000000360960.jpg", "category": "ground_conv", "text": "There are three people in the image. One person [0.066, 0.162, 0.318, 0.686] is wearing a black uniform [0.000, 0.222, 0.126, 0.646] and a hat [0.006, 0.162, 0.072, 0.198]. Another person [0.390, 0.344, 0.838, 0.894] is wearing a long black coat [0.405, 0.332, 0.835, 0.746] and pants [0.523, 0.736, 0.739, 0.858]. The last person [0.853, 0.154, 1.000, 0.650] is wearing jeans [0.853, 0.422, 1.000, 0.632]."}
{"question_id": 23, "image": "000000179765.jpg", "category": "ground_conv", "text": "Sure, the bike [0.146, 0.109, 0.938, 0.803] is a Honda, as indicated by the Honda logo [0.322, 0.395, 0.378, 0.419]. It has a front wheel [0.150, 0.424, 0.366, 0.635] and a back tire [0.574, 0.496, 0.860, 0.800]. It also has a light [0.894, 0.411, 0.944, 0.520]. The bike features a shock absorber [0.626, 0.501, 0.698, 0.680] for smooth riding. It also has a handle [0.284, 0.109, 0.390, 0.384] for steering and a display [0.240, 0.275, 0.290, 0.328] for the rider's information. Not to forget the sylencer [0.462, 0.645, 0.816, 0.779] near the back tire."}
{"question_id": 24, "image": "000000332318.jpg", "category": "ground_conv", "text": "The setting is a mountainous region. There is a large mountain [0.000, 0.057, 0.992, 0.782] with a snow-covered peak [0.744, 0.042, 0.898, 0.119]. In front of the mountain, there is a lush pasture [0.000, 0.815, 0.984, 1.000] where cows [0.548, 0.860, 0.574, 0.896] [0.436, 0.860, 0.454, 0.890] are grazing. There are trailers [0.796, 0.910, 0.894, 0.997] [0.632, 0.899, 0.742, 0.994] in the pasture, probably for animal equipment and transportation. There are also trees [0.740, 0.409, 1.000, 0.982] around the area. All of this is under a clear sky [0.000, 0.000, 1.002, 0.257]."}
{"question_id": 25, "image": "000000305695.jpg", "category": "ground_conv", "text": "The zebras [0.730, 0.496, 0.796, 0.581] are in a fenced area [0.464, 0.531, 0.934, 0.848]. Near them, there is a truck [0.000, 0.416, 0.210, 0.805] on the road [0.180, 0.709, 0.432, 0.957]. They are also surrounded by trees [0.128, 0.000, 0.592, 0.597] and grass [0.544, 0.659, 0.840, 0.859]."}
{"question_id": 26, "image": "000000326174.jpg", "category": "ground_conv", "text": "The boy [0.792, 0.480, 0.938, 0.853] is holding a surfboard [0.790, 0.587, 0.960, 0.691]."}
{"question_id": 27, "image": "000000562207.jpg", "category": "ground_conv", "text": "There are three people in the image. One man [0.164, 0.455, 0.292, 0.997] is standing on the side [0.236, 0.675, 0.994, 0.997] wearing shorts [0.174, 0.699, 0.254, 0.864]. Another man [0.582, 0.476, 0.662, 0.870] is standing beside the elephant [0.328, 0.157, 0.638, 0.967] wearing a shirt [0.582, 0.521, 0.650, 0.681]. There is also a woman [0.288, 0.473, 0.420, 0.967] wearing a top [0.302, 0.539, 0.358, 0.696] touching the elephant. They are all on the side of a body of water [0.000, 0.488, 0.994, 1.000]."}
{"question_id": 28, "image": "000000543300.jpg", "category": "ground_conv", "text": "The boat [0.048, 0.552, 0.928, 0.819] is white and is of a large size. It has multiple levels [0.000, 0.709, 1.000, 0.829] [0.068, 0.616, 0.852, 0.688]. The side of the boat has a set of long black windows [0.374, 0.733, 0.790, 0.765]. Further, it has a silver railing [0.094, 0.557, 0.728, 0.624] [0.238, 0.597, 0.744, 0.627] on the top level. There are also red letters [0.414, 0.693, 0.654, 0.725] and blue water symbols [0.268, 0.688, 0.350, 0.779] on the side of the boat."}
{"question_id": 29, "image": "000000241668.jpg", "category": "ground_conv", "text": "A person [0.490, 0.136, 0.825, 0.998] with red hair [0.507, 0.142, 0.791, 0.642] is holding a cake [0.630, 0.670, 0.772, 0.750]. She is wearing a suit jacket [0.490, 0.422, 0.799, 0.998] and a necktie [0.571, 0.442, 0.674, 0.936]."}
{"question_id": 30, "image": "000000535578.jpg", "category": "ground_conv", "text": "In the image, there is a group of sheep [0.532, 0.546, 0.646, 0.662] [0.532, 0.666, 0.817, 0.810] grazing in a field [0.000, 0.002, 0.994, 0.998]. The field is bordered by a stone wall [0.000, 0.000, 0.769, 0.180] and is filled with plant life [0.000, 0.764, 0.601, 0.998]. There is also a bush [0.480, 0.000, 0.748, 0.084] and some trees [0.736, 0.036, 0.835, 0.100] present in the field. A few rocks [0.727, 0.410, 0.808, 0.470] can also be spotted in the scene."}
{"question_id": 31, "image": "000000443969.jpg", "category": "ground_conv", "text": "A child [0.408, 0.168, 0.606, 0.786] is holding the umbrella [0.296, 0.038, 0.782, 0.360]."}
{"question_id": 32, "image": "000000329219.jpg", "category": "ground_conv", "text": "There is a man [0.274, 0.000, 0.517, 0.792] standing in the kitchen [0.000, 0.000, 0.750, 0.849]. He is standing next to a counter [0.000, 0.329, 0.576, 0.398]. On the floor of the kitchen [0.000, 0.713, 1.000, 1.000], there is a small dog [0.462, 0.593, 0.568, 0.842]. Mugs [0.509, 0.123, 0.595, 0.266] are hanging on the wall [0.506, 0.019, 0.607, 0.384]. There is also a blender [0.015, 0.165, 0.080, 0.307] on the counter."}
{"question_id": 33, "image": "000000421923.jpg", "category": "ground_conv", "text": "There are several books [0.414, 0.208, 0.538, 0.364] [0.360, 0.202, 0.417, 0.360] [0.435, 0.480, 0.712, 0.578] and a bowl [0.072, 0.030, 0.288, 0.076] on the top shelf [0.000, 0.028, 0.607, 0.202]. On the middle shelf [0.207, 0.334, 0.997, 0.380], there are more books [0.414, 0.208, 0.538, 0.364] [0.360, 0.202, 0.417, 0.360]. The bottom shelf [0.324, 0.528, 0.997, 0.624] contains a stack of books [0.435, 0.480, 0.712, 0.578]."}
{"question_id": 34, "image": "000000376900.jpg", "category": "ground_conv", "text": "The man [0.163, 0.274, 0.491, 0.936] is preparing to serve a tennis ball. He is holding a tennis racket [0.235, 0.578, 0.304, 0.664] in his hand [0.253, 0.648, 0.299, 0.680]. He is wearing a cap [0.171, 0.388, 0.253, 0.476] on his head [0.173, 0.408, 0.256, 0.474], and shorts [0.216, 0.628, 0.432, 0.782]. His shadow [0.397, 0.898, 0.968, 0.956] is cast in front of him."}
{"question_id": 35, "image": "000000513567.jpg", "category": "ground_conv", "text": "The image shows two women [0.102, 0.099, 0.486, 0.984] [0.502, 0.000, 0.982, 0.997], both of them are eating hot dogs [0.190, 0.587, 0.350, 0.741] [0.676, 0.315, 0.882, 0.408]. One of the women is wearing sunglasses [0.630, 0.005, 0.794, 0.048] on her head. They seem to be standing on a street [0.042, 0.403, 0.092, 0.520], potentially walking while enjoying their meal."}
{"question_id": 36, "image": "000000058393.jpg", "category": "ground_conv", "text": "There is a man [0.542, 0.343, 0.812, 0.493] and a woman [0.644, 0.377, 0.834, 0.863] sitting on a bench [0.070, 0.493, 0.932, 0.960]. They are looking at the ocean [0.028, 0.319, 0.972, 0.821]. The man has his arm [0.658, 0.462, 0.828, 0.496] around the woman."}
{"question_id": 37, "image": "000000010764.jpg", "category": "ground_conv", "text": "The catcher [0.334, 0.193, 0.756, 0.940] is squatting on the field, ready to catch a ball with his gloved hand extended."}
{"question_id": 38, "image": "000000271402.jpg", "category": "ground_conv", "text": "One girl [0.329, 0.148, 0.973, 0.892] is holding a tennis racket [0.462, 0.480, 0.713, 0.840] and the other girl [0.057, 0.102, 0.456, 0.898] is standing next to a scooter [0.097, 0.424, 0.592, 0.996]."}
{"question_id": 39, "image": "000000018519.jpg", "category": "ground_conv", "text": "The skater has taken several safety measures. He is wearing a helmet [0.358, 0.354, 0.448, 0.422] which has a sticker [0.408, 0.358, 0.438, 0.368]. He also has a pad [0.540, 0.362, 0.595, 0.420] on his body. His elbow is protected by another pad [0.376, 0.512, 0.443, 0.554] and his knee is protected by a knee pad [0.450, 0.542, 0.512, 0.598]. He is also wearing a wrist brace [0.279, 0.524, 0.338, 0.564]. The skater is also wearing roller skates [0.647, 0.490, 0.709, 0.584] for the activity."}
{"question_id": 40, "image": "000000106048.jpg", "category": "ground_conv", "text": "The objects on the bus include a design [0.228, 0.422, 0.438, 0.560], side mirrors [0.488, 0.314, 0.530, 0.428] [0.790, 0.332, 0.818, 0.455], wheels [0.266, 0.545, 0.294, 0.677] [0.248, 0.551, 0.264, 0.668] [0.444, 0.578, 0.472, 0.751], windows [0.510, 0.216, 0.796, 0.548] and a windshield [0.518, 0.222, 0.782, 0.545]. The bus [0.222, 0.144, 0.820, 0.757] itself."}


================================================
FILE: ferret/eval/ferret_gpt4_data/ground_conv/context.jsonl
================================================
{"question_id": 0, "image": "000000125472.jpg", "category": "ground_conv", "text": "Objects:\nObject 0 : axle at [0.447, 0.814, 0.535, 0.856].\nObject 1 : background at [0.003, 0.744, 0.994, 0.988].\nObject 2 : bracelet at [0.820, 0.444, 0.859, 0.470].\nObject 3 : building at [0.012, 0.888, 0.099, 0.994].\nObject 4 : corner at [0.027, 0.890, 0.117, 0.992].\nObject 5 : fence at [0.030, 0.886, 1.000, 1.000].\nObject 6 : hair at [0.486, 0.078, 0.712, 0.216].\nObject 7 : jean pants at [0.246, 0.380, 0.841, 0.632].\nObject 8 : laces at [0.168, 0.562, 0.850, 0.674].\nObject 9 : logo at [0.429, 0.232, 0.583, 0.364].\nObject 10 : man at [0.201, 0.002, 0.940, 0.758].\nObject 11 : name at [0.000, 0.960, 0.321, 1.000].\nObject 12 : picture at [0.003, 0.004, 1.000, 0.998].\nObject 13 : poles at [0.180, 0.886, 0.432, 0.990].\nObject 14 : shirt at [0.324, 0.124, 0.694, 0.392].\nObject 15 : shoes at [0.189, 0.606, 0.946, 0.792].\nObject 16 : skateboard at [0.012, 0.746, 0.664, 0.886].\nObject 17 : sky at [0.012, 0.002, 1.000, 0.918].\nObject 18 : stadium lights at [0.147, 0.860, 0.456, 0.994].\nObject 19 : stitching at [0.312, 0.408, 0.754, 0.638].\nObject 20 : strip at [0.279, 0.770, 0.529, 0.802].\nObject 21 : top at [0.024, 0.830, 0.420, 0.936].\nObject 22 : trees at [0.024, 0.846, 1.000, 1.000].\nObject 23 : wheels at [0.012, 0.808, 0.586, 0.904].\nObject 24 : wrist at [0.802, 0.434, 0.856, 0.484].\n\nRelationships:\nobject 2 : bracelet -> on mans -> object 24 : wrist.\nobject 23 : wheels -> on a -> object 16 : skateboard.\nobject 14 : shirt -> has a -> object 9 : logo.\nobject 10 : man -> doing trick on -> object 16 : skateboard.\nobject 3 : building -> behind a -> object 5 : fence.\nobject 11 : name -> on -> object 12 : picture.\nobject 11 : name -> has a -> object 11 : name.\nobject 10 : man -> performing on a -> object 16 : skateboard.\nobject 4 : corner -> of -> object 3 : building.\nobject 18 : stadium lights -> are on -> object 13 : poles.\nobject 16 : skateboard -> 
has -> object 23 : wheels.\nobject 2 : bracelet -> on mans -> object 24 : wrist.\nobject 11 : name -> on -> object 12 : picture.\nobject 16 : skateboard -> under -> object 10 : man.\nobject 10 : man -> wearing -> object 15 : shoes.\nobject 3 : building -> behind -> object 5 : fence.\nobject 22 : trees -> in -> object 1 : background.\nobject 15 : shoes -> have -> object 8 : laces.\nobject 18 : stadium lights -> on -> object 13 : poles.\nobject 5 : fence -> behind -> object 10 : man.\nobject 20 : strip -> on -> object 16 : skateboard.\nobject 19 : stitching -> on -> object 7 : jean pants.\nobject 9 : logo -> on -> object 14 : shirt.\nobject 23 : wheels -> on -> object 16 : skateboard.\nobject 0 : axle -> on -> object 16 : skateboard.\nobject 21 : top -> of -> object 22 : trees.\n\nRegion Description:\nRegion Description at [0.030, 0.774, 0.643, 0.912] : a black skateboard with black wheels.\n\nGlobal Caption:\nA man flying through the air while riding a skateboard.\nA man is doing tricks on a skateboard.\nA skateboarder jumps while trying to perform a trick.\na man in the air standing above the skateboard\na person attempting a jump with a skateboard"}
{"question_id": 2, "image": "000000361551.jpg", "category": "ground_conv", "text": "Objects:\nObject 0 : baggage at [0.107, 0.662, 0.179, 0.750].\nObject 1 : baggage at [0.368, 0.706, 0.456, 0.782].\nObject 2 : building at [0.000, 0.000, 0.997, 0.326].\nObject 3 : cap at [0.784, 0.544, 0.824, 0.568].\nObject 4 : duffel bag at [0.584, 0.702, 0.643, 0.768].\nObject 5 : ground at [0.000, 0.282, 1.000, 0.976].\nObject 6 : hair at [0.920, 0.614, 0.973, 0.640].\nObject 7 : headband at [0.923, 0.628, 0.952, 0.646].\nObject 8 : jacket at [0.776, 0.568, 0.840, 0.642].\nObject 9 : line at [0.696, 0.750, 0.989, 0.794].\nObject 10 : lines at [0.000, 0.436, 0.851, 0.486].\nObject 11 : luggage at [0.907, 0.706, 0.973, 0.786].\nObject 12 : luggage at [0.368, 0.702, 0.456, 0.780].\nObject 13 : man at [0.008, 0.554, 0.139, 0.800].\nObject 14 : man at [0.659, 0.572, 0.920, 0.844].\nObject 15 : man at [0.771, 0.538, 0.843, 0.640].\nObject 16 : pavement at [0.003, 0.308, 0.992, 0.566].\nObject 17 : people at [0.005, 0.562, 0.616, 0.824].\nObject 18 : pillars at [0.211, 0.130, 0.235, 0.240].\nObject 19 : ramp at [0.179, 0.158, 0.707, 0.408].\nObject 20 : service area at [0.003, 0.416, 0.995, 0.996].\nObject 21 : stairs at [0.352, 0.676, 1.000, 0.994].\nObject 22 : sweater at [0.667, 0.634, 0.920, 0.824].\nObject 23 : top at [0.960, 0.626, 1.000, 0.668].\nObject 24 : truck at [0.781, 0.278, 0.997, 0.366].\nObject 25 : walls at [0.608, 0.000, 0.989, 0.320].\nObject 26 : wheel at [0.843, 0.338, 0.875, 0.366].\nObject 27 : woman at [0.917, 0.610, 1.000, 0.724].\n\nRelationships:\nobject 17 : people -> in -> object 20 : service area.\nobject 27 : woman -> bends over -> object 11 : luggage.\nobject 14 : man -> walks down -> object 21 : stairs.\nobject 12 : luggage -> on -> object 5 : ground.\nobject 13 : man -> carries -> object 0 : baggage.\nobject 14 : man -> wears -> object 22 : sweater.\nobject 15 : man -> wears -> object 3 : cap.\nobject 24 : truck -> in -> object 20 : service 
area.\nobject 15 : man -> wears -> object 8 : jacket.\nobject 10 : lines -> on -> object 16 : pavement.\nobject 14 : man -> walks down -> object 21 : stairs.\nobject 9 : line -> on -> object 16 : pavement.\nobject 24 : truck -> has -> object 26 : wheel.\nobject 2 : building -> has -> object 25 : walls.\nobject 15 : man -> on -> object 20 : service area.\nobject 13 : man -> holds -> object 0 : baggage.\nobject 14 : man -> walks down -> object 21 : stairs.\nobject 13 : man -> holds -> object 0 : baggage.\nobject 27 : woman -> wears -> object 7 : headband.\nobject 1 : baggage -> on -> object 20 : service area.\n\nRegion Description:\nRegion Description at [0.443, 0.528, 0.992, 0.850] : People standing in service area of airport..\nRegion Description at [0.648, 0.564, 0.960, 0.892] : Man walking down stairs of unloading ramp..\nRegion Description at [0.229, 0.698, 0.381, 0.776] : Black and red luggage sitting on ground..\nRegion Description at [0.957, 0.616, 0.997, 0.670] : Woman dressed in sleeveless black top..\nRegion Description at [0.011, 0.548, 0.211, 0.750] : Man holding his luggage and bending over.\nRegion Description at [0.893, 0.578, 0.995, 0.678] : woman with a black and white head band.\nRegion Description at [0.235, 0.684, 0.973, 0.816] : Rainbow of colors in the form of luggage.\n\nGlobal Caption:\nSome are standing outside a building with suitcases.\nA few people are getting of a plane.\nA group of people and luggage on a airport tarmac.\nSome people who are placing luggage on a runway.\nAn airport and plane unloading passengers with luggage."}
{"question_id": 3, "image": "000000184400.jpg", "category": "ground_conv", "text": "Objects:\nObject 0 : air conditioner at [0.004, 0.261, 0.018, 0.293].\nObject 1 : balcony at [0.048, 0.037, 0.100, 0.077].\nObject 2 : beam at [0.616, 0.621, 0.664, 0.824].\nObject 3 : beam at [0.490, 0.640, 0.532, 0.832].\nObject 4 : beam at [0.426, 0.640, 0.462, 0.835].\nObject 5 : bridge at [0.002, 0.608, 0.988, 0.877].\nObject 6 : bridge at [0.004, 0.453, 1.000, 0.867].\nObject 7 : building at [0.000, 0.000, 0.252, 0.469].\nObject 8 : bushes at [0.000, 0.939, 0.072, 0.997].\nObject 9 : colors at [0.194, 0.480, 0.330, 0.661].\nObject 10 : column at [0.618, 0.824, 0.676, 0.997].\nObject 11 : guard rails at [0.000, 0.496, 1.000, 0.624].\nObject 12 : light at [0.606, 0.192, 0.724, 0.243].\nObject 13 : light at [0.864, 0.947, 0.916, 1.000].\nObject 14 : metal support at [0.002, 0.603, 0.976, 0.995].\nObject 15 : pole at [0.700, 0.205, 0.724, 0.995].\nObject 16 : red line at [0.632, 0.851, 0.648, 0.995].\nObject 17 : sky at [0.250, 0.013, 1.000, 0.467].\nObject 18 : south west at [0.338, 0.616, 0.442, 0.651].\nObject 19 : street at [0.002, 0.861, 1.000, 0.997].\nObject 20 : train at [0.002, 0.408, 1.000, 0.683].\nObject 21 : window at [0.144, 0.013, 0.182, 0.064].\nObject 22 : window at [0.430, 0.485, 0.534, 0.595].\nObject 23 : window at [0.134, 0.091, 0.182, 0.155].\nObject 24 : window at [0.340, 0.504, 0.424, 0.613].\nObject 25 : window at [0.116, 0.944, 0.168, 1.000].\nObject 26 : windows at [0.762, 0.437, 0.920, 0.613].\nObject 27 : windows at [0.004, 0.000, 0.096, 0.088].\n\nRelationships:\nobject 10 : column -> supporting -> object 6 : bridge.\nobject 10 : column -> has -> object 16 : red line.\nobject 12 : light -> on -> object 15 : pole.\nobject 7 : building -> behind -> object 20 : train.\nobject 21 : window -> on -> object 7 : building.\nobject 1 : balcony -> on -> object 7 : building.\nobject 25 : window -> visible under -> object 5 : bridge.\nobject 12 : light -> on -> 
object 19 : street.\nobject 2 : beam -> of -> object 5 : bridge.\nobject 20 : train -> in -> object 9 : colors.\nobject 24 : window -> of -> object 20 : train.\nobject 22 : window -> of train -> object 20 : train.\nobject 5 : bridge -> on -> object 20 : train.\nobject 7 : building -> beside -> object 20 : train.\nobject 23 : window -> of -> object 7 : building.\nobject 12 : light -> on a -> object 15 : pole.\nobject 12 : light -> on -> object 15 : pole.\nobject 20 : train -> says -> object 18 : south west.\nobject 8 : bushes -> are in -> object 19 : street.\nobject 7 : building -> has many -> object 27 : windows.\nobject 7 : building -> has -> object 0 : air conditioner.\nobject 20 : train -> on -> object 6 : bridge.\nobject 12 : light -> in -> object 19 : street.\nobject 5 : bridge -> has -> object 11 : guard rails.\nobject 26 : windows -> on -> object 20 : train.\nobject 20 : train -> has -> object 18 : south west.\nobject 6 : bridge -> has -> object 14 : metal support.\nobject 9 : colors -> to -> object 20 : train.\n\nRegion Description:\nRegion Description at [0.602, 0.837, 0.696, 0.997] : a metal support column for the bridge.\n\nGlobal Caption:\nA train as it travels down the tracks over a bridge.\na colorful train going along an elevated track \nA train rides on a bridge past a building.\nA subway train that is passing over a train bridge.\na train on a train track on an elevated bridge"}
{"question_id": 4, "image": "000000276018.jpg", "category": "ground_conv", "text": "Objects:\nObject 0 : animal at [0.717, 0.042, 0.831, 0.152].\nObject 1 : animal at [0.114, 0.582, 0.348, 0.840].\nObject 2 : baby at [0.385, 0.034, 0.643, 0.434].\nObject 3 : baby at [0.911, 0.028, 1.000, 0.250].\nObject 4 : bear at [0.391, 0.506, 0.622, 0.714].\nObject 5 : bear at [0.695, 0.356, 0.868, 0.580].\nObject 6 : bear hand at [0.114, 0.630, 0.175, 0.660].\nObject 7 : black sock at [0.800, 0.796, 0.858, 0.834].\nObject 8 : blonde boy at [0.166, 0.170, 0.351, 0.460].\nObject 9 : boy at [0.102, 0.388, 0.498, 1.000].\nObject 10 : boy at [0.717, 0.188, 1.000, 0.864].\nObject 11 : child at [0.342, 0.390, 0.622, 1.000].\nObject 12 : coat at [0.077, 0.520, 0.495, 0.910].\nObject 13 : coat at [0.775, 0.296, 1.000, 0.616].\nObject 14 : coat at [0.397, 0.090, 0.634, 0.262].\nObject 15 : flip flops at [0.434, 0.756, 0.606, 0.910].\nObject 16 : girl at [0.372, 0.196, 0.603, 0.922].\nObject 17 : glasses at [0.191, 0.236, 0.308, 0.250].\nObject 18 : grass at [0.637, 0.652, 0.754, 0.788].\nObject 19 : hand at [0.714, 0.094, 0.788, 0.160].\nObject 20 : hands at [0.763, 0.380, 0.877, 0.430].\nObject 21 : hat at [0.757, 0.030, 0.889, 0.078].\nObject 22 : jacket at [0.357, 0.500, 0.622, 0.782].\nObject 23 : jacket at [0.422, 0.286, 0.603, 0.550].\nObject 24 : jacket at [0.163, 0.296, 0.320, 0.462].\nObject 25 : jacket at [0.911, 0.106, 1.000, 0.224].\nObject 26 : lady at [0.286, 0.000, 0.683, 0.560].\nObject 27 : man at [0.628, 0.030, 0.951, 0.742].\nObject 28 : shirt at [0.831, 0.306, 0.957, 0.404].\nObject 29 : shirt at [0.197, 0.296, 0.298, 0.370].\nObject 30 : shoe at [0.717, 0.804, 0.871, 0.864].\nObject 31 : sidewalk at [0.628, 0.574, 0.769, 0.632].\nObject 32 : stuffed animal at [0.286, 0.298, 0.517, 0.422].\n\nRelationships:\nobject 10 : boy -> wearing -> object 28 : shirt.\nobject 3 : baby -> wearing -> object 25 : jacket.\nobject 22 : jacket -> carrying -> object 4 : bear.\nobject 8 
: blonde boy -> wears -> object 17 : glasses.\nobject 8 : blonde boy -> wears -> object 24 : jacket.\nobject 11 : child -> holding up -> object 32 : stuffed animal.\nobject 10 : boy -> holding up -> object 5 : bear.\nobject 30 : shoe -> with a -> object 7 : black sock.\nobject 10 : boy -> wearing -> object 7 : black sock.\nobject 26 : lady -> holding -> object 2 : baby.\nobject 16 : girl -> wearing -> object 15 : flip flops.\nobject 9 : boy -> wearing -> object 12 : coat.\nobject 10 : boy -> wearing a -> object 13 : coat.\nobject 4 : bear -> on -> object 20 : hands.\nobject 26 : lady -> carrying -> object 2 : baby.\nobject 0 : animal -> in -> object 19 : hand.\n\nRegion Description:\nRegion Description at [0.905, 0.020, 0.997, 0.272] : blonde haired baby wearing yellow jacket.\nRegion Description at [0.357, 0.388, 0.640, 0.730] : girl in blue jacket carrying blue dog.\nRegion Description at [0.071, 0.378, 0.498, 0.842] : boy in black jacket holding stuffed dog.\nRegion Description at [0.055, 0.572, 0.375, 0.846] : brown stuffed dog with red and white collar.\nRegion Description at [0.283, 0.194, 0.603, 0.400] : girl in pink jacket holding white stuffed animal.\nRegion Description at [0.695, 0.356, 0.874, 0.576] : White stuffed animal wearing a red jacket..\nRegion Description at [0.332, 0.394, 0.618, 0.992] : Little girl holding a grey stuffed dog..\nRegion Description at [0.372, 0.476, 0.723, 0.786] : little girl holding blue and white stuffed animal.\nRegion Description at [0.062, 0.556, 0.422, 0.840] : little boy holding brown and white stuffed animal.\n\nGlobal Caption:\na bunch of kids walking through some grass\nA group of children are holding various stuffed animals and dolls.\nKids walking while holding their stuffed animals. \nA group of kids holding teddy bears and looking happy.\nA group of children carrying stuffed animals walks across the grass. "}
{"question_id": 5, "image": "000000356424.jpg", "category": "ground_conv", "text": "Objects:\nObject 0 : bottle at [0.048, 0.712, 0.195, 1.002].\nObject 1 : chair at [0.696, 0.500, 1.003, 0.718].\nObject 2 : cork at [0.053, 0.712, 0.139, 0.776].\nObject 3 : cup at [0.043, 0.736, 0.240, 0.916].\nObject 4 : dish at [0.416, 0.726, 0.856, 0.904].\nObject 5 : fruit at [0.629, 0.834, 0.675, 0.880].\nObject 6 : glass at [0.275, 0.716, 0.501, 0.998].\nObject 7 : glasses at [0.179, 0.242, 0.464, 0.322].\nObject 8 : hair at [0.536, 0.258, 0.656, 0.320].\nObject 9 : man at [0.075, 0.102, 0.704, 0.716].\nObject 10 : rasberries at [0.499, 0.750, 0.544, 0.786].\nObject 11 : raspberries at [0.664, 0.828, 0.741, 0.864].\nObject 12 : sauce at [0.565, 0.752, 0.715, 0.824].\nObject 13 : shirt at [0.600, 0.350, 0.645, 0.494].\nObject 14 : shirt at [0.635, 0.282, 0.997, 0.654].\nObject 15 : sign at [0.419, 0.134, 0.509, 0.184].\nObject 16 : sweater at [0.072, 0.288, 0.704, 0.718].\nObject 17 : table at [0.000, 0.592, 0.997, 1.000].\nObject 18 : window at [0.328, 0.000, 0.600, 0.298].\nObject 19 : woman at [0.531, 0.258, 0.768, 0.688].\n\nRelationships:\nobject 9 : man -> wearing -> object 7 : glasses.\nobject 0 : bottle -> on -> object 17 : table.\nobject 6 : glass -> on -> object 17 : table.\nobject 11 : raspberries -> on -> object 4 : dish.\nobject 9 : man -> wearing -> object 7 : glasses.\n\nRegion Description:\nRegion Description at [0.640, 0.180, 0.989, 0.530] : Man wearing a black and orange stripe shirt.\nRegion Description at [0.413, 0.136, 0.512, 0.184] : Yellow closed sign with brown letters.\nRegion Description at [0.629, 0.186, 0.995, 0.706] : a man wearing and orange and black striped shirt.\nRegion Description at [0.528, 0.254, 0.717, 0.666] : a woman with a ponytail eating lunch.\nRegion Description at [0.152, 0.238, 0.459, 0.322] : a pair of black wire rimmed eye glasses.\nRegion Description at [0.029, 0.716, 0.243, 0.922] : empty cup that used to contain 
coffee.\nRegion Description at [0.264, 0.708, 0.867, 0.994] : A plate of food with a glass of water.\n\nGlobal Caption:\nA man sitting in front of a plate of food.\nA man at a wooden table looking at a plate of food.\na man smiling while looking at his plate of food\nA man sitting at a table with a plate filled with food.\nA man looking happily at some dish in front of him."}
{"question_id": 6, "image": "000000458755.jpg", "category": "ground_conv", "text": "Objects:\nObject 0 : edge at [0.498, 0.296, 0.530, 0.419].\nObject 1 : feet at [0.914, 0.235, 1.000, 0.291].\nObject 2 : floor at [0.000, 0.003, 0.994, 1.000].\nObject 3 : girl at [0.112, 0.091, 0.868, 0.992].\nObject 4 : grass at [0.000, 0.005, 0.998, 0.995].\nObject 5 : ground at [0.000, 0.005, 0.992, 0.992].\nObject 6 : hair at [0.542, 0.096, 0.910, 0.624].\nObject 7 : hand at [0.418, 0.373, 0.548, 0.592].\nObject 8 : jeans at [0.500, 0.216, 0.584, 0.400].\nObject 9 : sheep at [0.000, 0.003, 0.704, 0.320].\nObject 10 : shirt at [0.426, 0.504, 0.900, 0.992].\nObject 11 : shoe at [0.472, 0.379, 0.558, 0.453].\nObject 12 : sneakers at [0.904, 0.141, 0.968, 0.187].\nObject 13 : someon at [0.884, 0.019, 0.994, 0.192].\nObject 14 : strap at [0.744, 0.600, 0.872, 0.715].\nObject 15 : strip at [0.512, 0.520, 0.548, 0.589].\nObject 16 : sweater at [0.500, 0.555, 0.818, 0.997].\nObject 17 : tie at [0.532, 0.515, 0.546, 0.579].\nObject 18 : wool at [0.016, 0.171, 0.114, 0.411].\n\nRelationships:\nobject 7 : hand -> on -> object 9 : sheep.\nobject 3 : girl -> with -> object 10 : shirt.\nobject 6 : hair -> on -> object 3 : girl.\nobject 3 : girl -> has -> object 6 : hair.\nobject 6 : hair -> on -> object 3 : girl.\n\nRegion Description:\nRegion Description at [0.000, 0.027, 0.530, 0.744] : a sheep that has been recently shorn.\nRegion Description at [0.116, 0.032, 0.924, 0.992] : girl in front has gray sweater hanging over her left shoulder.\nRegion Description at [0.506, 0.045, 0.912, 0.845] : girl in front is facing away from the camera.\nRegion Description at [0.120, 0.053, 0.890, 0.989] : girl in front wears a gray and white striped T-shirt.\nRegion Description at [0.300, 0.005, 0.624, 0.451] : someone in jeans and brown shoes stands behind the sheep.\nRegion Description at [0.880, 0.003, 0.992, 0.949] : several people only visible from the feet.\n\nGlobal Caption:\nYoung woman with sheep 
on straw covered floor.\nA child places his hands on the head and neck of a sheep while another sheep looks at his face.\nA person petting the head of a cute fluffy sheep.\nA child is petting a sheep while another sheep watches.\nA woman kneeling to pet animals while others wait. "}
{"question_id": 7, "image": "000000069138.jpg", "category": "ground_conv", "text": "Objects:\nObject 0 : arrows at [0.000, 0.616, 0.214, 0.644].\nObject 1 : awning at [0.159, 0.260, 0.293, 0.336].\nObject 2 : building at [0.000, 0.000, 1.000, 0.466].\nObject 3 : bushes at [0.693, 0.342, 1.000, 0.512].\nObject 4 : door at [0.110, 0.370, 0.266, 0.518].\nObject 5 : face at [0.390, 0.256, 0.614, 0.392].\nObject 6 : greenery at [0.824, 0.154, 0.997, 0.384].\nObject 7 : hitch at [0.221, 0.520, 0.259, 0.542].\nObject 8 : ladder at [0.110, 0.342, 0.283, 0.364].\nObject 9 : license plate at [0.141, 0.460, 0.234, 0.500].\nObject 10 : line at [0.017, 0.700, 0.266, 0.756].\nObject 11 : picture at [0.155, 0.378, 0.259, 0.442].\nObject 12 : plant barrier at [0.672, 0.482, 1.000, 0.606].\nObject 13 : planter at [0.676, 0.152, 1.000, 0.510].\nObject 14 : pole at [0.328, 0.068, 0.483, 0.994].\nObject 15 : road at [0.000, 0.490, 1.000, 1.000].\nObject 16 : roof at [0.117, 0.360, 0.283, 0.382].\nObject 17 : sad face at [0.383, 0.244, 0.614, 0.384].\nObject 18 : short term at [0.624, 0.040, 0.769, 0.080].\nObject 19 : sidewalk at [0.666, 0.572, 0.993, 0.618].\nObject 20 : sign at [0.621, 0.082, 0.772, 0.132].\nObject 21 : sign at [0.007, 0.144, 0.069, 0.204].\nObject 22 : signal at [0.266, 0.210, 0.679, 0.848].\nObject 23 : stop light at [0.366, 0.236, 0.638, 0.394].\nObject 24 : tail light at [0.100, 0.446, 0.121, 0.472].\nObject 25 : van at [0.076, 0.326, 0.297, 0.556].\nObject 26 : wall at [0.676, 0.500, 0.997, 0.604].\nObject 27 : window at [0.903, 0.000, 1.000, 0.086].\n\nRelationships:\nobject 23 : stop light -> with -> object 17 : sad face.\nobject 0 : arrows -> on -> object 15 : road.\nobject 12 : plant barrier -> beside -> object 15 : road.\nobject 11 : picture -> on -> object 4 : door.\nobject 10 : line -> painted in -> object 15 : road.\nobject 19 : sidewalk -> next to -> object 15 : road.\nobject 2 : building -> for -> object 18 : short term.\nobject 23 : stop light -> 
making -> object 5 : face.\nobject 3 : bushes -> just above -> object 26 : wall.\nobject 22 : signal -> on -> object 14 : pole.\nobject 25 : van -> has -> object 16 : roof.\nobject 25 : van -> has -> object 8 : ladder.\nobject 8 : ladder -> on -> object 16 : roof.\nobject 13 : planter -> by -> object 15 : road.\nobject 23 : stop light -> on -> object 22 : signal.\n\nRegion Description:\nRegion Description at [0.331, 0.852, 0.472, 0.996] : Pole holding traffic light on street.\nRegion Description at [0.600, 0.036, 0.793, 0.084] : Building offers short term office space.\nRegion Description at [0.603, 0.074, 0.776, 0.120] : Office space as small as 2,500 sq. ft. available.\nRegion Description at [0.003, 0.008, 0.972, 0.356] : an office building is in the background.\n\nGlobal Caption:\nA red traffic light with a sad face drawn over it.\nA street scene with a close of of a stop light.\nA red stoplight with a street in the background.\nA stop sign gives traffic a frown face.\nThe sign is now at a red light."}
{"question_id": 8, "image": "000000003156.jpg", "category": "ground_conv", "text": "Objects:\nObject 0 : back splash at [0.000, 0.000, 0.145, 0.278].\nObject 1 : blemish at [0.564, 0.178, 0.572, 0.184].\nObject 2 : blemish at [0.517, 0.214, 0.523, 0.222].\nObject 3 : checkered tile at [0.751, 0.226, 1.000, 0.422].\nObject 4 : checkered tile at [0.000, 0.000, 0.145, 0.282].\nObject 5 : cloth at [0.384, 0.878, 1.000, 1.002].\nObject 6 : face at [0.488, 0.106, 0.671, 0.270].\nObject 7 : faucet at [0.000, 0.170, 0.078, 0.268].\nObject 8 : floor at [0.000, 0.796, 1.000, 1.004].\nObject 9 : flooring at [0.000, 0.788, 1.000, 1.002].\nObject 10 : glove at [0.647, 0.622, 0.815, 0.780].\nObject 11 : hand at [0.633, 0.606, 0.815, 0.784].\nObject 12 : man at [0.000, 0.024, 0.835, 1.002].\nObject 13 : overalls at [0.000, 0.576, 0.702, 0.962].\nObject 14 : part at [0.043, 0.274, 0.133, 0.370].\nObject 15 : pipes at [0.000, 0.354, 0.046, 0.472].\nObject 16 : poster at [0.749, 0.226, 0.997, 0.426].\nObject 17 : seat at [0.581, 0.582, 1.000, 0.716].\nObject 18 : sill at [0.792, 0.032, 1.000, 0.094].\nObject 19 : sink at [0.000, 0.240, 0.136, 0.376].\nObject 20 : sock at [0.217, 0.856, 0.251, 0.892].\nObject 21 : tarp at [0.358, 0.868, 1.000, 1.004].\nObject 22 : tile at [0.749, 0.230, 1.000, 0.420].\nObject 23 : toilet at [0.564, 0.574, 1.000, 0.974].\nObject 24 : towel at [0.000, 0.872, 1.000, 1.002].\nObject 25 : wall at [0.000, 0.000, 1.000, 0.870].\nObject 26 : window at [0.777, 0.000, 1.000, 0.080].\n\nRelationships:\nobject 26 : window -> above -> object 23 : toilet.\nobject 21 : tarp -> to protect -> object 8 : floor.\nobject 14 : part -> of a bathroom -> object 19 : sink.\nobject 4 : checkered tile -> on bathroom -> object 25 : wall.\nobject 1 : blemish -> on -> object 6 : face.\nobject 2 : blemish -> on -> object 6 : face.\nobject 1 : blemish -> on -> object 6 : face.\nobject 6 : face -> on -> object 12 : man.\nobject 10 : glove -> on -> object 11 : hand.\n\nRegion 
Description:\nRegion Description at [0.685, 0.508, 0.879, 0.774] : the man is wearing gloves on his hands.\nRegion Description at [0.685, 0.638, 0.815, 0.764] : rubber glove on the man's right hand.\nRegion Description at [0.220, 0.860, 0.251, 0.886] : black and white design on man's sock.\nRegion Description at [0.000, 0.052, 0.124, 0.158] : black and white back splash for bathroom sink.\n\nGlobal Caption:\nA young man bending next to a toilet.\nA man is kneeling and holding on to a toilet.\nA man attempting to lift up a toilet off the floor.\nA man fixing a toilet in a black and white photo.\nA man wears gloves as he installs a toilet."}
{"question_id": 9, "image": "000000131138.jpg", "category": "ground_conv", "text": "Objects:\nObject 0 : computer mouse at [0.414, 0.753, 0.470, 0.811].\nObject 1 : cup at [0.350, 0.783, 0.417, 0.906].\nObject 2 : desk at [0.000, 0.488, 0.998, 0.999].\nObject 3 : fork at [0.203, 0.794, 0.270, 0.857].\nObject 4 : glass at [0.277, 0.703, 0.345, 0.816].\nObject 5 : head phones at [0.872, 0.556, 0.993, 0.634].\nObject 6 : keyboard at [0.415, 0.620, 0.650, 0.783].\nObject 7 : lamp at [0.000, 0.302, 0.214, 0.430].\nObject 8 : laptop at [0.491, 0.296, 0.703, 0.540].\nObject 9 : picture at [0.795, 0.204, 0.898, 0.358].\nObject 10 : plant at [0.192, 0.201, 0.391, 0.461].\nObject 11 : plate at [0.183, 0.799, 0.326, 0.896].\nObject 12 : screen at [0.237, 0.249, 0.504, 0.628].\nObject 13 : stand at [0.506, 0.531, 0.663, 0.617].\nObject 14 : window at [0.606, 0.000, 1.000, 0.346].\n\nRelationships:\nobject 0 : computer mouse -> on -> object 2 : desk.\nobject 8 : laptop -> on -> object 13 : stand.\nobject 6 : keyboard -> on -> object 2 : desk.\nobject 9 : picture -> near -> object 14 : window.\nobject 3 : fork -> on -> object 11 : plate.\n\nRegion Description:\n\nGlobal Caption:\na desk with a cup plate laptop monitor and keyboard\nA laptop sitting next to a monitor, keyboard and a mouse.\nA laptop and a desktop monitor are displayed on top of the desk.\nLarge office desk with computers near a window.\nA desk with a laptop, second monitor and keyboard."}
{"question_id": 10, "image": "000000259097.jpg", "category": "ground_conv", "text": "Objects:\nObject 0 : abs at [0.392, 0.628, 0.426, 0.664].\nObject 1 : arm at [0.416, 0.459, 0.432, 0.565].\nObject 2 : buildings at [0.242, 0.532, 0.640, 0.580].\nObject 3 : frisbee at [0.400, 0.354, 0.446, 0.381].\nObject 4 : grass at [0.000, 0.610, 0.998, 0.997].\nObject 5 : hand at [0.418, 0.423, 0.438, 0.474].\nObject 6 : legs at [0.420, 0.703, 0.456, 0.811].\nObject 7 : man at [0.390, 0.432, 0.466, 0.793].\nObject 8 : pants at [0.390, 0.658, 0.424, 0.763].\nObject 9 : shadow at [0.492, 0.724, 0.622, 0.994].\nObject 10 : shirt at [0.402, 0.468, 0.458, 0.649].\nObject 11 : sky at [0.002, 0.003, 0.996, 0.556].\nObject 12 : trees at [0.002, 0.498, 0.998, 0.646].\n\nRelationships:\nobject 7 : man -> tossing -> object 3 : frisbee.\nobject 7 : man -> has -> object 6 : legs.\nobject 7 : man -> playing -> object 3 : frisbee.\nobject 2 : buildings -> near -> object 12 : trees.\nobject 7 : man -> wearing -> object 10 : shirt.\nobject 7 : man -> wearing -> object 8 : pants.\nobject 7 : man -> catching -> object 3 : frisbee.\nobject 7 : man -> has -> object 5 : hand.\nobject 3 : frisbee -> in -> object 11 : sky.\nobject 7 : man -> wearing -> object 10 : shirt.\nobject 7 : man -> catching -> object 3 : frisbee.\nobject 7 : man -> wearing -> object 8 : pants.\nobject 9 : shadow -> in -> object 4 : grass.\nobject 7 : man -> jumping -> object 4 : grass.\nobject 2 : buildings -> behind -> object 4 : grass.\nobject 7 : man -> catching -> object 3 : frisbee.\nobject 7 : man -> catching -> object 3 : frisbee.\nobject 2 : buildings -> near -> object 12 : trees.\nobject 7 : man -> extending -> object 1 : arm.\nobject 9 : shadow -> in -> object 4 : grass.\nobject 7 : man -> exposing -> object 0 : abs.\nobject 7 : man -> catching -> object 3 : frisbee.\nobject 3 : frisbee -> in -> object 11 : sky.\nobject 9 : shadow -> in -> object 4 : grass.\nobject 2 : buildings -> near -> object 12 : trees.\nobject 
1 : arm -> reaching for -> object 3 : frisbee.\n\nRegion Description:\nRegion Description at [0.394, 0.658, 0.480, 0.826] : A person wearing black color trouser.\nRegion Description at [0.394, 0.435, 0.460, 0.796] : man in a red sweatshirt and jeans jumping.\nRegion Description at [0.390, 0.357, 0.464, 0.823] : man catching a frisbee in a wheat field.\nRegion Description at [0.012, 0.520, 0.996, 0.631] : trees and a village on a hill in the distance.\nRegion Description at [0.390, 0.423, 0.464, 0.649] : arm straight up and arm bent at elbow.\n\nGlobal Caption:\nA person trying to reach a Frisbee in a field with high brown grass.\nA young boy in a red top is playing with a red object tossed in the sky.\nA young man in a red jacket jumping for a Frizbee in a field.\nA guy is jumping to catch a frisbee in tall grass.\nA man jumps to catch a Frisbee flying through the air."}
{"question_id": 11, "image": "000000377882.jpg", "category": "ground_conv", "text": "Objects:\nObject 0 : blue sky at [0.000, 0.000, 0.998, 0.317].\nObject 1 : boat at [0.000, 0.461, 0.354, 0.579].\nObject 2 : boat at [0.348, 0.501, 0.874, 0.789].\nObject 3 : boat at [0.302, 0.461, 0.684, 0.611].\nObject 4 : buildings at [0.692, 0.195, 0.718, 0.248].\nObject 5 : buildings at [0.888, 0.173, 0.922, 0.227].\nObject 6 : buildings at [0.582, 0.211, 0.610, 0.256].\nObject 7 : buildings at [0.180, 0.259, 0.202, 0.293].\nObject 8 : buildings at [0.466, 0.208, 0.518, 0.272].\nObject 9 : chain-link fence at [0.002, 0.176, 0.996, 0.995].\nObject 10 : cord at [0.412, 0.587, 0.626, 1.000].\nObject 11 : fence pole at [0.230, 0.227, 0.336, 1.000].\nObject 12 : grass at [0.000, 0.667, 0.756, 0.997].\nObject 13 : horizon at [0.000, 0.187, 1.000, 0.336].\nObject 14 : mast at [0.570, 0.000, 0.722, 0.571].\nObject 15 : rack at [0.754, 0.168, 1.000, 0.901].\nObject 16 : sail post at [0.586, 0.000, 0.628, 0.568].\nObject 17 : section at [0.272, 0.179, 0.994, 0.992].\nObject 18 : shelf at [0.762, 0.355, 1.000, 0.387].\nObject 19 : sky line at [0.012, 0.173, 0.994, 0.195].\nObject 20 : surfboard at [0.830, 0.448, 0.996, 0.552].\nObject 21 : surfboard at [0.420, 0.384, 0.502, 0.411].\nObject 22 : surfboard at [0.910, 0.768, 0.998, 0.877].\nObject 23 : surfboard at [0.430, 0.344, 0.508, 0.371].\nObject 24 : surfboard at [0.830, 0.565, 1.000, 0.712].\nObject 25 : surfboard at [0.322, 0.307, 0.450, 0.341].\nObject 26 : surfboard at [0.766, 0.251, 0.998, 0.368].\nObject 27 : surfboard at [0.764, 0.704, 0.998, 0.829].\nObject 28 : water at [0.000, 0.259, 1.000, 0.469].\nObject 29 : water way at [0.008, 0.272, 0.996, 0.432].\n\nRelationships:\nobject 25 : surfboard -> stacked on -> object 18 : shelf.\nobject 24 : surfboard -> stacked on -> object 18 : shelf.\nobject 20 : surfboard -> stacked on -> object 18 : shelf.\nobject 26 : surfboard -> stacked on -> object 18 : shelf.\nobject 15 : rack -> 
of -> object 20 : surfboard.\nobject 8 : buildings -> on -> object 13 : horizon.\nobject 6 : buildings -> on -> object 13 : horizon.\nobject 4 : buildings -> on -> object 13 : horizon.\nobject 7 : buildings -> on -> object 13 : horizon.\nobject 5 : buildings -> on -> object 13 : horizon.\nobject 14 : mast -> on -> object 2 : boat.\nobject 9 : chain-link fence -> near -> object 29 : water way.\nobject 17 : section -> of -> object 9 : chain-link fence.\n\nRegion Description:\nRegion Description at [0.020, 0.187, 0.972, 0.963] : boats and surfboards behind wire fencing.\nRegion Description at [0.000, 0.160, 0.990, 0.349] : trees and buildings on other side of water.\nRegion Description at [0.340, 0.493, 0.852, 0.613] : white covering pulled over top of boat.\nRegion Description at [0.010, 0.667, 0.516, 0.995] : green bushes beside the chain link fence.\nRegion Description at [0.018, 0.213, 0.992, 0.995] : Black chain link fence enclosing boats..\nRegion Description at [0.242, 0.211, 0.302, 0.989] : Black fence pole holding chain link fence..\nRegion Description at [0.374, 0.499, 0.804, 0.803] : Yellow and white boat with sail pole..\nRegion Description at [0.014, 0.181, 0.998, 0.296] : Skyline of gray buildings in the background..\nRegion Description at [0.000, 0.664, 0.994, 0.976] : Green shrubs growing along side of a lake..\nRegion Description at [0.774, 0.216, 0.996, 0.944] : Boat parts on an outdoor shelving unit..\nRegion Description at [0.006, 0.013, 0.150, 0.285] : Sail masks with no flag attached to them..\n\nGlobal Caption:\nBoats docked on land sitting side by side next to a lake.\nA small harbor with boats docked and on racks\nA collection of boats behind a fence by a body of water.\nBoats and surfboards docked at a harbor bay.\n\nMany boats as seen through a chain link fence."}
{"question_id": 12, "image": "000000484415.jpg", "category": "ground_conv", "text": "Objects:\nObject 0 : arm at [0.000, 0.125, 0.609, 0.988].\nObject 1 : bathroom tile at [0.009, 0.008, 0.994, 0.446].\nObject 2 : blue jeans at [0.369, 0.558, 0.722, 0.979].\nObject 3 : brush at [0.681, 0.208, 0.878, 0.500].\nObject 4 : brush holder at [0.716, 0.279, 0.891, 0.554].\nObject 5 : button at [0.519, 0.113, 0.584, 0.171].\nObject 6 : flusher at [0.534, 0.092, 0.628, 0.300].\nObject 7 : hand at [0.281, 0.125, 0.603, 0.562].\nObject 8 : holder at [0.713, 0.283, 0.903, 0.558].\nObject 9 : lid at [0.028, 0.046, 0.694, 0.446].\nObject 10 : man at [0.000, 0.133, 0.600, 0.992].\nObject 11 : seat at [0.138, 0.583, 0.722, 0.992].\nObject 12 : tank at [0.019, 0.021, 0.706, 0.579].\nObject 13 : tile at [0.794, 0.000, 1.000, 0.200].\nObject 14 : tile at [0.000, 0.000, 0.278, 0.129].\nObject 15 : toilet at [0.016, 0.042, 0.719, 0.996].\nObject 16 : toilet scrubber at [0.744, 0.192, 0.844, 0.521].\nObject 17 : toilet seat at [0.103, 0.517, 0.728, 0.996].\nObject 18 : wall at [0.659, 0.000, 0.978, 0.392].\nObject 19 : water at [0.369, 0.738, 0.500, 0.921].\n\nRelationships:\nobject 15 : toilet -> has -> object 11 : seat.\nobject 4 : brush holder -> by -> object 15 : toilet.\nobject 19 : water -> in -> object 15 : toilet.\nobject 6 : flusher -> on -> object 15 : toilet.\nobject 9 : lid -> on -> object 15 : toilet.\nobject 10 : man -> by -> object 15 : toilet.\nobject 10 : man -> by -> object 15 : toilet.\nobject 10 : man -> has -> object 7 : hand.\nobject 0 : arm -> on -> object 15 : toilet.\nobject 14 : tile -> on -> object 18 : wall.\n\nRegion Description:\nRegion Description at [0.000, 0.046, 0.716, 0.987] : the arm reaching for the white toilet bowl.\nRegion Description at [0.716, 0.192, 0.894, 0.550] : the container and the toilet brush cleaner.\nRegion Description at [0.009, 0.042, 0.894, 0.992] : the toilet bowl next to the toilet bowl cleaner.\nRegion Description at [0.534, 
0.087, 0.666, 0.329] : The hand is on the flusher in the image .\nRegion Description at [0.053, 0.158, 0.903, 0.875] : Porcelain toilet with flusher on top of the lid .\nRegion Description at [0.094, 0.154, 0.856, 0.942] : Man flushing the toilet in the bathroom .\n\nGlobal Caption:\nA hand is reaching out to the top if a toilet. \nA person flushing a toilet with a motion sensor.\nA person's hand flushing a toilet with a button on top of the tank. \na persons hand reaching for the top of a toilet\nA hand is reaching over a white toilet."}
{"question_id": 13, "image": "000000184384.jpg", "category": "ground_conv", "text": "Objects:\nObject 0 : blueberry at [0.306, 0.312, 0.400, 0.429].\nObject 1 : butter at [0.454, 0.024, 0.638, 0.288].\nObject 2 : cake at [0.238, 0.093, 0.786, 0.787].\nObject 3 : cup at [0.002, 0.000, 0.202, 0.667].\nObject 4 : cup at [0.140, 0.008, 0.336, 0.456].\nObject 5 : egg at [0.636, 0.125, 0.880, 0.267].\nObject 6 : food at [0.632, 0.123, 0.996, 0.336].\nObject 7 : lemon at [0.514, 0.728, 0.798, 0.997].\nObject 8 : melon at [0.308, 0.768, 0.658, 0.997].\nObject 9 : orange at [0.514, 0.733, 0.794, 0.997].\nObject 10 : parsley at [0.372, 0.515, 0.762, 0.965].\nObject 11 : plate at [0.166, 0.453, 1.000, 1.000].\nObject 12 : plate at [0.628, 0.120, 0.998, 0.389].\nObject 13 : sausage at [0.766, 0.248, 0.984, 0.333].\nObject 14 : spot at [0.766, 0.600, 0.790, 0.637].\nObject 15 : table at [0.002, 0.365, 0.998, 0.997].\nObject 16 : water at [0.000, 0.000, 0.202, 0.667].\n\nRelationships:\nobject 7 : lemon -> on -> object 11 : plate.\nobject 10 : parsley -> on -> object 11 : plate.\nobject 6 : food -> on -> object 12 : plate.\nobject 1 : butter -> on -> object 2 : cake.\nobject 11 : plate -> has -> object 14 : spot.\nobject 1 : butter -> on -> object 2 : cake.\nobject 9 : orange -> on -> object 11 : plate.\nobject 13 : sausage -> on -> object 12 : plate.\nobject 0 : blueberry -> on -> object 2 : cake.\nobject 5 : egg -> on -> object 12 : plate.\nobject 8 : melon -> on -> object 11 : plate.\nobject 1 : butter -> on -> object 2 : cake.\nobject 9 : orange -> on -> object 11 : plate.\nobject 2 : cake -> on -> object 11 : plate.\nobject 16 : water -> in -> object 3 : cup.\nobject 13 : sausage -> on -> object 12 : plate.\n\nRegion Description:\nRegion Description at [0.678, 0.104, 0.942, 0.424] : There is food on the plate in the back.\nRegion Description at [0.456, 0.013, 0.636, 0.307] : White frosting on top of a piece of cake.\nRegion Description at [0.322, 0.752, 0.650, 0.997] : 
square of honey dew on a white plate.\n\nGlobal Caption:\nA bluebery cake is on a plate and is topped with butter.\nA piece of cake with butter on it sits next to an orange slice. \nA large piece of blueberry cake on a plate.\nA plate of food attractively arranged on a table.\nA plate of blueberry coffee cake with butter and an orange slice on a table with breakfast foods."}
{"question_id": 14, "image": "000000341058.jpg", "category": "ground_conv", "text": "Objects:\nObject 0 : napkins at [0.541, 0.818, 0.601, 0.858].\nObject 1 : pepper at [0.598, 0.836, 0.623, 0.860].\nObject 2 : post at [0.673, 0.494, 0.712, 0.926].\nObject 3 : restaurant sign at [0.548, 0.180, 0.779, 0.344].\nObject 4 : salt at [0.619, 0.838, 0.633, 0.850].\nObject 5 : shaker at [0.594, 0.822, 0.619, 0.854].\nObject 6 : shaker at [0.612, 0.824, 0.637, 0.854].\nObject 7 : table at [0.448, 0.834, 0.925, 0.998].\n\nRelationships:\nobject 4 : salt -> in -> object 6 : shaker.\nobject 0 : napkins -> on -> object 7 : table.\nobject 3 : restaurant sign -> on -> object 2 : post.\n\nRegion Description:\n\nGlobal Caption:\nThis is an empty table at a restaurant with ships in the background.\nThis table is covered by a blue Sam Adams umbrella\nAdvertising sign above a patio umbrella on sunny day.\nA lamp post stands next to an umbrella and table.\nAn umbrella is opened over an outdoor table."}
{"question_id": 15, "image": "000000349184.jpg", "category": "ground_conv", "text": "Objects:\nObject 0 : arm rest at [0.674, 0.486, 0.722, 0.560].\nObject 1 : bench at [0.000, 0.324, 0.731, 0.994].\nObject 2 : bricks at [0.075, 0.850, 0.180, 0.882].\nObject 3 : building at [0.090, 0.000, 0.686, 0.094].\nObject 4 : children at [0.470, 0.302, 0.539, 0.360].\nObject 5 : coat at [0.473, 0.322, 0.542, 0.364].\nObject 6 : daytime at [0.000, 0.002, 0.997, 1.000].\nObject 7 : fence at [0.719, 0.310, 0.997, 0.372].\nObject 8 : grass at [0.000, 0.364, 0.997, 0.720].\nObject 9 : jacket at [0.012, 0.424, 0.485, 0.690].\nObject 10 : jeans at [0.165, 0.748, 0.293, 0.844].\nObject 11 : leg at [0.168, 0.750, 0.308, 0.844].\nObject 12 : people at [0.386, 0.438, 0.449, 0.504].\nObject 13 : purse at [0.458, 0.488, 0.605, 0.694].\nObject 14 : shoe at [0.192, 0.836, 0.305, 0.890].\nObject 15 : strap at [0.677, 0.470, 0.814, 0.584].\nObject 16 : trees at [0.554, 0.000, 0.997, 0.376].\nObject 17 : woman at [0.009, 0.194, 0.497, 0.888].\n\nRelationships:\nobject 17 : woman -> on -> object 1 : bench.\nobject 17 : woman -> on -> object 1 : bench.\nobject 17 : woman -> on -> object 1 : bench.\nobject 13 : purse -> has a -> object 15 : strap.\nobject 2 : bricks -> under -> object 1 : bench.\nobject 3 : building -> behind -> object 16 : trees.\nobject 2 : bricks -> under -> object 1 : bench.\nobject 9 : jacket -> on -> object 17 : woman.\nobject 12 : people -> near -> object 16 : trees.\nobject 17 : woman -> has a -> object 11 : leg.\nobject 1 : bench -> has an -> object 0 : arm rest.\nobject 15 : strap -> from -> object 13 : purse.\nobject 2 : bricks -> near -> object 1 : bench.\nobject 16 : trees -> in -> object 6 : daytime.\nobject 7 : fence -> under -> object 16 : trees.\nobject 12 : people -> in front of -> object 7 : fence.\nobject 13 : purse -> on -> object 1 : bench.\nobject 14 : shoe -> on -> object 2 : bricks.\n\nRegion Description:\nRegion Description at [0.096, 0.006, 0.662, 
0.074] : Building with brown and white facade.\nRegion Description at [0.374, 0.298, 0.542, 0.360] : two people walking in front of woman.\n\nGlobal Caption:\nA woman sitting on top of a wooden bench near a park.\nA person sits on a wooden bench facing blooming trees.\nA woman sitting on a wooden bench viewing some beautiful trees.\nAdult sitting on wooden park bench in large open space.\nA woman sits on a bench watching the park."}
{"question_id": 16, "image": "000000516143.jpg", "category": "ground_conv", "text": "Objects:\nObject 0 : advertisement at [0.654, 0.400, 0.852, 0.555].\nObject 1 : area at [0.118, 0.379, 0.862, 0.787].\nObject 2 : back wheel at [0.432, 0.677, 0.490, 0.803].\nObject 3 : background at [0.004, 0.005, 0.998, 0.496].\nObject 4 : bottom at [0.488, 0.781, 0.526, 0.843].\nObject 5 : bus at [0.100, 0.376, 0.866, 0.805].\nObject 6 : door at [0.120, 0.456, 0.178, 0.696].\nObject 7 : front wheel at [0.172, 0.643, 0.204, 0.728].\nObject 8 : houses at [0.880, 0.344, 0.998, 0.483].\nObject 9 : light pole at [0.482, 0.005, 0.532, 0.840].\nObject 10 : list at [0.218, 0.395, 0.608, 0.461].\nObject 11 : message at [0.626, 0.184, 0.822, 0.328].\nObject 12 : name at [0.288, 0.629, 0.420, 0.731].\nObject 13 : person at [0.858, 0.560, 0.888, 0.715].\nObject 14 : pole at [0.680, 0.325, 0.708, 0.941].\nObject 15 : railing at [0.854, 0.589, 1.000, 0.704].\nObject 16 : sidewalk at [0.002, 0.688, 0.998, 1.000].\nObject 17 : sign at [0.578, 0.181, 0.826, 0.341].\nObject 18 : street at [0.000, 0.587, 0.998, 0.931].\nObject 19 : structure at [0.238, 0.293, 0.398, 0.424].\nObject 20 : symbol at [0.732, 0.427, 0.786, 0.469].\nObject 21 : tail lights at [0.812, 0.653, 0.860, 0.712].\nObject 22 : window at [0.342, 0.419, 0.424, 0.619].\nObject 23 : windows at [0.516, 0.392, 0.634, 0.627].\n\nRelationships:\nobject 10 : list -> on -> object 5 : bus.\nobject 0 : advertisement -> on -> object 5 : bus.\nobject 12 : name -> on -> object 5 : bus.\nobject 6 : door -> on -> object 5 : bus.\nobject 2 : back wheel -> of -> object 5 : bus.\nobject 7 : front wheel -> of -> object 5 : bus.\nobject 17 : sign -> on -> object 16 : sidewalk.\nobject 5 : bus -> on -> object 18 : street.\nobject 1 : area -> in -> object 3 : background.\nobject 8 : houses -> near -> object 5 : bus.\nobject 10 : list -> on -> object 5 : bus.\nobject 21 : tail lights -> on -> object 5 : bus.\nobject 5 : bus -> on -> object 18 : 
street.\nobject 4 : bottom -> of -> object 9 : light pole.\nobject 13 : person -> walking by -> object 5 : bus.\nobject 2 : back wheel -> on -> object 5 : bus.\nobject 17 : sign -> on -> object 18 : street.\nobject 22 : window -> of -> object 5 : bus.\nobject 12 : name -> on -> object 5 : bus.\nobject 8 : houses -> in -> object 3 : background.\nobject 14 : pole -> holding up -> object 17 : sign.\nobject 6 : door -> to -> object 5 : bus.\nobject 19 : structure -> in -> object 3 : background.\nobject 13 : person -> walking down -> object 16 : sidewalk.\nobject 15 : railing -> along -> object 16 : sidewalk.\nobject 17 : sign -> with -> object 11 : message.\nobject 6 : door -> of -> object 5 : bus.\nobject 14 : pole -> by -> object 18 : street.\nobject 14 : pole -> by -> object 5 : bus.\nobject 1 : area -> by -> object 16 : sidewalk.\nobject 13 : person -> walking across -> object 18 : street.\nobject 17 : sign -> attached to -> object 14 : pole.\nobject 11 : message -> on -> object 17 : sign.\nobject 0 : advertisement -> on -> object 5 : bus.\nobject 12 : name -> on -> object 5 : bus.\nobject 0 : advertisement -> on -> object 5 : bus.\nobject 0 : advertisement -> on -> object 5 : bus.\nobject 23 : windows -> on -> object 5 : bus.\nobject 6 : door -> on -> object 5 : bus.\nobject 5 : bus -> on -> object 18 : street.\n\nRegion Description:\nRegion Description at [0.576, 0.163, 0.838, 0.341] : street sign that reads All directions.\nRegion Description at [0.114, 0.323, 0.164, 0.448] : yellow and red structure in background.\nRegion Description at [0.580, 0.179, 0.838, 0.333] : a sign implying zero degrees equals 360 degrees.\n\nGlobal Caption:\na green and white bus is on the street\na public transit bus on a city street\nthe signs states all directions and points up\nAn empty city bus travels down a city street.\nA green and blue bus driving down a street."}
{"question_id": 17, "image": "000000159311.jpg", "category": "ground_conv", "text": "Objects:\nObject 0 : eye at [0.566, 0.526, 0.592, 0.565].\nObject 1 : grass at [0.004, 0.808, 0.118, 0.991].\nObject 2 : grass at [0.206, 0.853, 0.356, 0.982].\nObject 3 : leg at [0.232, 0.375, 0.312, 0.805].\nObject 4 : plant at [0.500, 0.736, 0.618, 0.796].\nObject 5 : sitck at [0.746, 0.042, 0.912, 0.339].\nObject 6 : zebra at [0.000, 0.000, 0.622, 0.790].\nObject 7 : zebra at [0.002, 0.000, 0.684, 0.682].\n\nRelationships:\nobject 7 : zebra -> eating -> object 4 : plant.\nobject 6 : zebra -> standing in -> object 1 : grass.\nobject 7 : zebra -> standing in -> object 1 : grass.\nobject 7 : zebra -> grazing in -> object 1 : grass.\nobject 6 : zebra -> grazing in -> object 1 : grass.\n\nRegion Description:\nRegion Description at [0.352, 0.093, 0.602, 0.393] : thin line of hair running down the neck.\n\nGlobal Caption:\nA pair of zebra's leaning over eating grass in a field.\nTwo zebra stand near bushes and tall grass.\nTwo zebras grazing from grass next to a tree.\nTwo zebra standing next to each other on a lush green field.\nTwo zebras are feeding on the grass by themselves."}
{"question_id": 18, "image": "000000553990.jpg", "category": "ground_conv", "text": "Objects:\nObject 0 : bar at [0.444, 0.622, 0.640, 0.688].\nObject 1 : boots at [0.328, 0.339, 0.416, 0.492].\nObject 2 : bridal at [0.474, 0.246, 0.678, 0.432].\nObject 3 : food at [0.416, 0.646, 0.466, 0.715].\nObject 4 : foot at [0.324, 0.402, 0.380, 0.492].\nObject 5 : girl at [0.320, 0.078, 0.552, 0.502].\nObject 6 : grass at [0.012, 0.694, 0.998, 0.994].\nObject 7 : ground at [0.004, 0.679, 0.996, 0.913].\nObject 8 : helmet at [0.484, 0.096, 0.560, 0.162].\nObject 9 : hoof at [0.120, 0.853, 0.170, 0.925].\nObject 10 : horse at [0.024, 0.210, 0.690, 0.949].\nObject 11 : legs at [0.478, 0.453, 0.598, 0.637].\nObject 12 : legs at [0.130, 0.583, 0.278, 0.925].\nObject 13 : mane at [0.484, 0.186, 0.648, 0.279].\nObject 14 : person at [0.568, 0.568, 0.604, 0.640].\nObject 15 : poles at [0.460, 0.814, 0.538, 0.955].\nObject 16 : shirt at [0.580, 0.586, 0.594, 0.622].\nObject 17 : shirt at [0.388, 0.150, 0.508, 0.279].\nObject 18 : tail at [0.044, 0.357, 0.222, 0.784].\nObject 19 : tree at [0.720, 0.057, 0.874, 0.568].\nObject 20 : tree at [0.220, 0.000, 0.456, 0.586].\nObject 21 : trees at [0.730, 0.003, 0.986, 0.628].\nObject 22 : wall at [0.188, 0.276, 0.254, 0.393].\nObject 23 : water at [0.028, 0.468, 0.134, 0.574].\n\nRelationships:\nobject 5 : girl -> has -> object 1 : boots.\nobject 6 : grass -> under -> object 10 : horse.\nobject 21 : trees -> behind -> object 10 : horse.\nobject 10 : horse -> jumping -> object 15 : poles.\nobject 11 : legs -> on -> object 10 : horse.\nobject 12 : legs -> on -> object 10 : horse.\nobject 12 : legs -> on -> object 10 : horse.\nobject 14 : person -> in -> object 16 : shirt.\nobject 10 : horse -> has -> object 9 : hoof.\n\nRegion Description:\n\nGlobal Caption:\nA young person ridding a horse jumps a gate in a competition.\nA man riding on a horse as it jumps over a pole. 
\nA woman is riding a horse as it jumps over a bar.\nthere is a woman jockey riding a hose over the hurdle\nA woman riding a horse jumps over an obstacle."}
{"question_id": 19, "image": "000000273493.jpg", "category": "ground_conv", "text": "Objects:\nObject 0 : ball at [0.640, 0.399, 0.648, 0.411].\nObject 1 : border at [0.040, 0.502, 1.000, 0.556].\nObject 2 : boundary lines at [0.030, 0.661, 1.000, 1.000].\nObject 3 : bushes at [0.020, 0.186, 0.104, 0.517].\nObject 4 : fence at [0.008, 0.366, 0.994, 0.565].\nObject 5 : fence at [0.024, 0.502, 0.996, 0.709].\nObject 6 : grass at [0.004, 0.529, 0.994, 0.997].\nObject 7 : man at [0.144, 0.360, 0.246, 0.736].\nObject 8 : man at [0.730, 0.474, 0.780, 0.613].\nObject 9 : pants at [0.732, 0.529, 0.778, 0.604].\nObject 10 : shirt at [0.164, 0.411, 0.222, 0.547].\nObject 11 : shorts at [0.162, 0.535, 0.220, 0.628].\nObject 12 : sign at [0.916, 0.405, 0.934, 0.438].\nObject 13 : sky at [0.006, 0.021, 0.990, 0.279].\nObject 14 : sneakers at [0.180, 0.709, 0.216, 0.739].\nObject 15 : sneakers at [0.762, 0.598, 0.776, 0.613].\nObject 16 : tennis at [0.012, 0.384, 0.984, 0.934].\nObject 17 : tennis court at [0.000, 0.372, 0.988, 0.979].\nObject 18 : tennis racket at [0.768, 0.526, 0.808, 0.556].\nObject 19 : tennis racket at [0.214, 0.574, 0.238, 0.619].\nObject 20 : trees at [0.586, 0.282, 0.692, 0.420].\nObject 21 : white at [0.734, 0.492, 0.778, 0.601].\n\nRelationships:\nobject 7 : man -> in -> object 10 : shirt.\nobject 7 : man -> with -> object 19 : tennis racket.\nobject 7 : man -> plays -> object 16 : tennis.\nobject 7 : man -> wears -> object 14 : sneakers.\nobject 8 : man -> wears -> object 15 : sneakers.\nobject 7 : man -> wears -> object 11 : shorts.\nobject 8 : man -> wears -> object 9 : pants.\nobject 5 : fence -> has -> object 1 : border.\nobject 20 : trees -> behind -> object 3 : bushes.\nobject 2 : boundary lines -> on -> object 17 : tennis court.\nobject 2 : boundary lines -> on -> object 6 : grass.\nobject 3 : bushes -> behind -> object 4 : fence.\nobject 20 : trees -> behind -> object 4 : fence.\nobject 7 : man -> has -> object 19 : tennis racket.\nobject 8 : 
man -> wears -> object 21 : white.\nobject 4 : fence -> around -> object 17 : tennis court.\nobject 20 : trees -> behind -> object 8 : man.\nobject 6 : grass -> on -> object 17 : tennis court.\nobject 8 : man -> has -> object 18 : tennis racket.\nobject 8 : man -> hitting -> object 0 : ball.\nobject 5 : fence -> on -> object 17 : tennis court.\n\nRegion Description:\nRegion Description at [0.024, 0.489, 0.998, 0.730] : The tennis net separating the sides of the players..\nRegion Description at [0.144, 0.652, 0.234, 0.745] : The black sneakers the player is wearing..\nRegion Description at [0.720, 0.577, 0.784, 0.613] : The white sneakers the player is wearing..\nRegion Description at [0.158, 0.544, 0.230, 0.628] : The gray shorts the player is wearing..\nRegion Description at [0.006, 0.402, 0.998, 0.574] : The trimmed bushes behind the player..\nRegion Description at [0.008, 0.168, 0.998, 0.402] : The trees behind the trimmed bushes behind the player..\nRegion Description at [0.006, 0.604, 0.998, 0.985] : The white boundary lines on the tennis court..\nRegion Description at [0.020, 0.447, 0.994, 0.760] : A black and white net stretches across the field.\nRegion Description at [0.060, 0.526, 0.984, 0.985] : The field has green grass with white lines.\nRegion Description at [0.016, 0.369, 0.978, 0.595] : A tall green shrub is behind the fence.\nRegion Description at [0.034, 0.150, 0.984, 0.393] : Trees are seen behind the fence and shrub.\nRegion Description at [0.588, 0.327, 0.850, 0.703] : The yellow ball is flying towards the man.\nRegion Description at [0.902, 0.378, 0.956, 0.529] : A black circular sign with the number five.\nRegion Description at [0.142, 0.354, 0.248, 0.736] : male in white t-shirt playing tennis.\nRegion Description at [0.200, 0.565, 0.244, 0.625] : Head of tennis racket of man playing.\nRegion Description at [0.726, 0.465, 0.786, 0.631] : Man in white preparing to hit tennis ball.\n\nGlobal Caption:\nTwo men playing a game of tennis on a 
court.\ntwo people playing tennis with rackets on a grass court\nTwo young men playing a game of tennis.\nPeople playing tennis on a court surrounded by green hedges.\ntHERE ARE TWO MEN PLAYING TENNIS ON THE TENNIS COURT"}
{"question_id": 20, "image": "000000452122.jpg", "category": "ground_conv", "text": "Objects:\nObject 0 : airline at [0.408, 0.420, 0.758, 0.502].\nObject 1 : airplane at [0.112, 0.300, 0.858, 0.640].\nObject 2 : engine at [0.652, 0.529, 0.730, 0.592].\nObject 3 : engine at [0.494, 0.502, 0.574, 0.577].\nObject 4 : fin at [0.208, 0.303, 0.320, 0.492].\nObject 5 : fin at [0.116, 0.480, 0.284, 0.526].\nObject 6 : front door at [0.752, 0.435, 0.772, 0.483].\nObject 7 : gear at [0.450, 0.592, 0.600, 0.643].\nObject 8 : letters at [0.694, 0.489, 0.732, 0.520].\nObject 9 : name at [0.398, 0.426, 0.760, 0.489].\nObject 10 : sky at [0.000, 0.000, 0.998, 1.000].\nObject 11 : window at [0.806, 0.438, 0.844, 0.456].\nObject 12 : windows at [0.326, 0.450, 0.750, 0.532].\nObject 13 : wing at [0.152, 0.426, 0.598, 0.538].\nObject 14 : wing at [0.116, 0.492, 0.282, 0.538].\n\nRelationships:\nobject 6 : front door -> of -> object 1 : airplane.\n\nRegion Description:\n\nGlobal Caption:\nAn airplane flying in the air during the day.\nA large aircraft is shown in the air.\nThe large jumbo jet has it's landing gear lowered.\nA large white airplane flies in the gray sky.\nAn airplane in route with a cloudy sky behind it."}
{"question_id": 21, "image": "000000134722.jpg", "category": "ground_conv", "text": "Objects:\nObject 0 : awning at [0.886, 0.000, 1.000, 0.240].\nObject 1 : awning at [0.000, 0.299, 0.132, 0.403].\nObject 2 : bench at [0.000, 0.592, 0.066, 0.683].\nObject 3 : building at [0.000, 0.299, 0.204, 0.659].\nObject 4 : canopy at [0.000, 0.301, 0.136, 0.400].\nObject 5 : car at [0.290, 0.400, 0.998, 0.784].\nObject 6 : clouds at [0.374, 0.067, 0.920, 0.312].\nObject 7 : door opening at [0.658, 0.501, 0.682, 0.680].\nObject 8 : door opening at [0.678, 0.509, 0.710, 0.675].\nObject 9 : exterior at [0.000, 0.400, 0.200, 0.669].\nObject 10 : front at [0.294, 0.400, 0.494, 0.739].\nObject 11 : gravel at [0.090, 0.837, 0.334, 0.997].\nObject 12 : headlights at [0.416, 0.624, 0.446, 0.656].\nObject 13 : headlights at [0.300, 0.624, 0.324, 0.651].\nObject 14 : markings at [0.606, 0.821, 0.770, 0.928].\nObject 15 : panel at [0.304, 0.421, 0.450, 0.677].\nObject 16 : pole at [0.030, 0.419, 0.062, 0.656].\nObject 17 : railway tracks at [0.000, 0.752, 0.520, 0.944].\nObject 18 : side walk at [0.192, 0.712, 1.000, 0.997].\nObject 19 : sky at [0.000, 0.000, 0.998, 0.560].\nObject 20 : train stop at [0.000, 0.000, 1.000, 1.000].\nObject 21 : trees at [0.208, 0.253, 0.322, 0.653].\nObject 22 : trim at [0.000, 0.333, 0.132, 0.403].\nObject 23 : wall at [0.000, 0.392, 0.206, 0.611].\nObject 24 : wheel at [0.844, 0.669, 0.884, 0.728].\nObject 25 : wheel at [0.792, 0.675, 0.840, 0.747].\nObject 26 : wheel at [0.516, 0.691, 0.620, 0.808].\nObject 27 : window at [0.316, 0.451, 0.458, 0.595].\nObject 28 : windows at [0.700, 0.547, 0.848, 0.632].\nObject 29 : windsheild wipers at [0.348, 0.499, 0.410, 0.584].\n\nRelationships:\nobject 6 : clouds -> in -> object 19 : sky.\nobject 2 : bench -> in -> object 4 : canopy.\nobject 22 : trim -> on -> object 1 : awning.\nobject 11 : gravel -> next to -> object 17 : railway tracks.\nobject 14 : markings -> on side of -> object 18 : side walk.\nobject 5 : 
car -> on -> object 17 : railway tracks.\n\nRegion Description:\nRegion Description at [0.288, 0.392, 0.510, 0.741] : the front of the train is yellow and white.\nRegion Description at [0.320, 0.451, 0.460, 0.592] : the front window of the train has windshield wipers.\nRegion Description at [0.292, 0.592, 0.456, 0.739] : the headlights are on front of the train.\nRegion Description at [0.010, 0.405, 0.220, 0.736] : a red brick wall is near the platform.\nRegion Description at [0.000, 0.288, 0.128, 0.707] : an aluminum canopy is on the platform.\nRegion Description at [0.016, 0.325, 0.100, 0.672] : a red steel pole is holding up the awning.\nRegion Description at [0.306, 0.395, 0.998, 0.733] : the train has windowed passenger cars.\nRegion Description at [0.300, 0.427, 0.492, 0.693] : the yellow and white front of a train.\nRegion Description at [0.510, 0.744, 0.834, 0.891] : white painted line beside a train track.\nRegion Description at [0.298, 0.408, 0.468, 0.661] : a yellow panel on the front of the train.\nRegion Description at [0.002, 0.397, 0.210, 0.675] : a red brick building on the side of the tracks.\nRegion Description at [0.844, 0.000, 0.998, 0.248] : an awning of a structure next to the train tracks.\nRegion Description at [0.294, 0.360, 0.516, 0.787] : front of a train car in yellow, white and blue.\nRegion Description at [0.194, 0.221, 0.286, 0.901] : trees on the side of a train station.\nRegion Description at [0.580, 0.821, 0.764, 0.931] : markings on the side of railway tracks.\nRegion Description at [0.632, 0.491, 0.726, 0.691] : white, blue and grey doors on the side of a train car.\nRegion Description at [0.500, 0.096, 0.916, 0.531] : skyline on the side of a train station.\n\nGlobal Caption:\nFast commuter train moving past an outdoor platform.\nA train on the track pulling by a train station.\nA train pulling into a station outside during the day.\nA passenger train moving through a rail yard\na long passenger train pulling up to a station"}
{"question_id": 22, "image": "000000360960.jpg", "category": "ground_conv", "text": "Objects:\nObject 0 : coat at [0.405, 0.332, 0.835, 0.746].\nObject 1 : decorative square at [0.000, 0.382, 1.000, 1.000].\nObject 2 : hat at [0.006, 0.162, 0.072, 0.198].\nObject 3 : jacket at [0.078, 0.222, 0.318, 0.430].\nObject 4 : jeans at [0.853, 0.422, 1.000, 0.632].\nObject 5 : leg at [0.853, 0.456, 0.928, 0.610].\nObject 6 : leg at [0.210, 0.458, 0.303, 0.638].\nObject 7 : leg at [0.000, 0.458, 0.060, 0.630].\nObject 8 : man at [0.066, 0.162, 0.318, 0.686].\nObject 9 : man at [0.850, 0.156, 1.000, 0.652].\nObject 10 : man at [0.390, 0.344, 0.838, 0.894].\nObject 11 : pants at [0.523, 0.736, 0.739, 0.858].\nObject 12 : person at [0.000, 0.162, 0.135, 0.668].\nObject 13 : person at [0.853, 0.154, 1.000, 0.650].\nObject 14 : section at [0.000, 0.134, 1.000, 1.000].\nObject 15 : sidewalk at [0.000, 0.388, 1.000, 1.000].\nObject 16 : umbrella at [0.168, 0.106, 0.910, 0.366].\nObject 17 : uniform at [0.000, 0.222, 0.126, 0.646].\nObject 18 : uniform at [0.105, 0.218, 0.318, 0.628].\n\nRelationships:\nobject 10 : man -> wearing -> object 11 : pants.\nobject 10 : man -> wearing -> object 0 : coat.\nobject 9 : man -> wearing -> object 4 : jeans.\nobject 8 : man -> wearing -> object 2 : hat.\nobject 8 : man -> wearing -> object 3 : jacket.\nobject 16 : umbrella -> has -> object 14 : section.\nobject 5 : leg -> of -> object 13 : person.\nobject 7 : leg -> of -> object 12 : person.\nobject 12 : person -> in -> object 17 : uniform.\n\nRegion Description:\nRegion Description at [0.066, 0.164, 0.318, 0.686] : the back of a man in a black uniform.\nRegion Description at [0.393, 0.324, 0.871, 0.766] : THIS MAN IS WEARING A LONG BLACK COAT.\nRegion Description at [0.468, 0.142, 0.634, 0.356] : THIS IS A RED SECTION ON THE UMBRELLA.\nRegion Description at [0.168, 0.140, 0.523, 0.292] : THIS IS A YELLOW SECTION ON THE UMBRELLA.\nRegion Description at [0.568, 0.138, 0.919, 0.232] : THIS IS A 
GREEN SECTION OF THE UMBRELLA.\n\nGlobal Caption:\nSeveral people walking on a sidewalk, with one man holding an umbrella.\nA person walking while carrying a rainbow umbrella\nA person is holding up a large colorful umbrella\na person walking down the street carrying a rainbow colored umbrella\nA person walking in a square carrying a rainbow colored umbrella."}
{"question_id": 23, "image": "000000179765.jpg", "category": "ground_conv", "text": "Objects:\nObject 0 : back tire at [0.574, 0.496, 0.860, 0.800].\nObject 1 : bike at [0.146, 0.109, 0.938, 0.803].\nObject 2 : bike indicators at [0.238, 0.363, 0.264, 0.389].\nObject 3 : car at [0.000, 0.077, 0.086, 0.157].\nObject 4 : display at [0.240, 0.275, 0.290, 0.328].\nObject 5 : exhaust pipe at [0.460, 0.661, 0.818, 0.773].\nObject 6 : front tire at [0.146, 0.419, 0.366, 0.637].\nObject 7 : front wheel at [0.150, 0.424, 0.366, 0.635].\nObject 8 : garage door at [0.000, 0.000, 0.214, 0.341].\nObject 9 : handle at [0.284, 0.109, 0.390, 0.384].\nObject 10 : honda logo at [0.322, 0.395, 0.378, 0.419].\nObject 11 : house at [0.420, 0.000, 0.736, 0.149].\nObject 12 : leather seat at [0.496, 0.355, 0.792, 0.517].\nObject 13 : light at [0.894, 0.411, 0.944, 0.520].\nObject 14 : orange light at [0.280, 0.419, 0.296, 0.467].\nObject 15 : shock at [0.258, 0.477, 0.296, 0.568].\nObject 16 : shock absorber at [0.626, 0.501, 0.698, 0.680].\nObject 17 : shrubs at [0.628, 0.021, 0.764, 0.200].\nObject 18 : small windshield at [0.210, 0.120, 0.256, 0.291].\nObject 19 : sylencer at [0.462, 0.645, 0.816, 0.779].\nObject 20 : trees at [0.256, 0.003, 0.444, 0.205].\n\nRelationships:\nobject 1 : bike -> has -> object 7 : front wheel.\nobject 1 : bike -> has -> object 0 : back tire.\nobject 1 : bike -> has -> object 19 : sylencer.\nobject 1 : bike -> has -> object 16 : shock absorber.\nobject 1 : bike -> has -> object 13 : light.\nobject 9 : handle -> on -> object 1 : bike.\nobject 4 : display -> on -> object 1 : bike.\n\nRegion Description:\n\nGlobal Caption:\nA black Honda motorcycle parked in front of a garage.\nA Honda motorcycle parked in a grass driveway\nA black Honda motorcycle with a dark burgundy seat.\nMa motorcycle parked on the gravel in front of a garage\nA motorcycle with its brake extended standing outside"}
{"question_id": 24, "image": "000000332318.jpg", "category": "ground_conv", "text": "Objects:\nObject 0 : background at [0.000, 0.000, 1.002, 0.997].\nObject 1 : bench at [0.604, 0.967, 0.672, 0.997].\nObject 2 : cow at [0.548, 0.860, 0.574, 0.896].\nObject 3 : cow at [0.436, 0.860, 0.454, 0.890].\nObject 4 : fence at [0.698, 0.949, 0.852, 0.997].\nObject 5 : moutain at [0.000, 0.057, 0.992, 0.782].\nObject 6 : pasture at [0.000, 0.815, 0.984, 1.000].\nObject 7 : peak at [0.744, 0.042, 0.898, 0.119].\nObject 8 : sky at [0.000, 0.000, 1.002, 0.257].\nObject 9 : snow at [0.210, 0.036, 0.962, 0.445].\nObject 10 : trailer at [0.796, 0.910, 0.894, 0.997].\nObject 11 : trailer at [0.632, 0.899, 0.742, 0.994].\nObject 12 : tree at [0.740, 0.409, 1.000, 0.982].\nObject 13 : tree at [0.638, 0.284, 0.652, 0.301].\n\nRelationships:\nobject 11 : trailer -> in -> object 6 : pasture.\nobject 5 : moutain -> has -> object 9 : snow.\nobject 6 : pasture -> near -> object 5 : moutain.\nobject 3 : cow -> in -> object 6 : pasture.\nobject 2 : cow -> in -> object 6 : pasture.\nobject 9 : snow -> on -> object 5 : moutain.\nobject 5 : moutain -> covered in -> object 9 : snow.\nobject 5 : moutain -> has -> object 7 : peak.\nobject 2 : cow -> in -> object 6 : pasture.\nobject 5 : moutain -> in -> object 0 : background.\nobject 5 : moutain -> has -> object 9 : snow.\nobject 11 : trailer -> near -> object 12 : tree.\nobject 5 : moutain -> has -> object 13 : tree.\nobject 7 : peak -> covered with -> object 9 : snow.\n\nRegion Description:\nRegion Description at [0.784, 0.901, 0.934, 0.991] : storage container for animal equipment.\nRegion Description at [0.828, 0.060, 0.880, 0.125] : The mountain is partially covered in snow..\nRegion Description at [0.840, 0.899, 0.920, 0.997] : horse trailer or cow trailer is silvertone, rectangular.\nRegion Description at [0.606, 0.919, 0.640, 0.982] : smaller trailer, white w/ brown+orange stripe.\nRegion Description at [0.060, 0.472, 0.540, 0.806] : a 
bare patch of earth amid lush green growth.\nRegion Description at [0.034, 0.839, 0.812, 0.973] : tiny cattle-containing fenceposts in the distance.\nRegion Description at [0.902, 0.827, 0.990, 0.997] : a split tree trunk in shadow, beneath leaves, shadow on ground.\nRegion Description at [0.734, 0.919, 0.802, 0.994] : an older station wagon/suv-type van thing.\nRegion Description at [0.090, 0.854, 0.124, 0.904] : a black & white animal stands alone, away from brown brethren, in the far distance.\n\nGlobal Caption:\nCows lounge in a field with a mountain backdrop.\nA VERY BIG MOUNTAIN AND ANIMALS SPREAD ACROSS A FARM.\nSeveral herd animals are on the grass by a mountain.\nCattle on a level pasture in a mountainous area.\nA bunch of cattle relax in a pasture located in the mountains"}
{"question_id": 25, "image": "000000305695.jpg", "category": "ground_conv", "text": "Objects:\nObject 0 : animal at [0.358, 0.509, 0.434, 0.664].\nObject 1 : area at [0.464, 0.531, 0.934, 0.848].\nObject 2 : branches at [0.266, 0.107, 0.470, 0.272].\nObject 3 : bushes at [0.598, 0.424, 0.622, 0.445].\nObject 4 : grass at [0.544, 0.659, 0.840, 0.859].\nObject 5 : hill at [0.574, 0.323, 0.624, 0.376].\nObject 6 : leaves at [0.808, 0.293, 0.918, 0.360].\nObject 7 : license plate at [0.000, 0.691, 0.064, 0.747].\nObject 8 : light at [0.170, 0.557, 0.186, 0.632].\nObject 9 : park at [0.250, 0.192, 0.818, 0.664].\nObject 10 : road at [0.180, 0.709, 0.432, 0.957].\nObject 11 : sky at [0.448, 0.053, 0.532, 0.187].\nObject 12 : tire at [0.070, 0.728, 0.130, 0.795].\nObject 13 : tree at [0.000, 0.000, 0.478, 0.600].\nObject 14 : trees at [0.128, 0.000, 0.592, 0.597].\nObject 15 : truck at [0.000, 0.416, 0.210, 0.805].\nObject 16 : zebras at [0.730, 0.496, 0.796, 0.581].\n\nRelationships:\nobject 7 : license plate -> on -> object 15 : truck.\nobject 12 : tire -> on -> object 15 : truck.\nobject 5 : hill -> in -> object 9 : park.\nobject 0 : animal -> in -> object 1 : area.\nobject 13 : tree -> has -> object 6 : leaves.\nobject 0 : animal -> on -> object 1 : area.\nobject 15 : truck -> on -> object 10 : road.\nobject 10 : road -> with -> object 15 : truck.\nobject 3 : bushes -> on -> object 1 : area.\nobject 16 : zebras -> in -> object 1 : area.\nobject 2 : branches -> on -> object 13 : tree.\n\nRegion Description:\nRegion Description at [0.338, 0.480, 0.438, 0.680] : zebra watching in opposite direction.\n\nGlobal Caption:\nZebras are grazing on grass by a car.\nZebras are standing in a fenced in area.\nA herd of zebras stand under tress near a road. \nSeveral zebras are on the grass by a truck. \nA bunch of zebras grazing near a road where vehicles are driving by."}
{"question_id": 26, "image": "000000326174.jpg", "category": "ground_conv", "text": "Objects:\nObject 0 : beach at [0.000, 0.720, 0.998, 1.000].\nObject 1 : boy at [0.792, 0.480, 0.938, 0.853].\nObject 2 : child at [0.322, 0.587, 0.376, 0.835].\nObject 3 : child at [0.320, 0.587, 0.374, 0.835].\nObject 4 : girl at [0.444, 0.539, 0.534, 0.856].\nObject 5 : man at [0.140, 0.443, 0.216, 0.845].\nObject 6 : man at [0.434, 0.459, 0.500, 0.760].\nObject 7 : man at [0.578, 0.459, 0.682, 0.845].\nObject 8 : ocean waters at [0.590, 0.419, 0.892, 0.629].\nObject 9 : people at [0.206, 0.456, 0.352, 0.851].\nObject 10 : person at [0.792, 0.480, 0.936, 0.851].\nObject 11 : shirt at [0.592, 0.496, 0.670, 0.629].\nObject 12 : shore at [0.000, 0.360, 0.998, 0.997].\nObject 13 : surfboard at [0.306, 0.709, 0.538, 0.853].\nObject 14 : surfboard at [0.790, 0.587, 0.960, 0.691].\nObject 15 : water at [0.384, 0.368, 0.544, 0.435].\nObject 16 : waves at [0.656, 0.709, 0.794, 0.779].\nObject 17 : wetsuit at [0.326, 0.629, 0.372, 0.773].\nObject 18 : woman at [0.208, 0.499, 0.304, 0.629].\n\nRelationships:\nobject 1 : boy -> holding -> object 14 : surfboard.\nobject 5 : man -> and -> object 18 : woman.\nobject 18 : woman -> and -> object 3 : child.\nobject 16 : waves -> coming to -> object 12 : shore.\nobject 7 : man -> looking down to -> object 15 : water.\nobject 2 : child -> with -> object 17 : wetsuit.\nobject 6 : man -> looking back to -> object 4 : girl.\nobject 4 : girl -> pulling -> object 13 : surfboard.\nobject 9 : people -> on -> object 0 : beach.\nobject 7 : man -> wearing -> object 11 : shirt.\n\nRegion Description:\nRegion Description at [0.096, 0.437, 0.970, 0.872] : Seven people headed to the water to surf..\nRegion Description at [0.390, 0.531, 0.540, 0.851] : Girl in yellow shirt and pony tail. 
.\nRegion Description at [0.312, 0.581, 0.374, 0.851] : Small child with red and black wetsuit..\nRegion Description at [0.578, 0.443, 0.688, 0.856] : Man with white shirt and grey wetsuit pants..\nRegion Description at [0.436, 0.440, 0.534, 0.872] : Man looking back to girl pulling surfboard..\nRegion Description at [0.444, 0.459, 0.552, 0.853] : A man and a little girl having a conversation.\nRegion Description at [0.104, 0.419, 0.314, 0.851] : A man and a woman walking toward the water.\n\nGlobal Caption:\nA group of people are taking surfing lessons.\nA group of men, women and children walking toward the water with surfboards.\nA mixed age group is going toward the ocean with surfboards.\nA group of surfers are carrying their surf boards into the ocean.\nSeveral people are getting ready to enter the water for surfing."}
{"question_id": 27, "image": "000000562207.jpg", "category": "ground_conv", "text": "Objects:\nObject 0 : body at [0.166, 0.539, 0.296, 0.997].\nObject 1 : boot at [0.594, 0.753, 0.620, 0.870].\nObject 2 : boot at [0.620, 0.744, 0.658, 0.858].\nObject 3 : bucket at [0.268, 0.744, 0.322, 0.828].\nObject 4 : clouds at [0.156, 0.000, 0.968, 0.328].\nObject 5 : ear at [0.590, 0.226, 0.638, 0.410].\nObject 6 : ear at [0.368, 0.208, 0.448, 0.434].\nObject 7 : elephant at [0.328, 0.157, 0.638, 0.967].\nObject 8 : eye at [0.476, 0.319, 0.504, 0.346].\nObject 9 : foot at [0.436, 0.901, 0.516, 0.958].\nObject 10 : grass at [0.950, 0.759, 0.996, 0.807].\nObject 11 : leg at [0.498, 0.572, 0.548, 0.898].\nObject 12 : leg at [0.408, 0.512, 0.516, 0.955].\nObject 13 : man at [0.582, 0.476, 0.662, 0.870].\nObject 14 : man at [0.164, 0.455, 0.292, 0.997].\nObject 15 : mountains at [0.000, 0.265, 0.376, 0.470].\nObject 16 : rock at [0.736, 0.895, 0.762, 0.934].\nObject 17 : sand at [0.240, 0.687, 0.998, 1.000].\nObject 18 : shirt at [0.582, 0.521, 0.650, 0.681].\nObject 19 : shorts at [0.174, 0.699, 0.254, 0.864].\nObject 20 : side at [0.236, 0.675, 0.994, 0.997].\nObject 21 : skirt at [0.298, 0.687, 0.360, 0.810].\nObject 22 : sky at [0.004, 0.000, 0.998, 0.355].\nObject 23 : top at [0.302, 0.539, 0.358, 0.696].\nObject 24 : tree at [0.012, 0.407, 0.076, 0.500].\nObject 25 : trunk at [0.506, 0.392, 0.600, 0.964].\nObject 26 : watch at [0.172, 0.711, 0.192, 0.732].\nObject 27 : water at [0.000, 0.488, 0.994, 1.000].\nObject 28 : woman at [0.288, 0.473, 0.420, 0.967].\n\nRelationships:\nobject 7 : elephant -> on -> object 20 : side.\nobject 28 : woman -> touching -> object 7 : elephant.\nobject 14 : man -> standing on -> object 20 : side.\nobject 14 : man -> standing beside -> object 7 : elephant.\nobject 10 : grass -> on -> object 20 : side.\nobject 28 : woman -> wearing -> object 23 : top.\nobject 13 : man -> wearing -> object 18 : shirt.\nobject 13 : man -> wearing -> object 1 : 
boot.\nobject 13 : man -> wearing -> object 2 : boot.\nobject 28 : woman -> touching -> object 7 : elephant.\nobject 7 : elephant -> has -> object 25 : trunk.\nobject 14 : man -> wearing -> object 19 : shorts.\nobject 28 : woman -> petting -> object 7 : elephant.\nobject 14 : man -> with -> object 7 : elephant.\nobject 28 : woman -> with -> object 7 : elephant.\nobject 13 : man -> with -> object 7 : elephant.\nobject 25 : trunk -> of -> object 7 : elephant.\nobject 25 : trunk -> of -> object 7 : elephant.\nobject 9 : foot -> of an -> object 7 : elephant.\nobject 25 : trunk -> of -> object 7 : elephant.\nobject 11 : leg -> of -> object 7 : elephant.\nobject 12 : leg -> of -> object 7 : elephant.\nobject 5 : ear -> of -> object 7 : elephant.\nobject 6 : ear -> of -> object 7 : elephant.\nobject 8 : eye -> of -> object 7 : elephant.\nobject 27 : water -> behind -> object 7 : elephant.\n\nRegion Description:\nRegion Description at [0.338, 0.139, 0.618, 0.967] : the elephant standing on the lake side.\nRegion Description at [0.154, 0.392, 0.300, 0.964] : a man standing on the lake side with shorts.\nRegion Description at [0.574, 0.422, 0.686, 0.910] : the man standing beside the elephant.\nRegion Description at [0.292, 0.485, 0.378, 0.705] : this lady is wearing a blue tank top.\nRegion Description at [0.722, 0.768, 0.988, 0.964] : the sand is brown with green grass growing in it.\nRegion Description at [0.156, 0.669, 0.270, 0.910] : the man is wearing grey black and white shorts.\nRegion Description at [0.504, 0.560, 0.568, 0.898] : The front right leg of the elephant..\nRegion Description at [0.310, 0.536, 0.358, 0.690] : The light blue tank top the girl is wearing..\nRegion Description at [0.262, 0.732, 0.326, 0.825] : The black bucket in the girl's hand..\nRegion Description at [0.002, 0.443, 0.992, 0.994] : The water behind the people and the elephant..\n\nGlobal Caption:\nA group of people are standing next to an elephant emerging from the water.\na group of 
people stand beside of a giant elephant \nThree tourists pose for a picture next to an elephant.\nThree people stand with an elephant in front of a stream.\nThree people standing next to an elephant along a river."}
{"question_id": 28, "image": "000000543300.jpg", "category": "ground_conv", "text": "Objects:\nObject 0 : boat at [0.048, 0.552, 0.928, 0.819].\nObject 1 : building at [0.328, 0.493, 0.538, 0.613].\nObject 2 : building at [0.000, 0.467, 0.338, 0.651].\nObject 3 : building at [0.534, 0.096, 0.998, 0.637].\nObject 4 : canopies at [0.452, 0.504, 0.620, 0.600].\nObject 5 : container at [0.858, 0.643, 0.948, 0.712].\nObject 6 : dolphin at [0.282, 0.691, 0.344, 0.773].\nObject 7 : flag at [0.322, 0.563, 0.340, 0.597].\nObject 8 : ground at [0.822, 0.696, 0.880, 0.715].\nObject 9 : leaves at [0.002, 0.483, 0.080, 0.659].\nObject 10 : level at [0.000, 0.709, 1.000, 0.829].\nObject 11 : level at [0.068, 0.616, 0.852, 0.688].\nObject 12 : outdoor seating at [0.502, 0.579, 0.532, 0.624].\nObject 13 : pink writing at [0.414, 0.693, 0.654, 0.725].\nObject 14 : pole at [0.282, 0.416, 0.292, 0.515].\nObject 15 : railing at [0.094, 0.557, 0.728, 0.624].\nObject 16 : railing at [0.238, 0.597, 0.744, 0.627].\nObject 17 : reflection at [0.174, 0.808, 0.922, 0.848].\nObject 18 : roof at [0.000, 0.469, 0.280, 0.523].\nObject 19 : roof at [0.348, 0.509, 0.482, 0.568].\nObject 20 : roof at [0.920, 0.264, 0.980, 0.344].\nObject 21 : row at [0.700, 0.499, 0.878, 0.573].\nObject 22 : sea wall at [0.878, 0.712, 0.998, 0.819].\nObject 23 : shore at [0.000, 0.627, 0.996, 0.816].\nObject 24 : sky at [0.006, 0.000, 1.000, 0.517].\nObject 25 : steeple at [0.918, 0.088, 0.936, 0.237].\nObject 26 : symbol at [0.268, 0.688, 0.350, 0.779].\nObject 27 : symbol at [0.702, 0.693, 0.752, 0.725].\nObject 28 : tree at [0.472, 0.491, 0.592, 0.597].\nObject 29 : trees at [0.948, 0.573, 1.000, 0.691].\nObject 30 : trees at [0.000, 0.488, 0.080, 0.675].\nObject 31 : vehicle at [0.968, 0.653, 0.998, 0.693].\nObject 32 : water at [0.004, 0.813, 0.998, 0.992].\nObject 33 : water at [0.008, 0.717, 0.998, 0.981].\nObject 34 : window at [0.374, 0.733, 0.790, 0.765].\nObject 35 : window at [0.800, 0.491, 0.868, 
0.576].\nObject 36 : window at [0.928, 0.512, 0.950, 0.576].\nObject 37 : window at [0.892, 0.395, 0.912, 0.443].\nObject 38 : window at [0.894, 0.517, 0.910, 0.571].\nObject 39 : window at [0.630, 0.493, 0.652, 0.565].\nObject 40 : windows at [0.384, 0.637, 0.724, 0.685].\n\nRelationships:\nobject 40 : windows -> on -> object 0 : boat.\nobject 17 : reflection -> in -> object 33 : water.\nobject 29 : trees -> growing on -> object 23 : shore.\nobject 30 : trees -> growing on -> object 23 : shore.\nobject 28 : tree -> growing on -> object 23 : shore.\nobject 18 : roof -> on -> object 2 : building.\nobject 5 : container -> on -> object 22 : sea wall.\nobject 0 : boat -> in -> object 32 : water.\nobject 0 : boat -> has -> object 15 : railing.\n\nRegion Description:\nRegion Description at [0.414, 0.691, 0.662, 0.725] : the are red letters on the side of the cruise ship.\nRegion Description at [0.370, 0.707, 0.780, 0.763] : there is a long set of black windows on the side of the cruise ship.\nRegion Description at [0.870, 0.243, 0.992, 0.357] : there is a red roof on this building.\nRegion Description at [0.538, 0.400, 0.712, 0.549] : there is red and gray building in the background.\nRegion Description at [0.054, 0.595, 0.312, 0.821] : there is two levels on this cruise ship.\nRegion Description at [0.370, 0.587, 0.664, 0.621] : there is a silver railing on the top level of the cruise ship.\nRegion Description at [0.858, 0.621, 0.952, 0.717] : there is a blue container on the dock.\nRegion Description at [0.876, 0.707, 0.996, 0.787] : there is a gray sea wall beside the ship.\nRegion Description at [0.268, 0.723, 0.346, 0.787] : there are blue water symbols on the side of the cruise ship.\nRegion Description at [0.000, 0.619, 0.024, 0.712] : there is a blue and white sign on the dock.\nRegion Description at [0.662, 0.533, 0.904, 0.603] : An outdoor canopy creates shade for customers. 
.\n\nGlobal Caption:\nA boat sits on the side of the dock.\nA large white boat in the open water.\nA white double decker boat n water next to buildings.\nA large cruise ship is traveling on the ocean. \nA Port River Dolphin Cruise ship sits in the water."}
{"question_id": 29, "image": "000000241668.jpg", "category": "ground_conv", "text": "Objects:\nObject 0 : boutonniere at [0.710, 0.574, 0.799, 0.660].\nObject 1 : cake at [0.630, 0.670, 0.772, 0.750].\nObject 2 : cake crumb at [0.710, 0.348, 0.721, 0.356].\nObject 3 : crown at [0.370, 0.006, 0.549, 0.056].\nObject 4 : dress at [0.000, 0.574, 0.582, 1.000].\nObject 5 : eye at [0.649, 0.244, 0.699, 0.272].\nObject 6 : eye at [0.735, 0.264, 0.769, 0.280].\nObject 7 : eyebrow at [0.655, 0.230, 0.710, 0.250].\nObject 8 : eyebrow at [0.741, 0.252, 0.780, 0.264].\nObject 9 : finger at [0.721, 0.772, 0.816, 0.800].\nObject 10 : finger at [0.535, 0.740, 0.685, 0.826].\nObject 11 : ground at [0.003, 0.888, 0.997, 1.000].\nObject 12 : hair at [0.507, 0.142, 0.791, 0.642].\nObject 13 : hair at [0.189, 0.044, 0.652, 0.374].\nObject 14 : hand at [0.721, 0.720, 0.822, 0.818].\nObject 15 : hand at [0.493, 0.710, 0.685, 0.826].\nObject 16 : head at [0.209, 0.048, 0.652, 0.360].\nObject 17 : mouth at [0.646, 0.310, 0.724, 0.352].\nObject 18 : neck at [0.560, 0.344, 0.663, 0.460].\nObject 19 : necklace at [0.357, 0.334, 0.471, 0.484].\nObject 20 : necktie at [0.571, 0.442, 0.674, 0.936].\nObject 21 : paper at [0.760, 0.792, 0.914, 0.934].\nObject 22 : person at [0.490, 0.136, 0.825, 0.998].\nObject 23 : plate at [0.579, 0.734, 0.816, 0.768].\nObject 24 : purse at [0.774, 0.792, 0.883, 0.840].\nObject 25 : ring at [0.786, 0.780, 0.794, 0.796].\nObject 26 : shirt at [0.554, 0.376, 0.691, 0.950].\nObject 27 : suit jacket at [0.490, 0.422, 0.799, 0.998].\nObject 28 : table at [0.696, 0.816, 0.997, 0.916].\nObject 29 : toilet at [0.000, 0.656, 0.997, 0.936].\nObject 30 : wallpaper at [0.003, 0.000, 0.916, 0.656].\n\nRelationships:\nobject 21 : paper -> on top of -> object 11 : ground.\nobject 21 : paper -> by -> object 29 : toilet.\nobject 21 : paper -> on top of -> object 11 : ground.\nobject 21 : paper -> on top of -> object 11 : ground.\nobject 21 : paper -> by -> object 29 : 
toilet.\nobject 21 : paper -> by -> object 29 : toilet.\nobject 21 : paper -> sitting by -> object 29 : toilet.\nobject 21 : paper -> lying by -> object 29 : toilet.\nobject 21 : paper -> on top of -> object 11 : ground.\nobject 21 : paper -> lying by -> object 29 : toilet.\nobject 2 : cake crumb -> on side of -> object 17 : mouth.\nobject 24 : purse -> on top of -> object 28 : table.\nobject 5 : eye -> of a -> object 22 : person.\nobject 6 : eye -> of a -> object 22 : person.\nobject 7 : eyebrow -> of -> object 22 : person.\nobject 8 : eyebrow -> of -> object 22 : person.\nobject 10 : finger -> of -> object 15 : hand.\nobject 10 : finger -> of -> object 15 : hand.\nobject 10 : finger -> of -> object 15 : hand.\nobject 10 : finger -> of -> object 15 : hand.\nobject 3 : crown -> on top of -> object 16 : head.\nobject 20 : necktie -> worn on -> object 22 : person.\nobject 22 : person -> holding -> object 1 : cake.\nobject 14 : hand -> holding -> object 1 : cake.\nobject 22 : person -> wearing -> object 27 : suit jacket.\nobject 22 : person -> wearing -> object 4 : dress.\nobject 20 : necktie -> worn on -> object 18 : neck.\nobject 13 : hair -> on top of -> object 16 : head.\nobject 1 : cake -> on top of -> object 23 : plate.\nobject 25 : ring -> worn on -> object 9 : finger.\n\nRegion Description:\nRegion Description at [0.022, 0.020, 0.203, 0.312] : A green and yellow striped wallpaper.\nRegion Description at [0.000, 0.048, 0.613, 0.996] : woman wearing a strapless white wedding dress .\nRegion Description at [0.487, 0.136, 0.808, 0.986] : woman white red hair holding a piece of cake on a plate.\nRegion Description at [0.543, 0.674, 0.813, 0.826] : woman's hands holding a plate of cake.\nRegion Description at [0.579, 0.124, 0.788, 0.524] : red haired woman wearing a tie and suit jacket .\nRegion Description at [0.000, 0.012, 0.819, 0.996] : two people wearing formal wedding attire .\n\nGlobal Caption:\nThere are two people enjoying a wedding reception\nA woman in a 
wedding dress with another woman in a suit behind\nA woman in a wedding dress with another lady holding a piece of cake.\nA red head girl holding a piece of cake\nA bride is with a long red haired person with cake."}
{"question_id": 30, "image": "000000535578.jpg", "category": "ground_conv", "text": "Objects:\nObject 0 : bush at [0.480, 0.000, 0.748, 0.084].\nObject 1 : ear at [0.544, 0.544, 0.571, 0.562].\nObject 2 : field at [0.000, 0.002, 0.994, 0.998].\nObject 3 : hill at [0.000, 0.000, 0.997, 0.998].\nObject 4 : plant at [0.000, 0.764, 0.601, 0.998].\nObject 5 : rock at [0.727, 0.410, 0.808, 0.470].\nObject 6 : sheep at [0.532, 0.546, 0.646, 0.662].\nObject 7 : sheep at [0.532, 0.666, 0.817, 0.810].\nObject 8 : tail at [0.565, 0.572, 0.604, 0.610].\nObject 9 : tree at [0.649, 0.000, 0.997, 0.334].\nObject 10 : trees at [0.736, 0.036, 0.835, 0.100].\nObject 11 : wall at [0.000, 0.000, 0.769, 0.180].\nObject 12 : weed at [0.417, 0.346, 0.492, 0.390].\n\nRelationships:\nobject 7 : sheep -> in a -> object 2 : field.\nobject 7 : sheep -> grazing in -> object 2 : field.\nobject 7 : sheep -> grazing in a -> object 2 : field.\nobject 7 : sheep -> grazing in a -> object 2 : field.\nobject 7 : sheep -> grazing in a -> object 2 : field.\nobject 7 : sheep -> grazing in a -> object 2 : field.\nobject 11 : wall -> borders -> object 2 : field.\nobject 7 : sheep -> grazing in a -> object 2 : field.\nobject 5 : rock -> in -> object 2 : field.\nobject 7 : sheep -> grazing in -> object 2 : field.\nobject 7 : sheep -> grazing in -> object 2 : field.\nobject 5 : rock -> in -> object 2 : field.\nobject 0 : bush -> in -> object 2 : field.\nobject 10 : trees -> in -> object 2 : field.\nobject 6 : sheep -> has an -> object 1 : ear.\nobject 6 : sheep -> has a -> object 8 : tail.\nobject 12 : weed -> growing in -> object 2 : field.\nobject 7 : sheep -> on -> object 3 : hill.\nobject 4 : plant -> on -> object 2 : field.\nobject 5 : rock -> on -> object 3 : hill.\nobject 7 : sheep -> are in -> object 2 : field.\nobject 11 : wall -> running across -> object 2 : field.\nobject 0 : bush -> in -> object 2 : field.\nobject 0 : bush -> in -> object 2 : field.\nobject 5 : rock -> in -> object 2 : 
field.\nobject 6 : sheep -> has a -> object 8 : tail.\nobject 5 : rock -> in -> object 2 : field.\n\nRegion Description:\nRegion Description at [0.000, 0.072, 0.760, 0.160] : A stone wall boarding a field of sheep.\nRegion Description at [0.189, 0.032, 0.703, 0.178] : rocks and grass in the background of the pasture.\nRegion Description at [0.541, 0.662, 0.823, 0.802] : white sheep grazing in green grassy field.\nRegion Description at [0.538, 0.544, 0.646, 0.656] : white sheep grazing in green grassy field.\nRegion Description at [0.228, 0.374, 0.357, 0.436] : white sheep grazing in green grassy field.\nRegion Description at [0.607, 0.380, 0.712, 0.456] : white sheep grazing in green grassy field.\nRegion Description at [0.811, 0.296, 0.937, 0.338] : two white sheep grazing in green grassy field.\nRegion Description at [0.048, 0.200, 0.249, 0.242] : group of white sheep grazing in green grassy field.\nRegion Description at [0.213, 0.164, 0.336, 0.192] : group of white sheep grazing in green grassy field.\nRegion Description at [0.000, 0.006, 0.997, 0.172] : two long gray stone walls across field.\nRegion Description at [0.453, 0.000, 0.730, 0.062] : a stand of trees outside the stone fence.\n\nGlobal Caption:\nA group of sheep grazing in a grassy valley.\nSheep graze in a lushly green mountain meadow\nA flock of sheep walking along a grassy hillside grazing.\nA flock of sheep are grazing on a grassy slope.\nA group of sheep grazing in a grassy field."}
{"question_id": 31, "image": "000000443969.jpg", "category": "ground_conv", "text": "Objects:\nObject 0 : bold writings at [0.492, 0.770, 0.556, 0.810].\nObject 1 : bottle at [0.468, 0.642, 0.634, 0.916].\nObject 2 : cart at [0.232, 0.328, 0.808, 0.998].\nObject 3 : child at [0.408, 0.168, 0.606, 0.786].\nObject 4 : cleaner at [0.466, 0.634, 0.636, 0.916].\nObject 5 : floor at [0.000, 0.190, 1.000, 1.000].\nObject 6 : green shirt at [0.000, 0.180, 0.078, 0.540].\nObject 7 : houses at [0.000, 0.000, 0.240, 0.414].\nObject 8 : leaves at [0.894, 0.202, 0.910, 0.204].\nObject 9 : line at [0.796, 0.954, 0.996, 0.966].\nObject 10 : lines at [0.828, 0.398, 0.998, 0.568].\nObject 11 : metal at [0.514, 0.116, 0.558, 0.292].\nObject 12 : metal at [0.234, 0.336, 0.802, 0.998].\nObject 13 : metal part at [0.512, 0.862, 0.566, 0.992].\nObject 14 : pants at [0.432, 0.524, 0.574, 0.670].\nObject 15 : person at [0.110, 0.070, 0.258, 0.456].\nObject 16 : person at [0.412, 0.166, 0.604, 0.784].\nObject 17 : person at [0.000, 0.182, 0.216, 0.958].\nObject 18 : sandal at [0.070, 0.862, 0.180, 0.954].\nObject 19 : shirt at [0.128, 0.120, 0.216, 0.260].\nObject 20 : shorts at [0.140, 0.222, 0.216, 0.348].\nObject 21 : skirt at [0.000, 0.470, 0.214, 0.894].\nObject 22 : umbrella at [0.296, 0.038, 0.782, 0.360].\nObject 23 : woman at [0.286, 0.000, 0.802, 0.812].\nObject 24 : writings at [0.512, 0.838, 0.564, 0.868].\n\nRelationships:\nobject 3 : child -> holding -> object 22 : umbrella.\nobject 23 : woman -> pushing -> object 2 : cart.\nobject 21 : skirt -> on -> object 17 : person.\nobject 10 : lines -> on -> object 5 : floor.\nobject 20 : shorts -> on -> object 15 : person.\nobject 16 : person -> next to -> object 2 : cart.\nobject 16 : person -> wearing -> object 21 : skirt.\nobject 18 : sandal -> on -> object 17 : person.\nobject 6 : green shirt -> on -> object 16 : person.\nobject 14 : pants -> on -> object 3 : child.\n\nRegion Description:\nRegion Description at [0.298, 0.050, 
0.778, 0.422] : the opened umbrella the child is holding.\n\nGlobal Caption:\nA baby girl standing in a shopping cart holding an umbrella.\nA GIRL IS IN A GROCERY CART \nA little girl is riding in a shopping cart while holding her umbrella.\nA little girl inside of a shopping cart.\nA small child stands in a shopping cart with an umbrella."}
{"question_id": 32, "image": "000000329219.jpg", "category": "ground_conv", "text": "Objects:\nObject 0 : bearded face at [0.371, 0.064, 0.393, 0.094].\nObject 1 : blender at [0.015, 0.165, 0.080, 0.307].\nObject 2 : box at [0.176, 0.249, 0.228, 0.329].\nObject 3 : buttons at [0.038, 0.268, 0.048, 0.275].\nObject 4 : counter at [0.567, 0.340, 0.738, 0.395].\nObject 5 : counter at [0.000, 0.329, 0.576, 0.398].\nObject 6 : curtain at [0.429, 0.048, 0.504, 0.318].\nObject 7 : curtain at [0.227, 0.000, 0.309, 0.287].\nObject 8 : dog at [0.462, 0.593, 0.568, 0.842].\nObject 9 : door knob at [0.242, 0.477, 0.253, 0.499].\nObject 10 : drawer at [0.112, 0.370, 0.259, 0.452].\nObject 11 : drawer at [0.284, 0.382, 0.394, 0.439].\nObject 12 : faucet at [0.338, 0.327, 0.388, 0.357].\nObject 13 : floor at [0.000, 0.713, 1.000, 1.000].\nObject 14 : kitchen at [0.000, 0.000, 0.750, 0.849].\nObject 15 : knob at [0.179, 0.398, 0.197, 0.422].\nObject 16 : knob at [0.340, 0.400, 0.352, 0.420].\nObject 17 : man at [0.274, 0.000, 0.517, 0.792].\nObject 18 : mugs at [0.509, 0.123, 0.595, 0.266].\nObject 19 : outlet at [0.107, 0.212, 0.143, 0.256].\nObject 20 : shoes at [0.391, 0.735, 0.476, 0.786].\nObject 21 : spatula at [0.126, 0.003, 0.153, 0.094].\nObject 22 : tile at [0.526, 0.592, 0.557, 0.634].\nObject 23 : wall at [0.003, 0.000, 0.220, 0.294].\nObject 24 : wall at [0.506, 0.019, 0.607, 0.384].\nObject 25 : window at [0.303, 0.016, 0.392, 0.328].\nObject 26 : wire at [0.097, 0.233, 0.129, 0.319].\n\nRelationships:\nobject 17 : man -> standing in -> object 14 : kitchen.\nobject 18 : mugs -> hanging on -> object 24 : wall.\nobject 1 : blender -> with -> object 3 : buttons.\nobject 17 : man -> with -> object 0 : bearded face.\nobject 26 : wire -> hanging from -> object 23 : wall.\nobject 8 : dog -> on -> object 13 : floor.\nobject 1 : blender -> on -> object 5 : counter.\nobject 6 : curtain -> on -> object 25 : window.\nobject 20 : shoes -> on -> object 17 : man.\n\nRegion 
Description:\nRegion Description at [0.056, 0.214, 0.140, 0.277] : A dark electric cord plugged into the wall.\nRegion Description at [0.000, 0.662, 0.116, 0.940] : A latter with onely one rung visible.\nRegion Description at [0.004, 0.698, 0.999, 0.991] : Durable Tan and brown laminent flooring.\nRegion Description at [0.004, 0.324, 0.739, 0.880] : cheap waferboard constructed cabinets .\nRegion Description at [0.514, 0.126, 0.588, 0.262] : convient and accessable way to store coffee mugs.\nRegion Description at [0.222, 0.001, 0.510, 0.286] : small window curtians with paisley design.\nRegion Description at [0.347, 0.053, 0.490, 0.312] : light weight flanel design mens shirt .\nRegion Description at [0.222, 0.004, 0.315, 0.303] : gold and white curtain on a kitchen window.\nRegion Description at [0.511, 0.126, 0.589, 0.261] : coffee cups hanging on the kitchen wall.\nRegion Description at [0.012, 0.149, 0.091, 0.340] : gold colored blinder sits on the counter.\nRegion Description at [-0.001, 0.000, 0.157, 0.122] : cooking utensils hanging against wall.\n\nGlobal Caption:\nA man standing next to a dog on the ground.\nA man is at a kitchen counter by a dog.\nAn man standing in a kitchen with a small puppy.\nthere is a small puppy on the kitchen floor\nA man in the kitchen standing with his dog."}
{"question_id": 33, "image": "000000421923.jpg", "category": "ground_conv", "text": "Objects:\nObject 0 : block at [0.156, 0.630, 0.357, 0.822].\nObject 1 : book at [0.414, 0.208, 0.538, 0.364].\nObject 2 : book at [0.360, 0.202, 0.417, 0.360].\nObject 3 : book at [0.426, 0.484, 0.691, 0.522].\nObject 4 : book at [0.399, 0.404, 0.520, 0.554].\nObject 5 : bowl at [0.072, 0.030, 0.288, 0.076].\nObject 6 : center at [0.850, 0.732, 0.886, 0.766].\nObject 7 : eye at [0.282, 0.506, 0.327, 0.532].\nObject 8 : eye at [0.189, 0.506, 0.237, 0.534].\nObject 9 : flower at [0.796, 0.462, 0.982, 0.550].\nObject 10 : flower at [0.817, 0.528, 0.976, 0.612].\nObject 11 : flower at [0.760, 0.678, 0.946, 0.824].\nObject 12 : flower at [0.691, 0.608, 0.838, 0.722].\nObject 13 : flower at [0.913, 0.680, 1.000, 0.770].\nObject 14 : object at [0.213, 0.840, 0.583, 0.972].\nObject 15 : picture at [0.778, 0.060, 1.000, 0.352].\nObject 16 : shelf at [0.324, 0.528, 0.997, 0.624].\nObject 17 : shelf at [0.207, 0.334, 0.997, 0.380].\nObject 18 : shelf at [0.000, 0.028, 0.607, 0.202].\nObject 19 : stack at [0.435, 0.480, 0.712, 0.578].\nObject 20 : statue at [0.147, 0.404, 0.372, 0.652].\nObject 21 : table at [0.000, 0.690, 1.003, 0.998].\nObject 22 : vase at [0.838, 0.774, 0.994, 0.974].\nObject 23 : water at [0.847, 0.864, 0.997, 0.984].\n\nRelationships:\nobject 20 : statue -> on -> object 0 : block.\nobject 14 : object -> on -> object 21 : table.\nobject 1 : book -> on -> object 17 : shelf.\nobject 4 : book -> on -> object 16 : shelf.\nobject 5 : bowl -> on -> object 18 : shelf.\nobject 22 : vase -> has -> object 23 : water.\nobject 20 : statue -> has -> object 8 : eye.\nobject 20 : statue -> has -> object 7 : eye.\nobject 20 : statue -> on -> object 0 : block.\nobject 9 : flower -> in -> object 22 : vase.\nobject 10 : flower -> in -> object 22 : vase.\nobject 12 : flower -> in -> object 22 : vase.\nobject 13 : flower -> in -> object 22 : vase.\nobject 3 : book -> in -> object 19 : 
stack.\nobject 11 : flower -> has -> object 6 : center.\nobject 1 : book -> on -> object 17 : shelf.\nobject 2 : book -> on -> object 17 : shelf.\nobject 11 : flower -> has -> object 6 : center.\nobject 3 : book -> on -> object 19 : stack.\nobject 19 : stack -> on -> object 16 : shelf.\nobject 20 : statue -> on -> object 0 : block.\n\nRegion Description:\n\nGlobal Caption:\na glass vase with some flowers coming out of it \nA room witb a statue, bookshelves, books and a vase with flowers in it.\nA desk with a vase containing flowers, a sculpture of a man's head and shelves behind it.\nA statue next to a vase of flowers on a shelf. \nThe bust of a man's head is next to a vase of flowers."}
{"question_id": 34, "image": "000000376900.jpg", "category": "ground_conv", "text": "Objects:\nObject 0 : area at [0.000, 0.002, 0.995, 0.996].\nObject 1 : background at [0.000, 0.132, 0.997, 0.268].\nObject 2 : cap at [0.171, 0.388, 0.253, 0.476].\nObject 3 : green/tennis court at [0.005, 0.720, 0.880, 0.994].\nObject 4 : hand at [0.253, 0.648, 0.299, 0.680].\nObject 5 : head at [0.173, 0.408, 0.256, 0.474].\nObject 6 : line at [0.397, 0.778, 0.501, 0.996].\nObject 7 : man at [0.163, 0.274, 0.491, 0.936].\nObject 8 : photo at [0.005, 0.004, 0.968, 0.976].\nObject 9 : pole at [0.019, 0.162, 0.035, 0.258].\nObject 10 : ses at [0.912, 0.962, 0.992, 0.994].\nObject 11 : shadow at [0.397, 0.898, 0.968, 0.956].\nObject 12 : shorts at [0.216, 0.628, 0.432, 0.782].\nObject 13 : sock at [0.325, 0.840, 0.376, 0.890].\nObject 14 : sport at [0.144, 0.270, 0.515, 0.944].\nObject 15 : tennis racket at [0.235, 0.578, 0.304, 0.664].\nObject 16 : tennis shoe at [0.213, 0.880, 0.280, 0.930].\nObject 17 : tennis shoe at [0.299, 0.886, 0.405, 0.936].\nObject 18 : trees at [0.269, 0.192, 0.995, 0.250].\nObject 19 : wrist at [0.384, 0.318, 0.429, 0.360].\nObject 20 : wristband at [0.384, 0.318, 0.432, 0.360].\n\nRelationships:\nobject 7 : man -> wearing -> object 12 : shorts.\nobject 4 : hand -> holding -> object 15 : tennis racket.\nobject 2 : cap -> on mans -> object 5 : head.\nobject 5 : head -> of a -> object 7 : man.\nobject 7 : man -> wearing a -> object 2 : cap.\nobject 7 : man -> wearing a -> object 13 : sock.\nobject 18 : trees -> in -> object 1 : background.\nobject 14 : sport -> in -> object 0 : area.\nobject 20 : wristband -> on a -> object 19 : wrist.\nobject 2 : cap -> on -> object 5 : head.\nobject 11 : shadow -> of -> object 7 : man.\nobject 12 : shorts -> on -> object 7 : man.\n\nRegion Description:\nRegion Description at [0.163, 0.322, 0.579, 0.926] : The tennis player is wearing all white.\nRegion Description at [0.397, 0.858, 0.936, 0.968] : Tennis player's shadow 
cast in front of him.\nRegion Description at [0.219, 0.560, 0.309, 0.680] : a black tennis racket in a man's hand.\nRegion Description at [0.341, 0.538, 0.480, 0.728] : a line judge at the side of a tennis court.\n\nGlobal Caption:\nA tennis player prepares to serve a tennis ball.\na tennis player in all white playing on a court \nA tennis player is reaching up with one arm and has a racquet in the other hand. \nThe tennis player throws the ball up to serve\nSpectators watching a man swinging at a tennis ball."}
{"question_id": 35, "image": "000000513567.jpg", "category": "ground_conv", "text": "Objects:\nObject 0 : bag at [0.428, 0.435, 0.476, 0.528].\nObject 1 : bag at [0.322, 0.923, 0.498, 0.997].\nObject 2 : building at [0.000, 0.003, 0.158, 0.413].\nObject 3 : face at [0.246, 0.240, 0.374, 0.483].\nObject 4 : flag at [0.044, 0.013, 0.090, 0.149].\nObject 5 : girl at [0.538, 0.019, 0.968, 0.949].\nObject 6 : hand at [0.176, 0.680, 0.304, 0.821].\nObject 7 : hands at [0.660, 0.344, 0.756, 0.517].\nObject 8 : head at [0.560, 0.003, 0.822, 0.339].\nObject 9 : hot dog at [0.676, 0.315, 0.882, 0.408].\nObject 10 : hot dogs at [0.190, 0.587, 0.350, 0.741].\nObject 11 : jeans at [0.586, 0.843, 0.916, 0.995].\nObject 12 : lady at [0.572, 0.045, 0.952, 0.984].\nObject 13 : logo at [0.920, 0.069, 0.996, 0.165].\nObject 14 : man at [0.486, 0.235, 0.564, 0.509].\nObject 15 : man at [0.456, 0.213, 0.520, 0.317].\nObject 16 : maroon shirt at [0.546, 0.333, 0.928, 0.944].\nObject 17 : mouth at [0.288, 0.408, 0.356, 0.440].\nObject 18 : people at [0.552, 0.029, 0.876, 0.995].\nObject 19 : post at [0.104, 0.005, 0.138, 0.533].\nObject 20 : purse at [0.842, 0.661, 0.980, 0.888].\nObject 21 : purse strap at [0.270, 0.893, 0.390, 0.992].\nObject 22 : shadow at [0.934, 0.067, 0.996, 0.141].\nObject 23 : side at [0.922, 0.875, 0.998, 0.997].\nObject 24 : street at [0.042, 0.403, 0.092, 0.520].\nObject 25 : sunglasses at [0.630, 0.005, 0.794, 0.048].\nObject 26 : woman at [0.502, 0.000, 0.982, 0.997].\nObject 27 : woman at [0.102, 0.099, 0.486, 0.984].\nObject 28 : woman's shirt at [0.518, 0.320, 0.944, 0.949].\n\nRelationships:\nobject 0 : bag -> on -> object 15 : man.\nobject 13 : logo -> on -> object 2 : building.\nobject 25 : sunglasses -> on -> object 26 : woman.\nobject 25 : sunglasses -> on -> object 8 : head.\nobject 4 : flag -> on -> object 19 : post.\nobject 6 : hand -> holds -> object 10 : hot dogs.\nobject 27 : woman -> has -> object 17 : mouth.\nobject 12 : lady -> holding -> 
object 9 : hot dog.\nobject 9 : hot dog -> in -> object 7 : hands.\nobject 18 : people -> crossing -> object 24 : street.\nobject 27 : woman -> wearing -> object 11 : jeans.\nobject 5 : girl -> wears -> object 16 : maroon shirt.\n\nRegion Description:\nRegion Description at [0.038, 0.173, 0.540, 0.995] : Laughing girl in a green shirt holding a hotdog..\nRegion Description at [0.504, 0.000, 0.954, 0.989] : Black haired girl in maroon shirt wearing sunglasses on her head..\nRegion Description at [0.508, 0.000, 0.960, 0.979] : Girl looking at the hot dog she's holding in her hands.\nRegion Description at [0.040, 0.173, 0.536, 0.981] : Girl holding hot dog in her right hand.\nRegion Description at [0.926, 0.253, 0.998, 0.645] : Woman in a brown shirt and jeans crossing the street.\nRegion Description at [0.202, 0.563, 0.334, 0.995] : Blue purse strap around woman's shoulder.\nRegion Description at [0.146, 0.587, 0.370, 0.787] : woman holding hot dog in white napkin.\nRegion Description at [0.682, 0.229, 0.742, 0.315] : woman's mouth open looking at hot dog.\nRegion Description at [0.234, 0.213, 0.396, 0.507] : woman's face smiling with eyes closed.\n\nGlobal Caption:\nTwo Asian women eating chili dogs while standing on a street.\nTwo women preparing to eat a hot dog on a city side.\nThe woman are eating their hot dogs while walking.\nTwo young women are eating hot dogs while walking down the sidewalk.\nTwo women eat chili dogs on a city sidewalk. "}
{"question_id": 36, "image": "000000058393.jpg", "category": "ground_conv", "text": "Objects:\nObject 0 : arm at [0.658, 0.462, 0.828, 0.496].\nObject 1 : bench at [0.070, 0.493, 0.932, 0.960].\nObject 2 : concrete at [0.030, 0.810, 0.974, 0.997].\nObject 3 : foot at [0.724, 0.784, 0.782, 0.844].\nObject 4 : hair at [0.646, 0.367, 0.754, 0.472].\nObject 5 : hair at [0.564, 0.338, 0.652, 0.462].\nObject 6 : man at [0.542, 0.343, 0.812, 0.493].\nObject 7 : ocean at [0.028, 0.319, 0.972, 0.821].\nObject 8 : post at [0.090, 0.641, 0.102, 0.734].\nObject 9 : post at [0.924, 0.652, 0.944, 0.836].\nObject 10 : rail at [0.028, 0.620, 0.974, 0.660].\nObject 11 : seat at [0.072, 0.728, 0.928, 0.786].\nObject 12 : shoe at [0.720, 0.789, 0.782, 0.855].\nObject 13 : sky at [0.028, 0.037, 0.974, 0.325].\nObject 14 : slat at [0.072, 0.749, 0.928, 0.781].\nObject 15 : slat at [0.112, 0.499, 0.912, 0.522].\nObject 16 : slat at [0.126, 0.702, 0.912, 0.728].\nObject 17 : slat at [0.108, 0.594, 0.908, 0.625].\nObject 18 : slat at [0.106, 0.525, 0.908, 0.554].\nObject 19 : woman at [0.644, 0.377, 0.834, 0.863].\n\nRelationships:\nobject 6 : man -> sitting on -> object 1 : bench.\nobject 6 : man -> sitting with -> object 19 : woman.\nobject 6 : man -> has -> object 0 : arm.\nobject 0 : arm -> around -> object 19 : woman.\nobject 3 : foot -> wearing -> object 12 : shoe.\nobject 19 : woman -> has -> object 3 : foot.\nobject 3 : foot -> inside -> object 12 : shoe.\nobject 19 : woman -> looking at -> object 7 : ocean.\nobject 6 : man -> looking at -> object 7 : ocean.\nobject 19 : woman -> has -> object 4 : hair.\nobject 6 : man -> has -> object 5 : hair.\nobject 1 : bench -> in front of -> object 7 : ocean.\nobject 1 : bench -> in front of -> object 7 : ocean.\nobject 1 : bench -> backs up to -> object 1 : bench.\nobject 19 : woman -> sitting on -> object 1 : bench.\nobject 6 : man -> sitting on -> object 1 : bench.\nobject 19 : woman -> relaxing on -> object 1 : bench.\nobject 6 : man -> 
relaxing on -> object 1 : bench.\nobject 19 : woman -> facing -> object 7 : ocean.\nobject 6 : man -> facing -> object 7 : ocean.\nobject 19 : woman -> looking at -> object 7 : ocean.\nobject 6 : man -> looking at -> object 7 : ocean.\nobject 6 : man -> relaxing with -> object 19 : woman.\nobject 6 : man -> on bench with -> object 19 : woman.\nobject 19 : woman -> resting on -> object 1 : bench.\nobject 6 : man -> resting on -> object 1 : bench.\nobject 1 : bench -> near -> object 7 : ocean.\nobject 1 : bench -> near -> object 7 : ocean.\nobject 11 : seat -> part of -> object 1 : bench.\nobject 9 : post -> supporting -> object 10 : rail.\nobject 8 : post -> supporting -> object 10 : rail.\nobject 19 : woman -> has -> object 3 : foot.\nobject 12 : shoe -> belongs to -> object 19 : woman.\nobject 19 : woman -> has -> object 3 : foot.\nobject 2 : concrete -> under -> object 1 : bench.\nobject 2 : concrete -> under -> object 1 : bench.\nobject 7 : ocean -> in front of -> object 1 : bench.\nobject 6 : man -> sitting next to -> object 19 : woman.\nobject 6 : man -> cuddling with -> object 19 : woman.\nobject 0 : arm -> around -> object 19 : woman.\nobject 6 : man -> silhouetted with -> object 19 : woman.\nobject 18 : slat -> part of -> object 1 : bench.\n\nRegion Description:\nRegion Description at [0.502, 0.309, 0.892, 0.512] : a man and woman looking at the ocean.\n\nGlobal Caption:\nTwo people sitting on a bench silhouetted against the sea.\nTwo people are sitting on a bench together in front of water.\nThe silhouette of two people sitting on a bench in front of the water.\nA couple is sitting on a bench in front of the water. \nA couple sits on a park bench and watches the water"}
{"question_id": 37, "image": "000000010764.jpg", "category": "ground_conv", "text": "Objects:\nObject 0 : catcher at [0.334, 0.193, 0.756, 0.940].\nObject 1 : field at [0.000, 0.000, 0.998, 0.997].\nObject 2 : glove at [0.660, 0.492, 0.764, 0.674].\nObject 3 : hand at [0.666, 0.498, 0.748, 0.665].\nObject 4 : helmet at [0.472, 0.187, 0.610, 0.444].\nObject 5 : jersey at [0.340, 0.332, 0.556, 0.695].\nObject 6 : line at [0.396, 0.656, 0.560, 0.731].\nObject 7 : lines at [0.866, 0.927, 1.000, 0.997].\nObject 8 : lines at [0.754, 0.837, 0.998, 0.867].\nObject 9 : pads at [0.562, 0.668, 0.634, 0.782].\nObject 10 : pants at [0.336, 0.640, 0.612, 0.858].\nObject 11 : sneakers at [0.406, 0.834, 0.544, 0.946].\nObject 12 : stripe at [0.608, 0.737, 0.998, 0.795].\nObject 13 : wrist band at [0.586, 0.583, 0.604, 0.640].\n\nRelationships:\nobject 0 : catcher -> in -> object 1 : field.\nobject 2 : glove -> on -> object 3 : hand.\nobject 6 : line -> on -> object 10 : pants.\n\nRegion Description:\nRegion Description at [0.546, 0.625, 0.626, 0.801] : The player is wearing knee and leg pads..\nRegion Description at [0.018, 0.665, 0.280, 0.825] : A brown dirt ground surface on a baseball field.\nRegion Description at [0.676, 0.701, 0.974, 0.979] : White chalk lines painted on a baseball field.\nRegion Description at

SYMBOL INDEX (572 symbols across 59 files)

FILE: ferret/conversation.py
  class SeparatorStyle (line 8) | class SeparatorStyle(Enum):
  class Conversation (line 18) | class Conversation:
    method get_prompt (line 33) | def get_prompt(self):
    method append_message (line 110) | def append_message(self, role, message):
    method get_images (line 113) | def get_images(self, return_pil=False):
    method to_gradio_chatbot (line 168) | def to_gradio_chatbot(self):
    method copy (line 202) | def copy(self):
    method dict (line 213) | def dict(self):

FILE: ferret/eval/eval_flickr_entities.py
  function resize_bbox (line 29) | def resize_bbox(box, image_w=None, image_h=None):
  function decode_bbox_from_caption (line 38) | def decode_bbox_from_caption(text, img_w, img_h, verbose=False):
  function are_phrases_similar (line 71) | def are_phrases_similar(phrase1, phrase2):
  function get_sentence_data (line 91) | def get_sentence_data(filename) -> List[Dict[str, Any]]:
  function get_annotations (line 159) | def get_annotations(filename) -> Dict[str, Union[int, List[str], Dict[st...
  function box_area (line 220) | def box_area(boxes: np.array) -> np.array:
  function _box_inter_union (line 239) | def _box_inter_union(boxes1: np.array, boxes2: np.array) -> Tuple[np.arr...
  function box_iou (line 254) | def box_iou(boxes1: np.array, boxes2: np.array) -> np.array:
  function _merge_boxes (line 275) | def _merge_boxes(boxes: List[List[int]]) -> List[List[int]]:
  class RecallTracker (line 287) | class RecallTracker:
    method __init__ (line 290) | def __init__(self, topk: Sequence[int]):
    method add_positive (line 299) | def add_positive(self, k: int, category: str):
    method add_negative (line 306) | def add_negative(self, k: int, category: str):
    method report (line 312) | def report(self) -> Dict[int, Dict[str, float]]:
  class Flickr30kEntitiesRecallEvaluator (line 325) | class Flickr30kEntitiesRecallEvaluator:
    method __init__ (line 326) | def __init__(
    method evaluate (line 391) | def evaluate(self, predictions: List[Dict]):
  class Flickr30kEntitiesRecallEvaluatorFromJsonl (line 461) | class Flickr30kEntitiesRecallEvaluatorFromJsonl(Flickr30kEntitiesRecallE...
    method evaluate (line 462) | def evaluate(self,
    method summarize (line 560) | def summarize(self):

FILE: ferret/eval/eval_gpt_review_3newclass.py
  function get_eval (line 15) | def get_eval(content: str, max_tokens: int):
  function postprocess_answer (line 39) | def postprocess_answer(answer, category):
  function parse_score (line 64) | def parse_score(review):

FILE: ferret/eval/eval_lvis.py
  function get_args (line 19) | def get_args():
  function remove_not_phrases_v2 (line 24) | def remove_not_phrases_v2(text):

FILE: ferret/eval/eval_pope.py
  function evaluate_pope (line 20) | def evaluate_pope(prediction_file, annotation_file):

FILE: ferret/eval/eval_refexp.py
  function resize_bbox (line 29) | def resize_bbox(box, image_w=None, image_h=None):
  function decode_bbox_from_caption (line 38) | def decode_bbox_from_caption(text, img_w, img_h, verbose=False):
  function are_phrases_similar (line 71) | def are_phrases_similar(phrase1, phrase2):
  class RefExpEvaluatorFromJsonl (line 91) | class RefExpEvaluatorFromJsonl(object):
    method __init__ (line 92) | def __init__(self, refexp_gt_path, k=(1, -1), thresh_iou=0.5):
    method summarize (line 102) | def summarize(self,

FILE: ferret/eval/model_flickr.py
  function split_list (line 48) | def split_list(lst, n):
  function get_chunk (line 54) | def get_chunk(lst, n, k):
  function plot_flickr (line 59) | def plot_flickr(img, boxes, entities, mode='pred'):
  function remove_punctuation (line 74) | def remove_punctuation(text: str) -> str:
  function resize_bbox (line 81) | def resize_bbox(box, image_w=None, image_h=None):
  function find_bbox_template (line 90) | def find_bbox_template(text, img_w, img_h):
  class FlickrGrounding (line 112) | class FlickrGrounding(torchvision.datasets.CocoDetection):
    method __init__ (line 113) | def __init__(self, img_folder, ann_file, transforms):
    method __getitem__ (line 118) | def __getitem__(self, idx):
  function eval_model_flickr (line 156) | def eval_model_flickr(args):

FILE: ferret/eval/model_gpt4eval_3newclass.py
  function split_list (line 36) | def split_list(lst, n):
  function get_chunk (line 42) | def get_chunk(lst, n, k):
  function generate_mask_for_feature (line 47) | def generate_mask_for_feature(coor, raw_w, raw_h, mask=None):
  class GPTEval_Data (line 74) | class GPTEval_Data():
    method __init__ (line 75) | def __init__(self, data_path, image_path, args) -> None:
    method ids (line 159) | def ids(self):
    method fetch_data (line 162) | def fetch_data(self, id):
  function eval_model (line 167) | def eval_model(args):

FILE: ferret/eval/model_lvis.py
  function split_list (line 44) | def split_list(lst, n):
  function get_chunk (line 50) | def get_chunk(lst, n, k):
  function generate_mask_for_feature (line 55) | def generate_mask_for_feature(coor, raw_w, raw_h, mask=None):
  class LVISData_V1 (line 82) | class LVISData_V1():
    method __init__ (line 83) | def __init__(self, data_path, image_path, args) -> None:
    method ids (line 156) | def ids(self):
    method fetch_data (line 159) | def fetch_data(self, id):
  function eval_model (line 164) | def eval_model(args):

FILE: ferret/eval/model_point_cls_single_image.py
  function generate_mask_for_feature (line 39) | def generate_mask_for_feature(coor,raw_w, raw_h):
  function eval_model (line 59) | def eval_model(args):

FILE: ferret/eval/model_pope.py
  function split_list (line 47) | def split_list(lst, n):
  function get_chunk (line 52) | def get_chunk(lst, n, k):
  function plot_pope (line 57) | def plot_pope(img, boxes, text):
  function resize_bbox (line 65) | def resize_bbox(box, image_w=None, image_h=None):
  function find_bbox_template_v3 (line 74) | def find_bbox_template_v3(text, img_w, img_h):
  class PopeGrounding (line 99) | class PopeGrounding():
    method __init__ (line 100) | def __init__(self, img_folder, ann_file):
    method __getitem__ (line 107) | def __getitem__(self, idx):
    method ids (line 116) | def ids(self):
  function eval_model_pope (line 120) | def eval_model_pope(args):

FILE: ferret/eval/model_refcoco.py
  function split_list (line 47) | def split_list(lst, n):
  function get_chunk (line 53) | def get_chunk(lst, n, k):
  function plot_refexp (line 58) | def plot_refexp(img, boxes, entities, mode='pred'):
  function remove_punctuation (line 70) | def remove_punctuation(text: str) -> str:
  function resize_bbox (line 77) | def resize_bbox(box, image_w=None, image_h=None):
  function find_bbox_template (line 86) | def find_bbox_template(text, img_w, img_h):
  class RefExpGrounding (line 115) | class RefExpGrounding(torchvision.datasets.CocoDetection):
    method __init__ (line 116) | def __init__(self, img_folder, ann_file, transforms):
    method __getitem__ (line 121) | def __getitem__(self, idx):
  function eval_model_refexp (line 149) | def eval_model_refexp(args):

FILE: ferret/eval/summarize_gpt_review.py
  function parse_args (line 9) | def parse_args():

FILE: ferret/mm_utils.py
  function load_image_from_base64 (line 10) | def load_image_from_base64(image):
  function process_images (line 14) | def process_images(images, image_processor, model_cfg):
  function tokenizer_image_token (line 18) | def tokenizer_image_token(prompt, tokenizer, image_token_index=IMAGE_TOK...
  function get_model_name_from_path (line 40) | def get_model_name_from_path(model_path):
  class KeywordsStoppingCriteria (line 51) | class KeywordsStoppingCriteria(StoppingCriteria):
    method __init__ (line 52) | def __init__(self, keywords, tokenizer, input_ids):
    method __call__ (line 63) | def __call__(self, output_ids: torch.LongTensor, scores: torch.FloatTe...

FILE: ferret/model/apply_delta.py
  function apply_delta (line 35) | def apply_delta(base_model_path, target_model_path, delta_path):

FILE: ferret/model/builder.py
  function load_pretrained_model (line 24) | def load_pretrained_model(model_path, model_base, model_name, load_8bit=...

FILE: ferret/model/consolidate.py
  function consolidate_ckpt (line 13) | def consolidate_ckpt(src_path, dst_path):

FILE: ferret/model/ferret_arch.py
  function rand_sample (line 27) | def rand_sample(x, max_len):
  function rand_sample_repeat (line 34) | def rand_sample_repeat(x, max_len):
  function point_sample (line 45) | def point_sample(input, point_coords, return_dtype, **kwargs):
  function farthest_point_sample (line 73) | def farthest_point_sample(xyz, npoint):
  function index_points (line 96) | def index_points(points, idx):
  function square_distance (line 115) | def square_distance(src, dst):
  function knn_point (line 137) | def knn_point(nsample, xyz, new_xyz):
  class ConvReLULN1D (line 151) | class ConvReLULN1D(nn.Module):
    method __init__ (line 152) | def __init__(self, in_channels, out_channels, kernel_size=1, bias=True):
    method forward (line 161) | def forward(self, x):
  function normal_init (line 171) | def normal_init(module, mean=0, std=1, bias=0):
  class GeoRegionSampler (line 178) | class GeoRegionSampler(nn.Module):
    method __init__ (line 179) | def __init__(self,
    method norm_init_weights (line 215) | def norm_init_weights(self):
    method forward (line 221) | def forward(self,
  class FERRETMetaModel (line 333) | class FERRETMetaModel:
    method __init__ (line 335) | def __init__(self, config):
    method get_vision_tower (line 356) | def get_vision_tower(self):
    method initialize_vision_modules (line 362) | def initialize_vision_modules(self, model_args, fsdp=None, add_region_...
  class FERRETMetaForCausalLM (line 411) | class FERRETMetaForCausalLM(ABC):
    method get_model (line 414) | def get_model(self):
    method get_vision_tower (line 417) | def get_vision_tower(self):
    method encode_images (line 420) | def encode_images(self, images, region_flag=False, region_geo_sampler=...
    method extract_region_feature (line 435) | def extract_region_feature(self, region_feature_map, region_masks, ori...
    method prepare_inputs_labels_for_multimodal (line 471) | def prepare_inputs_labels_for_multimodal(
    method initialize_vision_tokenizer (line 626) | def initialize_vision_tokenizer(self, model_args, tokenizer, add_regio...

FILE: ferret/model/language_model/ferret_llama.py
  class FERRETConfig (line 28) | class FERRETConfig(LlamaConfig):
  class FERRETLlamaModel (line 32) | class FERRETLlamaModel(FERRETMetaModel, LlamaModel):
    method __init__ (line 35) | def __init__(self, config: LlamaConfig):
  class FERRETLlamaForCausalLM (line 39) | class FERRETLlamaForCausalLM(LlamaForCausalLM, FERRETMetaForCausalLM):
    method __init__ (line 42) | def __init__(self, config):
    method get_model (line 51) | def get_model(self):
    method forward (line 54) | def forward(
    method prepare_inputs_for_generation (line 116) | def prepare_inputs_for_generation(

FILE: ferret/model/make_delta.py
  function make_delta (line 35) | def make_delta(base_model_path, target_model_path, delta_path, hub_repo_...

FILE: ferret/model/multimodal_encoder/builder.py
  function build_vision_tower (line 5) | def build_vision_tower(vision_tower_cfg, **kwargs):

FILE: ferret/model/multimodal_encoder/clip_encoder.py
  class CLIPImageProcessor_GIT (line 17) | class CLIPImageProcessor_GIT(CLIPImageProcessor):
    method resize (line 18) | def resize(
  class CLIPVisionTower (line 48) | class CLIPVisionTower(nn.Module):
    method __init__ (line 49) | def __init__(self, vision_tower, args, delay_load=False):
    method load_model (line 63) | def load_model(self, vision_tower_path=None):
    method feature_select (line 74) | def feature_select(self, image_forward_outs):
    method forward (line 85) | def forward(self, images):
    method dummy_feature (line 99) | def dummy_feature(self):
    method dtype (line 103) | def dtype(self):
    method device (line 107) | def device(self):
    method config (line 111) | def config(self):
    method hidden_size (line 118) | def hidden_size(self):
    method num_patches (line 122) | def num_patches(self):

FILE: ferret/model/utils.py
  function auto_upgrade (line 4) | def auto_upgrade(config):

FILE: ferret/serve/controller.py
  class DispatchMethod (line 28) | class DispatchMethod(Enum):
    method from_str (line 33) | def from_str(cls, name):
  class WorkerInfo (line 43) | class WorkerInfo:
  function heart_beat_controller (line 51) | def heart_beat_controller(controller):
  class Controller (line 57) | class Controller:
    method __init__ (line 58) | def __init__(self, dispatch_method: str):
    method register_worker (line 69) | def register_worker(self, worker_name: str, check_heart_beat: bool,
    method get_worker_status (line 88) | def get_worker_status(self, worker_name: str):
    method remove_worker (line 101) | def remove_worker(self, worker_name: str):
    method refresh_all_workers (line 104) | def refresh_all_workers(self):
    method list_models (line 112) | def list_models(self):
    method get_worker_address (line 120) | def get_worker_address(self, model_name: str):
    method receive_heart_beat (line 173) | def receive_heart_beat(self, worker_name: str, queue_length: int):
    method remove_stable_workers_by_expiration (line 183) | def remove_stable_workers_by_expiration(self):
    method worker_api_generate_stream (line 193) | def worker_api_generate_stream(self, params):
    method worker_api_get_status (line 220) | def worker_api_get_status(self):
  function register_worker (line 243) | async def register_worker(request: Request):
  function refresh_all_workers (line 251) | async def refresh_all_workers():
  function list_models (line 256) | async def list_models():
  function get_worker_address (line 262) | async def get_worker_address(request: Request):
  function receive_heart_beat (line 269) | async def receive_heart_beat(request: Request):
  function worker_api_generate_stream (line 277) | async def worker_api_generate_stream(request: Request):
  function worker_api_get_status (line 284) | async def worker_api_get_status(request: Request):

FILE: ferret/serve/gradio_web_server.py
  function generate_mask_for_feature (line 53) | def generate_mask_for_feature(coor, raw_w, raw_h, mask=None):
  function draw_box (line 80) | def draw_box(coor, region_mask, region_ph, img, input_mode):
  function get_conv_log_filename (line 115) | def get_conv_log_filename():
  function get_model_list (line 121) | def get_model_list():
  function load_demo (line 141) | def load_demo(url_params, request: gr.Request):
  function load_demo_refresh_model_list (line 161) | def load_demo_refresh_model_list(request: gr.Request):
  function vote_last_response (line 175) | def vote_last_response(state, vote_type, model_selector, request: gr.Req...
  function upvote_last_response (line 187) | def upvote_last_response(state, model_selector, request: gr.Request):
  function downvote_last_response (line 193) | def downvote_last_response(state, model_selector, request: gr.Request):
  function flag_last_response (line 199) | def flag_last_response(state, model_selector, request: gr.Request):
  function regenerate (line 205) | def regenerate(state, image_process_mode, request: gr.Request):
  function clear_history (line 215) | def clear_history(request: gr.Request):
  function resize_bbox (line 222) | def resize_bbox(box, image_w=None, image_h=None, default_wh=VOCAB_IMAGE_W):
  function show_location (line 231) | def show_location(sketch_pad, chatbot):
  function add_text (line 274) | def add_text(state, text, image_process_mode, original_image, sketch_pad...
  function post_process_code (line 313) | def post_process_code(code):
  function find_indices_in_order (line 324) | def find_indices_in_order(str_list, STR):
  function format_region_prompt (line 337) | def format_region_prompt(prompt, refer_input_state):
  function http_bot (line 349) | def http_bot(state, model_selector, temperature, top_p, max_new_tokens, ...
  class ImageMask (line 519) | class ImageMask(gr.components.Image):
    method __init__ (line 526) | def __init__(self, **kwargs):
    method preprocess (line 529) | def preprocess(self, x):
  function draw (line 533) | def draw(input_mode, input, refer_input_state, refer_text_show, imagebox...
  function build_demo (line 611) | def build_demo(embed_mode):

FILE: ferret/serve/model_worker.py
  function heart_beat_worker (line 44) | def heart_beat_worker(controller):
  class ModelWorker (line 51) | class ModelWorker:
    method __init__ (line 52) | def __init__(self, controller_addr, worker_addr,
    method register_to_controller (line 90) | def register_to_controller(self):
    method send_heart_beat (line 102) | def send_heart_beat(self):
    method get_queue_length (line 123) | def get_queue_length(self):
    method get_status (line 130) | def get_status(self):
    method generate_stream (line 138) | def generate_stream(self, params):
    method generate_stream_gate (line 268) | def generate_stream_gate(self, params):
  function release_model_semaphore (line 298) | def release_model_semaphore(fn=None):
  function generate_stream (line 305) | async def generate_stream(request: Request):
  function get_status (line 321) | async def get_status(request: Request):

FILE: ferret/train/ferret_trainer.py
  function maybe_zero_3 (line 8) | def maybe_zero_3(param, ignore_status=False, name=None):
  function get_mm_adapter_state_maybe_zero_3 (line 22) | def get_mm_adapter_state_maybe_zero_3(named_params, keys_to_match):
  class FERRETTrainer (line 28) | class FERRETTrainer(Trainer):
    method _save_checkpoint (line 30) | def _save_checkpoint(self, model, trial, metrics=None):
    method _save (line 62) | def _save(self, output_dir: Optional[str] = None, state_dict=None):

FILE: ferret/train/llama_flash_attn_monkey_patch.py
  function forward (line 15) | def forward(
  function _prepare_decoder_attention_mask (line 94) | def _prepare_decoder_attention_mask(self, attention_mask, input_shape,
  function replace_llama_attn_with_flash_attn (line 100) | def replace_llama_attn_with_flash_attn():

FILE: ferret/train/train.py
  function rank0_print (line 55) | def rank0_print(*args):
  class ModelArguments (line 61) | class ModelArguments:
  class DataArguments (line 80) | class DataArguments:
  class TrainingArguments (line 99) | class TrainingArguments(transformers.TrainingArguments):
  function maybe_zero_3 (line 132) | def maybe_zero_3(param, ignore_status=False, name=None):
  function get_peft_state_maybe_zero_3 (line 147) | def get_peft_state_maybe_zero_3(named_params, bias):
  function get_peft_state_non_lora_maybe_zero_3 (line 172) | def get_peft_state_non_lora_maybe_zero_3(named_params, require_grad_only...
  function get_mm_adapter_state_maybe_zero_3 (line 180) | def get_mm_adapter_state_maybe_zero_3(named_params, keys_to_match):
  function find_all_linear_names (line 186) | def find_all_linear_names(model):
  function safe_save_model_for_hf_trainer (line 200) | def safe_save_model_for_hf_trainer(trainer: transformers.Trainer,
  function smart_tokenizer_and_embedding_resize (line 255) | def smart_tokenizer_and_embedding_resize(
  function _tokenize_fn (line 280) | def _tokenize_fn(strings: Sequence[str],
  function _mask_targets (line 307) | def _mask_targets(target, tokenized_lens, speakers):
  function _add_speaker_and_signal (line 318) | def _add_speaker_and_signal(header, source, get_conversation=True):
  function preprocess_multimodal (line 339) | def preprocess_multimodal(
  function preprocess_llama_2 (line 366) | def preprocess_llama_2(
  function preprocess_v1 (line 448) | def preprocess_v1(
  function preprocess_plain (line 537) | def preprocess_plain(
  function preprocess (line 559) | def preprocess(
  function extend_list (line 610) | def extend_list(original_list, multiplier):
  function extract_coors (line 629) | def extract_coors(s):
  function regulate_box (line 651) | def regulate_box(box, img_w, img_h):
  class LazySupervisedDataset (line 655) | class LazySupervisedDataset(Dataset):
    method load_vg_object (line 658) | def load_vg_object(self, data_path, image_folder):
    method load_vg_yesno_object (line 666) | def load_vg_yesno_object(self, data_path, image_folder):
    method load_vg_attribute (line 674) | def load_vg_attribute(self, data_path, image_folder):
    method load_vg_relation (line 682) | def load_vg_relation(self, data_path, image_folder):
    method load_vg_region (line 690) | def load_vg_region(self, data_path, image_folder):
    method load_git_instruction (line 698) | def load_git_instruction(self, data_path, image_folder):
    method load_llava (line 706) | def load_llava(self, data_path, image_folder):
    method load_grounded_llava_boxes (line 713) | def load_grounded_llava_boxes(self, data_path, image_folder):
    method load_refexp (line 721) | def load_refexp(self, data_path, image_folder):
    method load_flickr (line 729) | def load_flickr(self, data_path, image_folder):
    method load_objects365 (line 737) | def load_objects365(self, data_path, image_folder):
    method load_cc3m (line 745) | def load_cc3m(self, data_path, image_folder):
    method __init__ (line 752) | def __init__(self, data_path: str,
    method __len__ (line 838) | def __len__(self):
    method get_obj_center (line 841) | def get_obj_center(self, box, ratio_w, ratio_h, std_dev_weight=0.15):
    method sample_point_in_segment (line 867) | def sample_point_in_segment(self, mask, ratio_w, ratio_h, box=None, sa...
    method get_bbox_coor (line 904) | def get_bbox_coor(self, box, ratio_w, ratio_h):
    method generate_mask_for_feature (line 908) | def generate_mask_for_feature(self, coor, box, mask, raw_w, raw_h):
    method __getitem__ (line 949) | def __getitem__(self, i) -> Dict[str, torch.Tensor]:
  class DataCollatorForSupervisedDataset (line 1109) | class DataCollatorForSupervisedDataset(object):
    method __call__ (line 1114) | def __call__(self, instances: Sequence[Dict]) -> Dict[str, torch.Tensor]:
  function make_supervised_data_module (line 1146) | def make_supervised_data_module(tokenizer: transformers.PreTrainedTokeni...
  function train (line 1160) | def train():

FILE: ferret/utils.py
  function build_logger (line 17) | def build_logger(logger_name, logger_filename):
  class StreamToLogger (line 60) | class StreamToLogger(object):
    method __init__ (line 64) | def __init__(self, logger, log_level=logging.INFO):
    method __getattr__ (line 70) | def __getattr__(self, attr):
    method write (line 73) | def write(self, buf):
    method flush (line 87) | def flush(self):
  function disable_torch_init (line 93) | def disable_torch_init():
  function violates_moderation (line 102) | def violates_moderation(text):
  function pretty_print_semaphore (line 123) | def pretty_print_semaphore(semaphore):

FILE: ferretui/ferretui/conversation.py
  class SeparatorStyle (line 10) | class SeparatorStyle(Enum):
  class Conversation (line 21) | class Conversation:
    method get_prompt (line 34) | def get_prompt(self):
    method append_message (line 121) | def append_message(self, role, message):
    method get_images (line 124) | def get_images(self, return_pil=False):
    method to_gradio_chatbot (line 179) | def to_gradio_chatbot(self):
    method copy (line 213) | def copy(self):
    method dict (line 224) | def dict(self):

FILE: ferretui/ferretui/eval/model_UI.py
  function split_list (line 21) | def split_list(lst, n):
  function get_chunk (line 26) | def get_chunk(lst, n, k):
  function generate_mask_for_feature (line 30) | def generate_mask_for_feature(coor, raw_w, raw_h, mask=None):
  function get_task_from_file (line 57) | def get_task_from_file(file):
  function get_bbox_coor (line 70) | def get_bbox_coor(box, ratio_w, ratio_h):
  function get_model_name_from_path (line 73) | def get_model_name_from_path(model_path):
  class UIData (line 81) | class UIData:
    method __init__ (line 82) | def __init__(self, data_path, image_path, args) -> None:
    method ids (line 90) | def ids(self):
    method __getitem__ (line 93) | def __getitem__(self, idx):
  function eval_model (line 129) | def eval_model(args):

FILE: ferretui/ferretui/eval/webpage/script.js
  function text2Markdown (line 35) | function text2Markdown(text) {
  function capitalizeFirstChar (line 41) | function capitalizeFirstChar(str) {
  function updateQuestionSelect (line 48) | function updateQuestionSelect(question_id) {
  function updateModelSelect (line 64) | function updateModelSelect() {
  function populateModels (line 70) | function populateModels(models) {
  function populateQuestions (line 81) | function populateQuestions(questions) {
  function displayQuestion (line 110) | function displayQuestion(index) {
  function displayAnswers (line 116) | function displayAnswers(index) {
  function switchQuestionAndCategory (line 203) | function switchQuestionAndCategory() {
  function updateExpandButtonVisibility (line 226) | function updateExpandButtonVisibility(card) {

FILE: ferretui/ferretui/mm_utils.py
  function select_best_resolution (line 13) | def select_best_resolution(original_size, possible_resolutions):
  function resize_and_pad_image (line 43) | def resize_and_pad_image(image, target_resolution, is_pad=False):
  function divide_to_patches (line 79) | def divide_to_patches(image, patch_size):
  function get_anyres_image_grid_shape (line 101) | def get_anyres_image_grid_shape(image_size, grid_pinpoints, patch_size):
  function process_anyres_image (line 121) | def process_anyres_image(image, processor, grid_pinpoints, image_process...
  function load_image_from_base64 (line 160) | def load_image_from_base64(image):
  function expand2square (line 164) | def expand2square(pil_img, background_color):
  function process_images (line 178) | def process_images(images, image_processor, model_cfg, image_process_fun...
  function tokenizer_image_token (line 198) | def tokenizer_image_token(prompt, tokenizer, image_token_index=IMAGE_TOK...
  function get_model_name_from_path (line 220) | def get_model_name_from_path(model_path):
  class KeywordsStoppingCriteria (line 228) | class KeywordsStoppingCriteria(StoppingCriteria):
    method __init__ (line 229) | def __init__(self, keywords, tokenizer, input_ids):
    method call_for_batch (line 243) | def call_for_batch(self, output_ids: torch.LongTensor, scores: torch.F...
    method __call__ (line 256) | def __call__(self, output_ids: torch.LongTensor, scores: torch.FloatTe...

FILE: ferretui/ferretui/model/apply_delta.py
  function apply_delta (line 13) | def apply_delta(base_model_path, target_model_path, delta_path):

FILE: ferretui/ferretui/model/builder.py
  function load_pretrained_model (line 26) | def load_pretrained_model(model_path, model_base, model_name, use_safete...

FILE: ferretui/ferretui/model/consolidate.py
  function consolidate_ckpt (line 13) | def consolidate_ckpt(src_path, dst_path):

FILE: ferretui/ferretui/model/ferret_arch.py
  function rand_sample (line 36) | def rand_sample(x, max_len):
  function rand_sample_repeat (line 44) | def rand_sample_repeat(x, max_len):
  function point_sample (line 56) | def point_sample(input, point_coords, return_dtype, **kwargs):
  function farthest_point_sample (line 82) | def farthest_point_sample(xyz, npoint):
  function index_points (line 105) | def index_points(points, idx):
  function square_distance (line 124) | def square_distance(src, dst):
  function knn_point (line 146) | def knn_point(nsample, xyz, new_xyz):
  class ConvReLULN1D (line 160) | class ConvReLULN1D(nn.Module):
    method __init__ (line 161) | def __init__(self, in_channels, out_channels, kernel_size=1, bias=True):
    method forward (line 170) | def forward(self, x):
  function normal_init (line 180) | def normal_init(module, mean=0, std=1, bias=0):
  class GeoRegionSampler (line 187) | class GeoRegionSampler(nn.Module):
    method __init__ (line 188) | def __init__(self,
    method norm_init_weights (line 229) | def norm_init_weights(self):
    method forward (line 235) | def forward(self,
  class FerretMetaModel (line 343) | class FerretMetaModel:
    method __init__ (line 345) | def __init__(self, config):
    method get_vision_tower (line 378) | def get_vision_tower(self):
    method initialize_vision_modules (line 384) | def initialize_vision_modules(self, model_args, fsdp=None,
  function unpad_image (line 477) | def unpad_image(tensor, original_size):
  class FerretMetaForCausalLM (line 508) | class FerretMetaForCausalLM(ABC):
    method get_model (line 511) | def get_model(self):
    method get_vision_tower (line 514) | def get_vision_tower(self):
    method encode_images (line 517) | def encode_images(self, images, region_flag=False, region_geo_sampler=...
    method extract_region_feature (line 530) | def extract_region_feature(self, region_feature_map, region_masks, ori...
    method prepare_inputs_labels_for_multimodal (line 565) | def prepare_inputs_labels_for_multimodal(
    method initialize_vision_tokenizer (line 873) | def initialize_vision_tokenizer(self, model_args, tokenizer, add_regio...

FILE: ferretui/ferretui/model/language_model/ferret_gemma.py
  class FerretGemmaConfig (line 34) | class FerretGemmaConfig(GemmaConfig):
  class FerretGemmaModel (line 38) | class FerretGemmaModel(FerretMetaModel, GemmaModel):
    method __init__ (line 41) | def __init__(self, config: GemmaConfig):
  class FerretGemmaForCausalLM (line 45) | class FerretGemmaForCausalLM(GemmaForCausalLM, FerretMetaForCausalLM):
    method __init__ (line 48) | def __init__(self, config):
    method get_model (line 57) | def get_model(self):
    method forward (line 60) | def forward(
    method generate (line 114) | def generate(
    method prepare_inputs_for_generation (line 155) | def prepare_inputs_for_generation(self, input_ids, past_key_values=Non...

FILE: ferretui/ferretui/model/language_model/ferret_llama.py
  class FerretConfig (line 30) | class FerretConfig(LlamaConfig):
  class FerretLlamaModel (line 34) | class FerretLlamaModel(FerretMetaModel, LlamaModel):
    method __init__ (line 37) | def __init__(self, config: LlamaConfig):
  class FerretLlamaForCausalLM (line 41) | class FerretLlamaForCausalLM(LlamaForCausalLM, FerretMetaForCausalLM):
    method __init__ (line 44) | def __init__(self, config):
    method get_model (line 54) | def get_model(self):
    method forward (line 57) | def forward(
    method generate (line 109) | def generate(
    method prepare_inputs_for_generation (line 150) | def prepare_inputs_for_generation(self, input_ids, past_key_values=None,

FILE: ferretui/ferretui/model/language_model/ferret_mpt.py
  class FerretMptConfig (line 25) | class FerretMptConfig(MptConfig):
  class FerretMptModel (line 29) | class FerretMptModel(FerretMetaModel, MptModel):
    method __init__ (line 32) | def __init__(self, config: MptConfig):
    method embed_tokens (line 36) | def embed_tokens(self, x):
  class FerretMptForCausalLM (line 40) | class FerretMptForCausalLM(MptForCausalLM, FerretMetaForCausalLM):
    method __init__ (line 44) | def __init__(self, config):
    method get_model (line 53) | def get_model(self):
    method _set_gradient_checkpointing (line 56) | def _set_gradient_checkpointing(self, module, value=False):
    method forward (line 60) | def forward(
    method prepare_inputs_for_generation (line 87) | def prepare_inputs_for_generation(self, input_ids, past_key_values=Non...

FILE: ferretui/ferretui/model/make_delta.py
  function make_delta (line 13) | def make_delta(base_model_path, target_model_path, delta_path, hub_repo_...

FILE: ferretui/ferretui/model/multimodal_encoder/builder.py
  function build_vision_tower (line 5) | def build_vision_tower(vision_tower_cfg, **kwargs):

FILE: ferretui/ferretui/model/multimodal_encoder/clip_encoder.py
  class CLIPImageProcessor_Ferret (line 19) | class CLIPImageProcessor_Ferret(CLIPImageProcessor):
    method resize (line 20) | def resize(
  class CLIPVisionTower (line 50) | class CLIPVisionTower(nn.Module):
    method __init__ (line 51) | def __init__(self, vision_tower, args, delay_load=False):
    method load_model (line 68) | def load_model(self, device_map=None):
    method feature_select (line 83) | def feature_select(self, image_forward_outs):
    method forward (line 94) | def forward(self, images):
    method dummy_feature (line 108) | def dummy_feature(self):
    method dtype (line 112) | def dtype(self):
    method device (line 116) | def device(self):
    method config (line 120) | def config(self):
    method hidden_size (line 127) | def hidden_size(self):
    method num_patches_per_side (line 131) | def num_patches_per_side(self):
    method num_patches (line 135) | def num_patches(self):
  class CLIPVisionTowerS2 (line 140) | class CLIPVisionTowerS2(CLIPVisionTower):
    method __init__ (line 141) | def __init__(self, vision_tower, args, delay_load=False):
    method load_model (line 161) | def load_model(self, device_map=None):
    method forward_feature (line 176) | def forward_feature(self, images):
    method forward (line 182) | def forward(self, images):
    method hidden_size (line 194) | def hidden_size(self):

FILE: ferretui/ferretui/model/multimodal_projector/builder.py
  class IdentityMap (line 6) | class IdentityMap(nn.Module):
    method __init__ (line 7) | def __init__(self):
    method forward (line 10) | def forward(self, x, *args, **kwargs):
    method config (line 14) | def config(self):
  class SimpleResBlock (line 18) | class SimpleResBlock(nn.Module):
    method __init__ (line 19) | def __init__(self, channels):
    method forward (line 28) | def forward(self, x):
  function build_vision_projector (line 33) | def build_vision_projector(config, delay_load=False, **kwargs):

FILE: ferretui/ferretui/model/utils.py
  function auto_upgrade (line 4) | def auto_upgrade(config):

FILE: ferretui/ferretui/serve/cli.py
  function load_image (line 18) | def load_image(image_file):
  function main (line 27) | def main(args):

FILE: ferretui/ferretui/serve/controller.py
  class DispatchMethod (line 28) | class DispatchMethod(Enum):
    method from_str (line 33) | def from_str(cls, name):
  class WorkerInfo (line 43) | class WorkerInfo:
  function heart_beat_controller (line 51) | def heart_beat_controller(controller):
  class Controller (line 57) | class Controller:
    method __init__ (line 58) | def __init__(self, dispatch_method: str):
    method register_worker (line 69) | def register_worker(self, worker_name: str, check_heart_beat: bool,
    method get_worker_status (line 88) | def get_worker_status(self, worker_name: str):
    method remove_worker (line 101) | def remove_worker(self, worker_name: str):
    method refresh_all_workers (line 104) | def refresh_all_workers(self):
    method list_models (line 112) | def list_models(self):
    method get_worker_address (line 120) | def get_worker_address(self, model_name: str):
    method receive_heart_beat (line 173) | def receive_heart_beat(self, worker_name: str, queue_length: int):
    method remove_stable_workers_by_expiration (line 183) | def remove_stable_workers_by_expiration(self):
    method worker_api_generate_stream (line 193) | def worker_api_generate_stream(self, params):
    method worker_api_get_status (line 220) | def worker_api_get_status(self):
  function register_worker (line 243) | async def register_worker(request: Request):
  function refresh_all_workers (line 251) | async def refresh_all_workers():
  function list_models (line 256) | async def list_models():
  function get_worker_address (line 262) | async def get_worker_address(request: Request):
  function receive_heart_beat (line 269) | async def receive_heart_beat(request: Request):
  function worker_api_generate_stream (line 277) | async def worker_api_generate_stream(request: Request):
  function worker_api_get_status (line 284) | async def worker_api_get_status(request: Request):

FILE: ferretui/ferretui/serve/gradio_web_server.py
  function get_conv_log_filename (line 32) | def get_conv_log_filename():
  function get_model_list (line 38) | def get_model_list():
  function load_demo (line 58) | def load_demo(url_params, request: gr.Request):
  function load_demo_refresh_model_list (line 71) | def load_demo_refresh_model_list(request: gr.Request):
  function vote_last_response (line 82) | def vote_last_response(state, vote_type, model_selector, request: gr.Req...
  function upvote_last_response (line 94) | def upvote_last_response(state, model_selector, request: gr.Request):
  function downvote_last_response (line 100) | def downvote_last_response(state, model_selector, request: gr.Request):
  function flag_last_response (line 106) | def flag_last_response(state, model_selector, request: gr.Request):
  function regenerate (line 112) | def regenerate(state, image_process_mode, request: gr.Request):
  function clear_history (line 122) | def clear_history(request: gr.Request):
  function add_text (line 128) | def add_text(state, text, image, image_process_mode, request: gr.Request):
  function http_bot (line 154) | def http_bot(state, model_selector, temperature, top_p, max_new_tokens, ...
  function build_demo (line 315) | def build_demo(embed_mode, cur_dir=None, concurrency_count=10):

FILE: ferretui/ferretui/serve/model_worker.py
  function heart_beat_worker (line 37) | def heart_beat_worker(controller):
  class ModelWorker (line 44) | class ModelWorker:
    method __init__ (line 45) | def __init__(self, controller_addr, worker_addr,
    method register_to_controller (line 75) | def register_to_controller(self):
    method send_heart_beat (line 87) | def send_heart_beat(self):
    method get_queue_length (line 108) | def get_queue_length(self):
    method get_status (line 115) | def get_status(self):
    method generate_stream (line 123) | def generate_stream(self, params):
    method generate_stream_gate (line 195) | def generate_stream_gate(self, params):
  function release_model_semaphore (line 225) | def release_model_semaphore(fn=None):
  function generate_stream (line 232) | async def generate_stream(request: Request):
  function get_status (line 248) | async def get_status(request: Request):

FILE: ferretui/ferretui/serve/sglang_worker.py
  function heart_beat_worker (line 38) | def heart_beat_worker(controller):
  function pipeline (line 45) | def pipeline(s, prompt, max_tokens):
  class ModelWorker (line 54) | class ModelWorker:
    method __init__ (line 55) | def __init__(self, controller_addr, worker_addr, sgl_endpoint,
    method register_to_controller (line 85) | def register_to_controller(self):
    method send_heart_beat (line 97) | def send_heart_beat(self):
    method get_queue_length (line 118) | def get_queue_length(self):
    method get_status (line 125) | def get_status(self):
    method generate_stream (line 132) | async def generate_stream(self, params):
    method generate_stream_gate (line 172) | async def generate_stream_gate(self, params):
  function release_model_semaphore (line 195) | def release_model_semaphore(fn=None):
  function generate_stream (line 202) | async def generate_stream(request: Request):
  function get_status (line 218) | async def get_status(request: Request):

FILE: ferretui/ferretui/serve/test_message.py
  function main (line 9) | def main():

FILE: ferretui/ferretui/train/ferret_trainer.py
  function get_vision_tower_state_maybe_zero_3 (line 19) | def get_vision_tower_state_maybe_zero_3(named_params, keys_to_match=['']):
  function maybe_zero_3 (line 26) | def maybe_zero_3(param, ignore_status=False, name=None):
  function get_mm_adapter_state_maybe_zero_3 (line 40) | def get_mm_adapter_state_maybe_zero_3(named_params, keys_to_match):
  function split_to_even_chunks (line 46) | def split_to_even_chunks(indices, lengths, num_chunks):
  function get_modality_length_grouped_indices (line 68) | def get_modality_length_grouped_indices(lengths, batch_size, world_size,...
  function get_length_grouped_indices (line 97) | def get_length_grouped_indices(lengths, batch_size, world_size, generato...
  class LengthGroupedSampler (line 108) | class LengthGroupedSampler(Sampler):
    method __init__ (line 114) | def __init__(
    method __len__ (line 131) | def __len__(self):
    method __iter__ (line 134) | def __iter__(self):
  class FerretTrainer (line 141) | class FerretTrainer(Trainer):
    method _get_train_sampler (line 142) | def _get_train_sampler(self) -> Optional[torch.utils.data.Sampler]:
    method create_optimizer (line 157) | def create_optimizer(self):
    method _save_checkpoint (line 283) | def _save_checkpoint(self, model, trial, metrics=None):
    method _save (line 306) | def _save(self, output_dir: Optional[str] = None, state_dict=None):

FILE: ferretui/ferretui/train/llama_flash_attn_monkey_patch.py
  function forward (line 16) | def forward(
  function _prepare_decoder_attention_mask (line 98) | def _prepare_decoder_attention_mask(
  function replace_llama_attn_with_flash_attn (line 105) | def replace_llama_attn_with_flash_attn():

FILE: ferretui/ferretui/train/llama_xformers_attn_monkey_patch.py
  function replace_llama_attn_with_xformers_attn (line 19) | def replace_llama_attn_with_xformers_attn():
  function xformers_forward (line 23) | def xformers_forward(

FILE: ferretui/ferretui/train/train.py
  function rank0_print (line 56) | def rank0_print(*args):
  class ModelArguments (line 66) | class ModelArguments:
  class DataArguments (line 87) | class DataArguments:
  class TrainingArguments (line 104) | class TrainingArguments(transformers.TrainingArguments):
  function maybe_zero_3 (line 145) | def maybe_zero_3(param, ignore_status=False, name=None):
  function get_peft_state_maybe_zero_3 (line 161) | def get_peft_state_maybe_zero_3(named_params, bias):
  function get_peft_state_non_lora_maybe_zero_3 (line 188) | def get_peft_state_non_lora_maybe_zero_3(named_params, require_grad_only...
  function get_mm_adapter_state_maybe_zero_3 (line 197) | def get_mm_adapter_state_maybe_zero_3(named_params, keys_to_match):
  function get_vision_tower_state_maybe_zero_3 (line 205) | def get_vision_tower_state_maybe_zero_3(named_params, keys_to_match=['']):
  function find_all_linear_names (line 213) | def find_all_linear_names(model, qv_proj_only=False):
  function safe_save_model_for_hf_trainer (line 232) | def safe_save_model_for_hf_trainer(trainer: transformers.Trainer,
  function smart_tokenizer_and_embedding_resize (line 275) | def smart_tokenizer_and_embedding_resize(
  function _tokenize_fn (line 300) | def _tokenize_fn(strings: Sequence[str],
  function _mask_targets (line 327) | def _mask_targets(target, tokenized_lens, speakers):
  function _add_speaker_and_signal (line 338) | def _add_speaker_and_signal(header, source, get_conversation=True):
  function preprocess_multimodal (line 359) | def preprocess_multimodal(
  function preprocess_llama_2 (line 387) | def preprocess_llama_2(
  function preprocess_llama3 (line 471) | def preprocess_llama3(
  function preprocess_v1 (line 561) | def preprocess_v1(
  function preprocess_gemma (line 649) | def preprocess_gemma(
  function preprocess_phi3 (line 731) | def preprocess_phi3(
  function preprocess_mpt (line 826) | def preprocess_mpt(
  function preprocess_plain (line 917) | def preprocess_plain(
  function preprocess (line 942) | def preprocess(
  function extend_list (line 1001) | def extend_list(original_list, multiplier):
  function extract_coors (line 1020) | def extract_coors(s):
  function regulate_box (line 1043) | def regulate_box(box, img_w, img_h):
  class LazySupervisedDataset (line 1046) | class LazySupervisedDataset(Dataset):
    method shard_data (line 1049) | def shard_data(self, datas, data_path, ori_counts):
    method load_pretrain (line 1055) | def load_pretrain(self, data_path, image_folder):
    method load_llava_mixed (line 1065) | def load_llava_mixed(self, data_path, image_folder):
    method load_git_instruction (line 1077) | def load_git_instruction(self, data_path, image_folder):
    method load_vg_element (line 1088) | def load_vg_element(self, data_path, image_folder):
    method load_llava_grounded (line 1099) | def load_llava_grounded(self, data_path, image_folder):
    method load_flickr (line 1110) | def load_flickr(self, data_path, image_folder):
    method load_refexp (line 1121) | def load_refexp(self, data_path, image_folder):
    method load_obj365 (line 1132) | def load_obj365(self, data_path, image_folder):
    method load_sharegpt4v (line 1143) | def load_sharegpt4v(self, data_path, image_folder):
    method load_lvisinstruct4v (line 1154) | def load_lvisinstruct4v(self, data_path, image_folder):
    method load_vqa (line 1166) | def load_vqa(self, data_path, image_folder):
    method load_swit (line 1176) | def load_swit(self, data_path, image_folder):
    method load_sharegpt (line 1186) | def load_sharegpt(self, data_path):
    method load_screen2words (line 1195) | def load_screen2words(self, data_path, image_folder):
    method load_widgetcaptions (line 1205) | def load_widgetcaptions(self, data_path, image_folder):
    method load_taperception (line 1216) | def load_taperception(self, data_path, image_folder):
    method load_widget_listing (line 1227) | def load_widget_listing(self, data_path, image_folder):
    method load_ocr (line 1238) | def load_ocr(self, data_path, image_folder):
    method load_find_text (line 1249) | def load_find_text(self, data_path, image_folder):
    method load_icon_recognition (line 1260) | def load_icon_recognition(self, data_path, image_folder):
    method load_find_icons (line 1271) | def load_find_icons(self, data_path, image_folder):
    method load_widget_classification (line 1282) | def load_widget_classification(self, data_path, image_folder):
    method load_find_widget (line 1293) | def load_find_widget(self, data_path, image_folder):
    method load_detailed_description (line 1304) | def load_detailed_description(self, data_path, image_folder):
    method load_conversation_perception (line 1314) | def load_conversation_perception(self, data_path, image_folder):
    method load_conversation_interaction (line 1324) | def load_conversation_interaction(self, data_path, image_folder):
    method load_function (line 1335) | def load_function(self, data_path, image_folder):
    method __init__ (line 1345) | def __init__(self, data_path: str,
    method __len__ (line 1510) | def __len__(self):
    method sync_iter_counts (line 1513) | def sync_iter_counts(self):
    method get_obj_center (line 1530) | def get_obj_center(self, box, ratio_w, ratio_h, std_dev_weight=0.15):
    method sample_point_in_segment (line 1556) | def sample_point_in_segment(self, mask, ratio_w, ratio_h, box=None, sa...
    method get_bbox_coor (line 1592) | def get_bbox_coor(self, box, ratio_w, ratio_h):
    method generate_mask_for_feature (line 1595) | def generate_mask_for_feature(self, coor, box, mask, raw_w, raw_h):
    method lengths (line 1637) | def lengths(self):
    method modality_lengths (line 1646) | def modality_lengths(self):
    method format_unicode_filenames (line 1656) | def format_unicode_filenames(filename):
    method __getitem__ (line 1662) | def __getitem__(self, i) -> Dict[str, torch.Tensor]:
  class DataCollatorForSupervisedDataset (line 1857) | class DataCollatorForSupervisedDataset(object):
    method __call__ (line 1862) | def __call__(self, instances: Sequence[Dict]) -> Dict[str, torch.Tensor]:
  function make_supervised_data_module (line 1896) | def make_supervised_data_module(tokenizer: transformers.PreTrainedTokeni...
  function unfreeze_vit (line 1910) | def unfreeze_vit(vision_tower):
  function format_bytes (line 1915) | def format_bytes(size):
  function train (line 1927) | def train(attn_implementation=None):
  function init_distributed_mode (line 2222) | def init_distributed_mode():

FILE: ferretui/ferretui/utils.py
  function build_logger (line 17) | def build_logger(logger_name, logger_filename):
  class StreamToLogger (line 60) | class StreamToLogger(object):
    method __init__ (line 64) | def __init__(self, logger, log_level=logging.INFO):
    method __getattr__ (line 70) | def __getattr__(self, attr):
    method write (line 73) | def write(self, buf):
    method flush (line 87) | def flush(self):
  function disable_torch_init (line 93) | def disable_torch_init():
  function violates_moderation (line 102) | def violates_moderation(text):
  function pretty_print_semaphore (line 123) | def pretty_print_semaphore(semaphore):

FILE: scripts/extract_geosampler_and_mm_projector.py
  function parse_args (line 39) | def parse_args():

FILE: scripts/verify_equal.py
  function verify_equal (line 15) | def verify_equal(old_model_path, new_model_path):


Condensed preview: 123 files, each showing path, character count, and a content snippet. The full structured content is 9,960K chars.

[
  {
    "path": ".gitignore",
    "chars": 328,
    "preview": "*.egg-info\n*.pyc\nbuild/\n\n# compilation and distribution\n__pycache__\n_ext\n*.so\ndist/\n\n# pytorch/python/numpy formats\n*.pt"
  },
  {
    "path": "CODE_OF_CONDUCT.md",
    "chars": 3357,
    "preview": "# Code of Conduct\n\n## Our Pledge\n\nIn the interest of fostering an open and welcoming environment, we as\ncontributors and"
  },
  {
    "path": "CONTRIBUTING.md",
    "chars": 747,
    "preview": "# Contribution Guide\n\nThanks for your interest in contributing. This project was released to accompany a research paper "
  },
  {
    "path": "EVAL.md",
    "chars": 873,
    "preview": "# Evaluation\nAll evaluation scripts provided usage details/cases in the first several lines of codes. \n\n## Ferret-Bench\n"
  },
  {
    "path": "LICENSE",
    "chars": 2317,
    "preview": "Copyright (C) 2023 Apple Inc. All Rights Reserved.\n\nIMPORTANT:  This Apple software is supplied to you by Apple\nInc. (\"A"
  },
  {
    "path": "README.md",
    "chars": 7323,
    "preview": "<!-- # Project Name\n\nThis software project accompanies the research paper, [Paper title](https://arxiv.org).\n\nBrief desc"
  },
  {
    "path": "experiments/ferret_13b_train.sh",
    "chars": 3084,
    "preview": "#!/usr/bin/env bash\nset -xe\n\nmkdir -p checkpoints\n\necho \"Start Fine-Tuning\"\n# =================== Training ============="
  },
  {
    "path": "experiments/ferret_7b_train.sh",
    "chars": 3057,
    "preview": "#!/usr/bin/env bash\nset -xe\n\nmkdir -p checkpoints\n\n# =================== Training ======================\ndata_path=(\n   "
  },
  {
    "path": "ferret/__init__.py",
    "chars": 42,
    "preview": "from .model import FERRETLlamaForCausalLM\n"
  },
  {
    "path": "ferret/constants.py",
    "chars": 293,
    "preview": "CONTROLLER_HEART_BEAT_EXPIRATION = 30\nWORKER_HEART_BEAT_INTERVAL = 15\n\nLOGDIR = \".\"\n\n# Model Constants\nIGNORE_INDEX = -1"
  },
  {
    "path": "ferret/conversation.py",
    "chars": 11137,
    "preview": "import dataclasses\nfrom enum import auto, Enum\nfrom typing import List, Tuple\n\nVOCAB_IMAGE_W = 1000  # 224\nVOCAB_IMAGE_H"
  },
  {
    "path": "ferret/eval/eval_flickr_entities.py",
    "chars": 23139,
    "preview": "\"\"\"\nUsage:\n\npython ferret/eval/eval_flickr_entities.py \\\n    --prediction_file result_checkpoint-final/flickr_result/fin"
  },
  {
    "path": "ferret/eval/eval_gpt_review_3newclass.py",
    "chars": 6037,
    "preview": "import argparse\nimport json\nimport os\n\nimport openai\nimport time\nimport re\nimport pdb\nfrom tqdm import tqdm\n\nNUM_SECONDS"
  },
  {
    "path": "ferret/eval/eval_lvis.py",
    "chars": 2350,
    "preview": "\"\"\"\nUsage:\n- Eval Prediction:\npython ferret/eval/eval_lvis.py --pred_file=[your generated result by running ferret/eval/"
  },
  {
    "path": "ferret/eval/eval_pope.py",
    "chars": 3361,
    "preview": "\"\"\"\nUsage:\n\npython ferret/eval/eval_pope.py \\\n    --prediction_file final_result/ferret_13b_checkpoint-final/pope_result"
  },
  {
    "path": "ferret/eval/eval_refexp.py",
    "chars": 7771,
    "preview": "\"\"\"\nUsage:\n\npython ferret/eval/eval_refexp.py \\\n    --prediction_file final_result/ferret_13b_checkpoint-final/refexp_re"
  },
  {
    "path": "ferret/eval/ferret_gpt4_data/ground_conv/answer.jsonl",
    "chars": 17968,
    "preview": "{\"question_id\": 0, \"image\": \"000000125472.jpg\", \"category\": \"ground_conv\", \"text\": \"The man [0.201, 0.002, 0.940, 0.758]"
  },
  {
    "path": "ferret/eval/ferret_gpt4_data/ground_conv/context.jsonl",
    "chars": 114765,
    "preview": "{\"question_id\": 0, \"image\": \"000000125472.jpg\", \"category\": \"ground_conv\", \"text\": \"Objects:\\nObject 0 : axle at [0.447,"
  },
  {
    "path": "ferret/eval/ferret_gpt4_data/ground_conv/question.jsonl",
    "chars": 7273,
    "preview": "{\"question_id\": 0, \"image\": \"000000125472.jpg\", \"category\": \"ground_conv\", \"text\": \"What is the man in the image doing a"
  },
  {
    "path": "ferret/eval/ferret_gpt4_data/refer_caption/answer.jsonl",
    "chars": 15912,
    "preview": "{\"question_id\": 0, \"image\": \"000000069138.jpg\", \"category\": \"refer_desc\", \"text\": \"The object is a sign that is placed o"
  },
  {
    "path": "ferret/eval/ferret_gpt4_data/refer_caption/context.jsonl",
    "chars": 120615,
    "preview": "{\"question_id\": 0, \"image\": \"000000069138.jpg\", \"category\": \"refer_desc\", \"text\": \"Objects:\\nObject 0 : arrows at [0.000"
  },
  {
    "path": "ferret/eval/ferret_gpt4_data/refer_caption/question.jsonl",
    "chars": 7525,
    "preview": "{\"question_id\": 0, \"image\": \"000000069138.jpg\", \"category\": \"refer_desc\", \"text\": \"What is the interaction between the o"
  },
  {
    "path": "ferret/eval/ferret_gpt4_data/refer_reason/answer.jsonl",
    "chars": 15006,
    "preview": "{\"question_id\": 0, \"image\": \"000000130566.jpg\", \"category\": \"refer_reason\", \"text\": \"The object is a windshield on the t"
  },
  {
    "path": "ferret/eval/ferret_gpt4_data/refer_reason/context.jsonl",
    "chars": 115444,
    "preview": "{\"question_id\": 0, \"image\": \"000000130566.jpg\", \"category\": \"refer_reason\", \"text\": \"Objects:\\nObject 0 : buds at [0.130"
  },
  {
    "path": "ferret/eval/ferret_gpt4_data/refer_reason/question.jsonl",
    "chars": 6400,
    "preview": "{\"question_id\": 0, \"image\": \"000000130566.jpg\", \"category\": \"refer_reason\", \"text\": \"What might be the purpose of the ob"
  },
  {
    "path": "ferret/eval/ferret_gpt4_data/rule.json",
    "chars": 5228,
    "preview": "{\n    \"refer_desc\":  {\"role\": \"Assistant\", \"prompt\": \"We would like to request your feedback on the performance of two A"
  },
  {
    "path": "ferret/eval/gpt4_eval_script.sh",
    "chars": 2560,
    "preview": "#!/bin/bash\n \nCHECKPOINT_FILE='ferret_ft/final-checkpoint'\n\nCUDA_VISIBLE_DEVICES=0 python -m ferret.eval.model_gpt4eval_"
  },
  {
    "path": "ferret/eval/model_flickr.py",
    "chars": 10278,
    "preview": "\"\"\"\nUsage:\n\n--data_path: path of flickr30k annotation. \n--image_path:  path of flickr30k test images. \n--answers-file: p"
  },
  {
    "path": "ferret/eval/model_gpt4eval_3newclass.py",
    "chars": 12329,
    "preview": "\"\"\"\nUsage:\n\nExample:\nCUDA_VISIBLE_DEVICES=0 python -m ferret.eval.model_gpt4eval_3newclass \\\n    --model-path checkpoint"
  },
  {
    "path": "ferret/eval/model_lvis.py",
    "chars": 12075,
    "preview": "\"\"\"\nUsage:\n--data_path: path of LVIS eval annotation. \n--image_path: path of coco val 2017 images.\n--answers-file: path "
  },
  {
    "path": "ferret/eval/model_point_cls_single_image.py",
    "chars": 7660,
    "preview": "\"\"\"\nUsage:\n- If eval on center point:\nCUDA_VISIBLE_DEVICES=1 python -m ferret.eval.model_point_cls_single_image \\\n    --"
  },
  {
    "path": "ferret/eval/model_pope.py",
    "chars": 7963,
    "preview": "\"\"\"\nUsage:\n--data_path: path of pope annotation. \n--image_path:  path of coco2014 val images. \n--answers-file: path of o"
  },
  {
    "path": "ferret/eval/model_refcoco.py",
    "chars": 10018,
    "preview": "\"\"\"\nUsage:\n--data_path: path of refcoco annotation. \n--image_path:  path of refcoco images. \n--answers-file: path of out"
  },
  {
    "path": "ferret/eval/summarize_gpt_review.py",
    "chars": 2381,
    "preview": "import json\nimport os\nfrom collections import defaultdict\n\nimport numpy as np\n\nimport argparse\n\ndef parse_args():\n    pa"
  },
  {
    "path": "ferret/mm_utils.py",
    "chars": 2817,
    "preview": "from PIL import Image\nfrom io import BytesIO\nimport base64\n\nimport torch\nfrom transformers import StoppingCriteria\nfrom "
  },
  {
    "path": "ferret/model/__init__.py",
    "chars": 78,
    "preview": "from .language_model.ferret_llama import FERRETLlamaForCausalLM, FERRETConfig\n"
  },
  {
    "path": "ferret/model/apply_delta.py",
    "chars": 3399,
    "preview": "\"\"\"\nUsage:\n# 7B\npython3 -m ferret.model.apply_delta \\\n    --base ./model/vicuna-7b-v1-3 \\\n    --target ./model/ferret-7b"
  },
  {
    "path": "ferret/model/builder.py",
    "chars": 7300,
    "preview": "#    Licensed under the Apache License, Version 2.0 (the \"License\");\n#    you may not use this file except in compliance"
  },
  {
    "path": "ferret/model/consolidate.py",
    "chars": 916,
    "preview": "\"\"\"\nUsage:\npython3 -m llava.model.consolidate --src ~/model_weights/llava-7b --dst ~/model_weights/llava-7b_consolidate\n"
  },
  {
    "path": "ferret/model/ferret_arch.py",
    "chars": 34334,
    "preview": "#    Licensed under the Apache License, Version 2.0 (the \"License\");\n#    you may not use this file except in compliance"
  },
  {
    "path": "ferret/model/language_model/ferret_llama.py",
    "chars": 5315,
    "preview": "#    Licensed under the Apache License, Version 2.0 (the \"License\");\n#    you may not use this file except in compliance"
  },
  {
    "path": "ferret/model/make_delta.py",
    "chars": 3839,
    "preview": "\"\"\"\nUsage:\n# 7B\npython3 -m ferret.model.make_delta \\\n    --base ./model/vicuna-7b-v1-3 \\\n    --target ./checkpoints/ferr"
  },
  {
    "path": "ferret/model/multimodal_encoder/builder.py",
    "chars": 524,
    "preview": "import os\nfrom .clip_encoder import CLIPVisionTower\n\n\ndef build_vision_tower(vision_tower_cfg, **kwargs):\n    vision_tow"
  },
  {
    "path": "ferret/model/multimodal_encoder/clip_encoder.py",
    "chars": 4992,
    "preview": "import torch\nimport torch.nn as nn\n\nfrom transformers import CLIPVisionModel, CLIPImageProcessor, CLIPVisionConfig\n# Add"
  },
  {
    "path": "ferret/model/utils.py",
    "chars": 928,
    "preview": "from transformers import AutoConfig\n\n\ndef auto_upgrade(config):\n    cfg = AutoConfig.from_pretrained(config)\n    if 'lla"
  },
  {
    "path": "ferret/serve/__init__.py",
    "chars": 0,
    "preview": ""
  },
  {
    "path": "ferret/serve/controller.py",
    "chars": 9938,
    "preview": "\"\"\"\nA controller manages distributed workers.\nIt sends worker addresses to clients.\n\"\"\"\nimport argparse\nimport asyncio\ni"
  },
  {
    "path": "ferret/serve/dejavu/.uuid",
    "chars": 36,
    "preview": "06a92cd7-c698-4e59-b980-58e4bc162946"
  },
  {
    "path": "ferret/serve/gradio_css.py",
    "chars": 2717,
    "preview": "code_highlight_css = (\n\"\"\"\n#chatbot .hll { background-color: #ffffcc }\n#chatbot .c { color: #408080; font-style: italic "
  },
  {
    "path": "ferret/serve/gradio_web_server.py",
    "chars": 33541,
    "preview": "'''\nUsage:\n\npython -m ferret.serve.gradio_web_server --controller http://localhost:10000 --add_region_feature\n'''\nimport"
  },
  {
    "path": "ferret/serve/model_worker.py",
    "chars": 14002,
    "preview": "\"\"\"\nA model worker executes the model.\nUsage:\n\nCUDA_VISIBLE_DEVICES=0 python -m ferret.serve.model_worker --host 0.0.0.0"
  },
  {
    "path": "ferret/serve/register_worker.py",
    "chars": 734,
    "preview": "\"\"\"\nManually register workers.\n\nUsage:\npython3 -m fastchat.serve.register_worker --controller http://localhost:21001 --w"
  },
  {
    "path": "ferret/train/ferret_trainer.py",
    "chars": 2875,
    "preview": "import os\nimport torch\n\nfrom transformers import Trainer\nfrom typing import Optional\n\n\ndef maybe_zero_3(param, ignore_st"
  },
  {
    "path": "ferret/train/llama_flash_attn_monkey_patch.py",
    "chars": 4376,
    "preview": "# Adopted from https://github.com/lm-sys/FastChat. Below is the original copyright:\nfrom typing import List, Optional, T"
  },
  {
    "path": "ferret/train/train.py",
    "chars": 61051,
    "preview": "# Adopted from https://github.com/lm-sys/FastChat. Below is the original copyright:\n# Adopted from tatsu-lab@stanford_al"
  },
  {
    "path": "ferret/train/train_mem.py",
    "chars": 500,
    "preview": "# Adopted from https://github.com/lm-sys/FastChat. Below is the original copyright:\n# Adopted from tatsu-lab@stanford_al"
  },
  {
    "path": "ferret/utils.py",
    "chars": 3986,
    "preview": "import datetime\nimport logging\nimport logging.handlers\nimport os\nimport sys\n\nimport requests\n\nfrom ferret.constants impo"
  },
  {
    "path": "ferretui/README.md",
    "chars": 4518,
    "preview": "# <img src=\"figs/ferretui_icon.png\" alt=\"ferretui icon\" width=\"40\" height=\"45\"> Ferret-UI: Grounded Mobile UI Understand"
  },
  {
    "path": "ferretui/ferretui/__init__.py",
    "chars": 42,
    "preview": "from .model import FerretLlamaForCausalLM\n"
  },
  {
    "path": "ferretui/ferretui/constants.py",
    "chars": 999,
    "preview": "CONTROLLER_HEART_BEAT_EXPIRATION = 30\nWORKER_HEART_BEAT_INTERVAL = 15\n\nLOGDIR = \".\"\n\n# Model Constants\nIGNORE_INDEX = -1"
  },
  {
    "path": "ferretui/ferretui/conversation.py",
    "chars": 20305,
    "preview": "import dataclasses\nfrom enum import auto, Enum\nfrom typing import List, Tuple\nimport base64\nfrom io import BytesIO\nfrom "
  },
  {
    "path": "ferretui/ferretui/eval/model_UI.py",
    "chars": 10784,
    "preview": "import argparse\nimport torch\nimport os\nimport json\nfrom tqdm import tqdm\n\nfrom ferretui.constants import IMAGE_TOKEN_IND"
  },
  {
    "path": "ferretui/ferretui/eval/table/answer/answer_alpaca-13b.jsonl",
    "chars": 57071,
    "preview": "{\"question_id\": 1, \"text\": \"Improving time management skills involves setting priorities, breaking tasks into smaller ch"
  },
  {
    "path": "ferretui/ferretui/eval/table/answer/answer_bard.jsonl",
    "chars": 112274,
    "preview": "{\"answer_id\": \"3oW4JY265ZPJGTYi2CgRYF\", \"model_id\": \"bard:20230327\", \"question_id\": 1, \"text\": \"Here are some tips on ho"
  },
  {
    "path": "ferretui/ferretui/eval/table/answer/answer_gpt35.jsonl",
    "chars": 107603,
    "preview": "{\"answer_id\": \"BZGowHM7L3RvtWRktKZjLT\", \"model_id\": \"gpt-3.5-turbo:20230327\", \"question_id\": 1, \"text\": \"Here are some t"
  },
  {
    "path": "ferretui/ferretui/eval/table/answer/answer_llama-13b.jsonl",
    "chars": 76353,
    "preview": "{\"answer_id\": \"J3UA6eGXGyFeUGqGpP3g34\", \"model_id\": \"llama-13b:v1\", \"question_id\": 1, \"text\": \"The following are some st"
  },
  {
    "path": "ferretui/ferretui/eval/table/answer/answer_vicuna-13b.jsonl",
    "chars": 131904,
    "preview": "{\"answer_id\": \"cV4zXygaNP6CXEsgdHMEqz\", \"model_id\": \"vicuna-13b:20230322-clean-lang\", \"question_id\": 1, \"text\": \"Improvi"
  },
  {
    "path": "ferretui/ferretui/eval/table/caps_boxes_coco2014_val_80.jsonl",
    "chars": 58574,
    "preview": "{\"id\": \"000000296284\", \"image\": \"000000296284.jpg\", \"captions\": [\"A donut shop is full of different flavors of donuts.\","
  },
  {
    "path": "ferretui/ferretui/eval/table/model.jsonl",
    "chars": 681,
    "preview": "{\"model_id\": \"vicuna-13b:20230322-clean-lang\", \"model_name\": \"vicuna-13b\", \"model_version\": \"20230322-clean-lang\", \"mode"
  },
  {
    "path": "ferretui/ferretui/eval/table/prompt.jsonl",
    "chars": 5129,
    "preview": "{\"prompt_id\": 1, \"system_prompt\": \"You are a helpful and precise assistant for checking the quality of the answer.\", \"pr"
  },
  {
    "path": "ferretui/ferretui/eval/table/question.jsonl",
    "chars": 12885,
    "preview": "{\"question_id\": 1, \"text\": \"How can I improve my time management skills?\", \"category\": \"generic\"}\n{\"question_id\": 2, \"te"
  },
  {
    "path": "ferretui/ferretui/eval/table/results/test_sqa_llava_13b_v0.json",
    "chars": 3950324,
    "preview": "{\n  \"acc\": 90.8983730252299,\n  \"correct\": 3855,\n  \"count\": 4241,\n  \"results\": {\n    \"4\": 1,\n    \"5\": 1,\n    \"11\": 1,\n   "
  },
  {
    "path": "ferretui/ferretui/eval/table/results/test_sqa_llava_lcs_558k_sqa_12e_vicuna_v1_3_13b.json",
    "chars": 3830902,
    "preview": "{\n  \"acc\": 91.08700778118369,\n  \"correct\": 3863,\n  \"count\": 4241,\n  \"results\": {\n    \"4\": 1,\n    \"5\": 1,\n    \"11\": 1,\n  "
  },
  {
    "path": "ferretui/ferretui/eval/table/review/review_alpaca-13b_vicuna-13b.jsonl",
    "chars": 73131,
    "preview": "{\"review_id\": \"QM5m5nnioWr8M2LFHsaQvu\", \"question_id\": 1, \"answer1_id\": \"kEL9ifUHDeYuAXzevje2se\", \"answer2_id\": \"cV4zXyg"
  },
  {
    "path": "ferretui/ferretui/eval/table/review/review_bard_vicuna-13b.jsonl",
    "chars": 73145,
    "preview": "{\"review_id\": \"4CeMvEQyE6fKMJwvSLY3P4\", \"question_id\": 1, \"answer1_id\": \"3oW4JY265ZPJGTYi2CgRYF\", \"answer2_id\": \"cV4zXyg"
  },
  {
    "path": "ferretui/ferretui/eval/table/review/review_gpt35_vicuna-13b.jsonl",
    "chars": 73399,
    "preview": "{\"review_id\": \"jyhS7AFj2mrFNqoRXQJDPS\", \"question_id\": 1, \"answer1_id\": \"BZGowHM7L3RvtWRktKZjLT\", \"answer2_id\": \"cV4zXyg"
  },
  {
    "path": "ferretui/ferretui/eval/table/review/review_llama-13b_vicuna-13b.jsonl",
    "chars": 67249,
    "preview": "{\"review_id\": \"WFp5i5yjjFethrgugKTDmX\", \"question_id\": 1, \"answer1_id\": \"J3UA6eGXGyFeUGqGpP3g34\", \"answer2_id\": \"cV4zXyg"
  },
  {
    "path": "ferretui/ferretui/eval/table/reviewer.jsonl",
    "chars": 604,
    "preview": "{\"reviewer_id\": \"gpt-4-0328-default\", \"prompt_id\": 1, \"metadata\": {\"temperature\": 0.2, \"max_tokens\": 1024}, \"description"
  },
  {
    "path": "ferretui/ferretui/eval/table/rule.json",
    "chars": 9098,
    "preview": "{\n    \"coding\": {\"role\": \"Assistant\", \"prompt\": \"Your task is to evaluate the coding abilities of the above two assistan"
  },
  {
    "path": "ferretui/ferretui/eval/webpage/index.html",
    "chars": 7664,
    "preview": "<!DOCTYPE html>\n<html lang=\"en\">\n<head>\n    <meta charset=\"UTF-8\">\n    <meta name=\"viewport\" content=\"width=device-width"
  },
  {
    "path": "ferretui/ferretui/eval/webpage/script.js",
    "chars": 9967,
    "preview": "// Description: Script for the evaluation webpage.\n\nlet currentQuestionIndex = 1;\n\n// Store the model name mapping for l"
  },
  {
    "path": "ferretui/ferretui/eval/webpage/styles.css",
    "chars": 1822,
    "preview": "body {\n    font-family: 'Segoe UI', Tahoma, Geneva, Verdana, sans-serif;\n    background-color: #f8f9fa;\n}\n\n.navbar-dark "
  },
  {
    "path": "ferretui/ferretui/mm_utils.py",
    "chars": 10519,
    "preview": "from PIL import Image\nfrom io import BytesIO\nimport base64\nimport torch\nimport math\nimport ast\nfrom typing import Option"
  },
  {
    "path": "ferretui/ferretui/model/__init__.py",
    "chars": 296,
    "preview": "try:\n    from .language_model.ferret_llama import FerretLlamaForCausalLM, FerretConfig\n    from .language_model.ferret_m"
  },
  {
    "path": "ferretui/ferretui/model/apply_delta.py",
    "chars": 1961,
    "preview": "\"\"\"\nUsage:\npython3 -m fastchat.model.apply_delta --base ~/model_weights/llama-7b --target ~/model_weights/vicuna-7b --de"
  },
  {
    "path": "ferretui/ferretui/model/builder.py",
    "chars": 8886,
    "preview": "#    Copyright 2023 Haotian Liu\n#\n#    Licensed under the Apache License, Version 2.0 (the \"License\");\n#    you may not "
  },
  {
    "path": "ferretui/ferretui/model/consolidate.py",
    "chars": 920,
    "preview": "\"\"\"\nUsage:\npython3 -m llava.model.consolidate --src ~/model_weights/llava-7b --dst ~/model_weights/llava-7b_consolidate\n"
  },
  {
    "path": "ferretui/ferretui/model/ferret_arch.py",
    "chars": 49202,
    "preview": "#    Copyright 2023 Haotian Liu\n#\n#    Licensed under the Apache License, Version 2.0 (the \"License\");\n#    you may not "
  },
  {
    "path": "ferretui/ferretui/model/language_model/ferret_gemma.py",
    "chars": 5888,
    "preview": "#    Copyright 2023 Haotian Liu\n#\n#    Licensed under the Apache License, Version 2.0 (the \"License\");\n#    you may not "
  },
  {
    "path": "ferretui/ferretui/model/language_model/ferret_llama.py",
    "chars": 5716,
    "preview": "#    Copyright 2023 Haotian Liu\n#\n#    Licensed under the Apache License, Version 2.0 (the \"License\");\n#    you may not "
  },
  {
    "path": "ferretui/ferretui/model/language_model/ferret_mpt.py",
    "chars": 3508,
    "preview": "#    Copyright 2023 Haotian Liu\n#\n#    Licensed under the Apache License, Version 2.0 (the \"License\");\n#    you may not "
  },
  {
    "path": "ferretui/ferretui/model/make_delta.py",
    "chars": 2260,
    "preview": "\"\"\"\nUsage:\npython3 -m llava.model.make_delta --base ~/model_weights/llama-7b --target ~/model_weights/llava-7b --delta ~"
  },
  {
    "path": "ferretui/ferretui/model/multimodal_encoder/builder.py",
    "chars": 748,
    "preview": "import os\nfrom .clip_encoder import CLIPVisionTower, CLIPVisionTowerS2\n\n\ndef build_vision_tower(vision_tower_cfg, **kwar"
  },
  {
    "path": "ferretui/ferretui/model/multimodal_encoder/clip_encoder.py",
    "chars": 8087,
    "preview": "import torch\nimport torch.nn as nn\n\nfrom transformers import CLIPVisionModel, CLIPImageProcessor, CLIPVisionConfig\n\n# Ad"
  },
  {
    "path": "ferretui/ferretui/model/multimodal_projector/builder.py",
    "chars": 1437,
    "preview": "import torch\nimport torch.nn as nn\nimport re\n\n\nclass IdentityMap(nn.Module):\n    def __init__(self):\n        super().__i"
  },
  {
    "path": "ferretui/ferretui/model/utils.py",
    "chars": 928,
    "preview": "from transformers import AutoConfig\n\n\ndef auto_upgrade(config):\n    cfg = AutoConfig.from_pretrained(config)\n    if 'lla"
  },
  {
    "path": "ferretui/ferretui/serve/__init__.py",
    "chars": 0,
    "preview": ""
  },
  {
    "path": "ferretui/ferretui/serve/cli.py",
    "chars": 4733,
    "preview": "import argparse\nimport torch\n\nfrom ferretui.constants import IMAGE_TOKEN_INDEX, DEFAULT_IMAGE_TOKEN, DEFAULT_IM_START_TO"
  },
  {
    "path": "ferretui/ferretui/serve/controller.py",
    "chars": 9955,
    "preview": "\"\"\"\nA controller manages distributed workers.\nIt sends worker addresses to clients.\n\"\"\"\nimport argparse\nimport asyncio\ni"
  },
  {
    "path": "ferretui/ferretui/serve/gradio_web_server.py",
    "chars": 18874,
    "preview": "import argparse\nimport datetime\nimport json\nimport os\nimport time\n\nimport gradio as gr\nimport requests\n\nfrom ferretui.co"
  },
  {
    "path": "ferretui/ferretui/serve/model_worker.py",
    "chars": 11215,
    "preview": "\"\"\"\nA model worker executes the model.\n\"\"\"\nimport argparse\nimport asyncio\nimport json\nimport time\nimport threading\nimpor"
  },
  {
    "path": "ferretui/ferretui/serve/register_worker.py",
    "chars": 734,
    "preview": "\"\"\"\nManually register workers.\n\nUsage:\npython3 -m fastchat.serve.register_worker --controller http://localhost:21001 --w"
  },
  {
    "path": "ferretui/ferretui/serve/sglang_worker.py",
    "chars": 8714,
    "preview": "\"\"\"\nA model worker executes the model.\n\"\"\"\nimport argparse\nimport asyncio\nfrom concurrent.futures import ThreadPoolExecu"
  },
  {
    "path": "ferretui/ferretui/serve/test_message.py",
    "chars": 2025,
    "preview": "import argparse\nimport json\n\nimport requests\n\nfrom ferretui.conversation import default_conversation\n\n\ndef main():\n    i"
  },
  {
    "path": "ferretui/ferretui/train/ferret_trainer.py",
    "chars": 14347,
    "preview": "import os\nimport torch\nimport torch.nn as nn\n\nfrom torch.utils.data import Sampler\n\nfrom transformers import Trainer\nfro"
  },
  {
    "path": "ferretui/ferretui/train/llama_flash_attn_monkey_patch.py",
    "chars": 4404,
    "preview": "from typing import Optional, Tuple\nimport warnings\n\nimport torch\n\nimport transformers\nfrom transformers.models.llama.mod"
  },
  {
    "path": "ferretui/ferretui/train/llama_xformers_attn_monkey_patch.py",
    "chars": 4916,
    "preview": "\"\"\"\nDirectly copied the code from https://raw.githubusercontent.com/oobabooga/text-generation-webui/main/modules/llama_a"
  },
  {
    "path": "ferretui/ferretui/train/train.py",
    "chars": 96249,
    "preview": "# Adopted from https://github.com/lm-sys/FastChat. Below is the original copyright:\n# Adopted from tatsu-lab@stanford_al"
  },
  {
    "path": "ferretui/ferretui/train/train_mem.py",
    "chars": 118,
    "preview": "from ferretui.train.train import train\n\nif __name__ == \"__main__\":\n    train(attn_implementation=\"flash_attention_2\")\n"
  },
  {
    "path": "ferretui/ferretui/train/train_xformers.py",
    "chars": 372,
    "preview": "# Make it more memory efficient by monkey patching the LLaMA model with xformers attention.\n\n# Need to call this before "
  },
  {
    "path": "ferretui/ferretui/utils.py",
    "chars": 4006,
    "preview": "import datetime\nimport logging\nimport logging.handlers\nimport os\nimport sys\n\nimport requests\n\nfrom ferretui.constants im"
  },
  {
    "path": "ferretui/playground/sample_data/eval_data_example_0_box_in.json",
    "chars": 1538,
    "preview": "[\n    {\n        \"id\": 0,\n        \"image\": \"appstore_reminders.png\",\n        \"image_h\": 2532,\n        \"image_w\": 1170,\n  "
  },
  {
    "path": "ferretui/playground/sample_data/eval_data_example_1_no_box_in.json",
    "chars": 580,
    "preview": "[\n    {\n        \"id\": 0,\n        \"image\": \"appstore_reminders.png\",\n        \"image_h\": 2532,\n        \"image_w\": 1170,\n  "
  },
  {
    "path": "ferretui/playground/sample_data/train_data_example.json",
    "chars": 1901,
    "preview": "[\n    {\n        \"id\": 0,\n        \"image\": \"appstore_reminders.png\",\n        \"image_h\": 2532,\n        \"image_w\": 1170,\n  "
  },
  {
    "path": "ferretui/pyproject.toml",
    "chars": 1644,
    "preview": "[build-system]\nrequires = [\"setuptools>=61.0\"]\nbuild-backend = \"setuptools.build_meta\"\n\n[project]\nname = \"ferretui\"\nvers"
  },
  {
    "path": "ferretui/scripts/eval/eval_UI.sh",
    "chars": 1031,
    "preview": "#!/bin/bash\nset -xe\n\nmodel_path='/path_to_Ferret-UI_model'\n\n# Example inference for referring tasks (ie bbox in input)\nC"
  },
  {
    "path": "ferretui/scripts/train/train_UI.sh",
    "chars": 1570,
    "preview": "#!/bin/bash\nset -xe\n\ntorchrun --nnodes 1 --nproc_per_node 8 ferretui/train/train_mem.py \\\n    --deepspeed ./scripts/zero"
  },
  {
    "path": "ferretui/scripts/zero2.json",
    "chars": 556,
    "preview": "{\n    \"fp16\": {\n        \"enabled\": \"auto\",\n        \"loss_scale\": 0,\n        \"loss_scale_window\": 1000,\n        \"initial_"
  },
  {
    "path": "ferretui/scripts/zero3.json",
    "chars": 801,
    "preview": "{\n    \"fp16\": {\n        \"enabled\": \"auto\",\n        \"loss_scale\": 0,\n        \"loss_scale_window\": 1000,\n        \"initial_"
  },
  {
    "path": "ferretui/scripts/zero3_offload.json",
    "chars": 1279,
    "preview": "{\n  \"fp16\": {\n    \"enabled\": \"auto\",\n    \"loss_scale\": 0,\n    \"loss_scale_window\": 1000,\n    \"initial_scale_power\": 16,\n"
  },
  {
    "path": "pyproject.toml",
    "chars": 1221,
    "preview": "[build-system]\nrequires = [\"setuptools>=61.0\"]\nbuild-backend = \"setuptools.build_meta\"\n\n[project]\nname = \"ferret\"\nversio"
  },
  {
    "path": "scripts/extract_geosampler_and_mm_projector.py",
    "chars": 3185,
    "preview": "\"\"\"\nUsage:\n\n# 7B\nTo extract region_geo_sampler:\npython misc/extract_geosampler_and_mm_projector.py \\\n    --keys_to_match"
  },
  {
    "path": "scripts/verify_equal.py",
    "chars": 1837,
    "preview": "\"\"\"\nUsage:\npython3 misc/verify_equal.py \\\n    --orig-model-path ./checkpoints/ferret_ft_clipL336_vicunaV1-3-7b_3Ep_dataV"
  }
]
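
The entries above share one shape: `path`, `chars`, and `preview`. As a minimal sketch, assuming the array has been saved or pasted as valid JSON, the character counts can be totalled per top-level directory. The function name `chars_by_top_level` and the inline `sample` are illustrative assumptions, not part of the repository or the extraction tool. The four sample entries reuse paths and counts from the listing above, with previews shortened.

```python
import json
from collections import Counter

# Illustrative sample in the same shape as the condensed preview.
# Paths and "chars" values are copied from the listing; previews are shortened.
sample = """
[
  {"path": ".gitignore", "chars": 328, "preview": "*.egg-info"},
  {"path": "ferret/constants.py", "chars": 293, "preview": "CONTROLLER_HEART_BEAT_EXPIRATION = 30"},
  {"path": "ferret/serve/__init__.py", "chars": 0, "preview": ""},
  {"path": "ferretui/ferretui/train/train_mem.py", "chars": 118, "preview": "from ferretui.train.train import train"}
]
"""


def chars_by_top_level(entries):
    """Sum the "chars" field per top-level path component."""
    totals = Counter()
    for entry in entries:
        # Root-level files (no "/") are keyed by their own name.
        top = entry["path"].split("/", 1)[0]
        totals[top] += entry["chars"]
    return totals


entries = json.loads(sample)
totals = chars_by_top_level(entries)
for top, total in totals.most_common():
    print(f"{top}: {total} chars")
```

To run this against the real data, replace `sample` with the contents of the full JSON array. The grouping key is only the first path component, so `ferret/` and `ferretui/` are counted separately.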

About this extraction

This page contains the full source code of the apple/ml-ferret GitHub repository, extracted and formatted as plain text for AI agents and large language models (LLMs). The extraction includes 123 files (9.3 MB), approximately 2.4M tokens, and a symbol index with 572 extracted functions, classes, methods, constants, and types.

Extracted by GitExtract, a free GitHub repo to text converter for AI. Built by Nikandr Surkov.
