Repository: wayveai/LingoQA
Branch: main
Commit: 39d86b14b681
Files: 7
Total size: 20.8 KB

Directory structure:
gitextract_h_nus89n/
├── LICENCE
├── README.md
├── benchmark/
│   ├── constants.py
│   ├── demo.py
│   ├── evaluate.py
│   └── judge.py
└── requirements.txt

================================================
FILE CONTENTS
================================================

================================================
FILE: LICENCE
================================================
Wayve Dataset Licence Agreement for Non-Commercial Use (February 2024)

The Dataset (as defined below) is being made available by Wayve Technologies Ltd, a company incorporated in England and Wales with company number 10924127 (“Wayve”) to you (“You”) under the terms and conditions below. By downloading or using the Dataset, You acknowledge that You have read these terms and conditions (“Terms”), understand them, and agree to be bound by the Terms. If Your institution or another entity exercises the rights granted under these Terms by virtue of Your download or use of the Dataset, You represent and warrant that: (a) You are authorised to agree to these Terms on behalf of the institution or other entity; or (b) You have confirmed that a person with authority to agree to these Terms on behalf of the institution or other entity has already so agreed on behalf of the institution or other entity. If You do not agree with the Terms or cannot warrant the above, You must not download and/or use the Dataset.

In these Terms, the following expressions shall have the following meanings:

“Dataset” means Wayve’s LingoQA open dataset available at https://github.com/wayveai/LingoQA which has been developed and is owned by Wayve, and any computer code (whether in source or object code form) or other materials that may be made available by Wayve from time to time in conjunction with the dataset for the purposes of enabling You to benchmark or evaluate Your use of the dataset.
“Derivative IP" means any derivative work of the Dataset, any other work made or developed using the Dataset, and any invention conceived or reduced to practice, directly or indirectly, through the use of the Dataset. “Intellectual Property Rights” means patents, rights in inventions, copyright and neighbouring and related rights, rights to use and protect the confidentiality of confidential information (including, but not limited to know-how and trade secrets), trade marks, goodwill, service marks, trade names, design rights, rights in get-up and trade dress, database rights, utility models, domain names, business names, rights in computer software, the right to sue for infringement, unfair competition and passing off, all similar rights of whatever nature wherever in the world arising, in each case, whether registered or not. “Non-commercial Use" means use for academic or scientific research, teaching, scientific publication relating to academic or applied research, or personal experimentation. Non-commercial Use expressly excludes purposes intended for or directed towards commercial advantage or monetary compensation, whether directly or indirectly through sales, advertising revenue, subscriptions, through promoting commercial activities or developing or providing goods or services. “Wayve" means Wayve Technologies Ltd, together with any and all of its affiliates from time to time. “You" means the individual, institution or other entity exercising the rights granted under these Terms. In consideration of Your agreement to and compliance with these Terms, Wayve grants to You a non-exclusive, non-transferrable, non-sublicensable, royalty-free, revocable, personal licence for the duration of these Terms to use the Dataset for Non-commercial Purposes only, subject to the following conditions: Ownership: Wayve retains all right, title, and interest in and to any Dataset provided to You under these Terms. 
Except as expressly set out in these Terms, no Intellectual Property Rights or other rights in the Dataset are granted, assigned or otherwise transferred to You under these Terms.

Attribution: You must provide the following attributions in any Derivative IP: “This work was made using the Wayve LingoQA dataset and benchmark, provided by Wayve Technologies Ltd under licence terms available at https://github.com/wayveai/LingoQA.”

Feedback: If You give feedback, ideas or suggestions about the Dataset to Wayve or another person or entity acting on Wayve’s behalf (“Feedback”), You acknowledge and agree that Wayve has the unrestricted right to use, share and commercialise Your Feedback in any way and for any purpose, including for commercial purposes. This clause survives the termination or revocation of these Terms for whatever reason.

Compliance with law: You may not use the Dataset in any manner that violates applicable laws or regulations, including without limitation, applicable data protection laws.

Additional restrictions: You agree (a) not to distribute or publish or otherwise make available any models trained on or refined using the Dataset, or the parameters from such trained models, in whole or in part unless such model and/or its parameters are being distributed or published on an academic or scientific basis and free of charge; and (b) not to use or deploy the Dataset, any models trained on or refined using the Dataset, or the parameters from such trained models, in whole or in part, (i) in operation of a vehicle or robot or to assist in the operation of a vehicle or robot, or (ii) for any high-risk purpose, including without limitation any use that risks damage to property, personal injury or death.
Scope of licence: If You: (a) are concerned that Your intended use of the Dataset does not constitute Non-Commercial Use, or (b) would like to discuss a commercial licence, You must contact legal@wayve.ai and refrain from downloading or using the Dataset without Wayve’s prior written consent. Nothing in these Terms creates an obligation on Wayve to respond to such request or to grant you any commercial licence or other rights.

Limited non-assert: In consideration for access to this Dataset, You (also on behalf of Your successors and assignees of any rights protecting Your Derivative IP) (a) agree not to prepare, initiate, assert, or otherwise support any claim against Wayve or any of its affiliates, successors, or assignees, for infringement, misappropriation or other violation of any rights protecting Your Derivative IP, and (b) grant Wayve and the other parties in (a) above a worldwide and unrestricted licence to such rights, with such licence becoming effective only if You breach the obligations of (a).

Disclaimer and limitation of liability: You acknowledge and agree the Dataset is provided “as-is” and without any express or implied warranty, including without limitation warranties of accuracy, completeness, reliability, title, merchantability, fitness for a particular purpose, and non-infringement. To the fullest extent permitted by law, in no event shall Wayve or its affiliates, partners, licensors, customers, employees or contractors be liable under any legal theory with respect to the Dataset or any Derivative IP (or any use thereof), or for any direct, indirect, consequential, exemplary, incidental, punitive, or special damages, lost profits, wasted expenditure, harm to reputation or loss of goodwill and/or loss of business or data (regardless of whether such liability arises in tort, contract or in any other way and whether or not caused by negligence or misrepresentation).
Revocation of licence: The licence granted under these Terms may be revoked by Wayve in its sole discretion at any time by posting a notice of revocation on https://github.com/wayveai/LingoQA.

No updates: You understand and agree that Wayve is under no obligation to provide maintenance services, update services, notices of latent defects, or corrections of defects with regard to the Dataset. Wayve nevertheless reserves the right to update, modify, or discontinue the Dataset at any time.

Termination for breach: If You fail to comply with these Terms, then any rights granted to You hereunder terminate automatically and You agree to remove access to and delete the Dataset.

Indemnification: You shall fully indemnify and keep indemnified and hold Wayve harmless from and against any losses, claims, damages, liability, costs (including legal and other professional fees), fines and expenses incurred by Wayve as a result of or in connection with Your use of the Dataset or breach of these Terms.

Assignment: You may not assign these Terms or any rights or obligations hereunder, except with Wayve’s express written consent. Any attempted assignment in violation of this section will be void. Wayve may assign its rights and obligations hereunder without written notice to You.

No partnership or agency: Nothing in these Terms constitutes, or shall be deemed to constitute, a partnership between the parties nor make any party the agent of another party.

Severance: If any provision of these Terms (or part of any provision) is or becomes illegal, invalid or unenforceable but would be legal, valid and enforceable if some part of it was deleted or modified, the provision or part-provision in question shall apply with such deletions or modifications as may be necessary to make the provision legal, valid and enforceable.
Waiver: No failure, delay or omission by Wayve in exercising any right, power or remedy provided by law or under these Terms shall operate as a waiver of that right, power or remedy, nor shall it preclude or restrict any future exercise of that or any other right or remedy.

Governing law and jurisdiction: These Terms shall be governed by and construed in accordance with the laws of England. Any disputes arising from or in connection with these Terms shall be subject to the exclusive jurisdiction of the courts of England and Wales.

================================================
FILE: README.md
================================================
![banner](assets/banner.png)

[ECCV 2024] Official GitHub repository for "LingoQA: Visual Question Answering for Autonomous Driving", presenting the LingoQA benchmark, dataset and baseline model for autonomous driving Visual Question Answering (VQA).

[[preprint]](https://github.com/wayveai/LingoQA/blob/main/assets/preprint.pdf) [[arxiv]](https://arxiv.org/abs/2312.14115) [[huggingface]](https://huggingface.co/papers/2312.14115)

## Overview

In this repository you will find:
- A summary of the LingoQA dataset and evaluation metric
- An example of how to run the benchmark on your model predictions
- Details about how to download the datasets
- An example of how to run the novel evaluation metric, Lingo-Judge

## 5-minute summary

[5-minute video summary](https://www.canva.com/design/DAF-vlMT8vo/X7ynk_nv52t7jE7UpKlRBg/view?utm_content=DAF-vlMT8vo&utm_campaign=designshare&utm_medium=link&utm_source=recording_view)

## Benchmark

To run the LingoQA benchmark on your predictions, first install the requirements for the repository:

```shell
pip install -r ./requirements.txt
```

Export the predictions of your model to a `predictions.csv` file with the columns `question_id`, `segment_id` and `answer`, corresponding to the [evaluation dataset](https://drive.google.com/drive/folders/1ivYF2AYHxDQkX5h7-vo7AUDNkKuQz_fL/evaluation).
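As a minimal sketch of the expected file format (the single row below is a hypothetical placeholder, not a real question or segment identifier), such a file can be written with pandas:

```python
import pandas as pd

# Hypothetical example row; a real predictions file needs one answer per
# (question_id, segment_id) pair in the evaluation set, 500 rows in total.
rows = [
    {
        "question_id": "example_question_id",
        "segment_id": "example_segment_id",
        "answer": "Yes, there is one pedestrian crossing.",
    },
]

pd.DataFrame(rows, columns=["question_id", "segment_id", "answer"]).to_csv(
    "predictions.csv", index=False
)
```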
You should have 500 answers in the file. Subsequently, run the benchmark using the following command:

```shell
python ./benchmark/evaluate.py --predictions_path ./path_to_predictions/predictions.csv
```

## Download Data and Annotations

The LingoQA training and evaluation datasets contain:
- Videos corresponding to driving scenarios
- Language annotations

| Dataset    | Split | Videos | QA Pairs | QA Per Scenario | Link |
|------------|-------|--------|----------|-----------------|------|
| Scenery    | Train | 3.5k   | 267.8k   | ~10.9           | [https://drive.google.com/drive/folders/1ivYF2AYHxDQkX5h7-vo7AUDNkKuQz_fL/scenery](https://drive.google.com/drive/folders/1GiwWGfrM8pO27CYLu_9Uwtdcz0JoqHr7) |
| Action     | Train | 24.5k  | 152.5k   | ~43.6           | [https://drive.google.com/drive/folders/1ivYF2AYHxDQkX5h7-vo7AUDNkKuQz_fL/action](https://drive.google.com/drive/folders/1QQqBrR3uGDC05Zc11zMeui6Zzl7RvFZg) |
| Evaluation | Test  | 100    | 1000     | 10              | [https://drive.google.com/drive/folders/1ivYF2AYHxDQkX5h7-vo7AUDNkKuQz_fL/evaluation](https://drive.google.com/drive/folders/1oA7W8-Ej_uJEuUxZIjPP5K8hQGGzYsPq) |

## Evaluation Metric

Lingo-Judge is an evaluation metric that aligns closely with human judgement on the LingoQA evaluation suite.

```python
# Import necessary libraries
from transformers import pipeline

# Define the model name to be used in the pipeline
model_name = 'wayveai/Lingo-Judge'

# Define the question and its corresponding answer and prediction
question = "Are there any pedestrians crossing the road? If yes, how many?"
answer = "1" prediction = "Yes, there is one" # Initialize the pipeline with the specified model, device, and other parameters pipe = pipeline("text-classification", model=model_name) # Format the input string with the question, answer, and prediction input = f"[CLS]\nQuestion: {question}\nAnswer: {answer}\nStudent: {prediction}" # Pass the input through the pipeline to get the result result = pipe(input) # Print the result and score score = result[0]['score'] print(score > 0.5, score) ``` ## Citation If you find our work useful in your research, please consider citing: ```bibtex @article{marcu2023lingoqa, title={LingoQA: Visual Question Answering for Autonomous Driving}, author={Ana-Maria Marcu and Long Chen and Jan Hünermann and Alice Karnsund and Benoit Hanotte and Prajwal Chidananda and Saurabh Nair and Vijay Badrinarayanan and Alex Kendall and Jamie Shotton and Oleg Sinavski}, journal={arXiv preprint arXiv:2312.14115}, year={2023}, } ``` ================================================ FILE: benchmark/constants.py ================================================ """ LingoQA datasets are stored in Google Cloud. This file provides the download link for the datasets, as well as reference keys for the data. 
""" from enum import Enum LINGOQA_TEST = "https://drive.usercontent.google.com/u/1/uc?id=1I8u6uYysQUstoVYZapyRQkXmOwr-AG3d&export=download" LINGO_JUDGE = "wayveai/Lingo-Judge" class Keys(str, Enum): question_id = "question_id" segment_id = "segment_id" question = "question" answer = "answer" references = "references" prediction = "prediction" max_score = "max_score" score = "score" probability = "probability" correct = "correct" ================================================ FILE: benchmark/demo.py ================================================ # Demo for using the Lingo-Judge directly from Hugging Face from transformers import pipeline # Define the model name to be used in the pipeline model_name = 'wayveai/Lingo-Judge' # Define the question and its corresponding answer and prediction question = "Are there any pedestrians crossing the road? If yes, how many?" answer = "1" prediction = "Yes, there is one" # Initialize the pipeline with the specified model, device, and other parameters pipe = pipeline("text-classification", model=model_name) # Format the input string with the question, answer, and prediction input = f"[CLS]\nQuestion: {question}\nAnswer: {answer}\nStudent: {prediction}" # Pass the input through the pipeline to get the result result = pipe(input) # Print the result and score score = result[0]['score'] print(score > 0.5, score) ================================================ FILE: benchmark/evaluate.py ================================================ import click import torch import pandas as pd from datasets import Dataset from functools import partial from constants import LINGOQA_TEST, Keys from judge import LingoJudge @click.command() @click.option('--predictions_path', help='Path to predictions file.') @click.option('--batch_size', help='Batch size for evaluation.', default=1) def evaluate(predictions_path: str, batch_size: int) -> float: """ Simple script for running evaluation on the LingoQA benchmark. 
    Args:
        predictions_path: path to a .csv file containing the model predictions.
        batch_size: batch size for evaluation.
    Out:
        benchmark_score: evaluation score obtained from running the textual
            classifier on the benchmark.
    """
    references = pd.read_parquet(LINGOQA_TEST)
    references = references[[Keys.question_id, Keys.segment_id, Keys.question, Keys.answer]]
    references = references.groupby([Keys.question_id, Keys.segment_id, Keys.question]).agg(list)
    references = references.rename({Keys.answer: Keys.references}, axis=1)
    print(f"Loaded {len(references)} references.")

    predictions = pd.read_csv(predictions_path)
    predictions = predictions.rename({Keys.answer: Keys.prediction}, axis=1)
    print(f"Loaded {len(predictions)} predictions.")

    merged = pd.merge(predictions, references, on=[Keys.question_id, Keys.segment_id])
    print(f"Matched {len(merged)} predictions with references.")

    if len(merged) != 500:
        print(
            "WARNING! You are evaluating on a subset of the LingoQA benchmark. "
            "Please check your input file for missing or mismatched examples."
        )

    dataset = Dataset.from_pandas(merged)
    judge = LingoJudge().eval().to("cuda:0")
    dataset_evaluated = dataset.map(partial(evaluate_question, judge), batched=True, batch_size=batch_size)
    dataset_filtered = dataset_evaluated.filter(select_correct)
    benchmark_score = dataset_filtered.num_rows / dataset_evaluated.num_rows
    print(f"The overall benchmark score is {benchmark_score*100}%")
    return benchmark_score


def evaluate_question(metric: LingoJudge, data_dict: dict) -> dict:
    """
    Run evaluation for a batch of questions.

    Args:
        metric: the evaluation metric for computing the scores.
        data_dict: the data dictionary containing questions, references, and predictions.
    Out:
        data_dict: updated data dictionary containing information such as the
            maximum score, the probability of correctness, and a boolean
            indicating whether the prediction is correct or not.
""" questions = data_dict[Keys.question] references = data_dict[Keys.references] prediction = data_dict[Keys.prediction] scores = metric.compute(questions, references, prediction) data_dict[Keys.score] = scores data_dict[Keys.probability] = torch.sigmoid(scores) data_dict[Keys.correct] = scores > 0.0 return data_dict def select_correct(data_dict: dict) -> bool: """ Filtering function for selecting the predictions classified as correct. """ return data_dict[Keys.correct] if __name__=="__main__": _ = evaluate() ================================================ FILE: benchmark/judge.py ================================================ import torch from torch import nn from tqdm import trange from typing import List from transformers import AutoModelForSequenceClassification, AutoTokenizer from constants import LINGO_JUDGE, Keys class LingoJudge(nn.Module): """ LingoJudge is a textual classifier that evaluates the truthfulness of an answer on the LingoQA benchmark. """ def __init__(self, pretrained_model=LINGO_JUDGE): super().__init__() self.tokenizer = AutoTokenizer.from_pretrained(pretrained_model, use_fast=True) self.model = AutoModelForSequenceClassification.from_pretrained(pretrained_model).eval() @torch.inference_mode() def forward(self, question: str, references: List[str], prediction: str): """ Inference function for textual classifier with multiple reference answers. Args: question: Input question. references: List of references. prediction: Model prediction. Output: scores: Score indicating truthfulness. 
""" device = next(self.parameters()).device texts = [ f"{self.tokenizer.cls_token}\nQuestion: {question}\nAnswer: {a_gt}\nStudent: {prediction}" for a_gt in references ] encoded_input = self.tokenizer(texts, return_tensors='pt', padding=True, truncation=True, max_length=128) encoded_input = {k: v.to(device) for k, v in encoded_input.items()} output = self.model(**encoded_input) scores = output.logits.squeeze(-1) return scores def compute(self, questions: List[str], references: List[List[str]], predictions: List[str]): """ Compute maximum classifier metric. For multiple reference answers, selects the highest one. Args: questions: List of input questions. references: List of lists, with multiple references per question supported. predictions: List of model predictions. Output: scores: Score indicating truthfulness. """ max_scores = [] for index, question in enumerate(questions): references_preprocessed = [self.preprocess(reference) for reference in references[index]] prediction_preprocessed = self.preprocess(predictions[index]) scores = self.forward(question, references_preprocessed, prediction_preprocessed) max_score = [max(scores)] max_scores.extend(max_score) return torch.Tensor(max_scores) def preprocess(self, string: str): """ Preprocessing function for consistency. Args: string: input string to be processed. Output: output: processed string with lower cases and trailing lines removed. """ output = str(string).lower().lstrip().rstrip() return output ================================================ FILE: requirements.txt ================================================ sentencepiece==0.1.97 pytorch-lightning==2.1.2 transformers==4.36.2 datasets==2.16.1 pandas==2.0.3 protobuf==4.25.1 StrEnum==0.4.15 click==8.1.7 tqdm==4.66.1