Repository: Alrope123/rethinking-demonstrations
Branch: main
Commit: d8267faf528b
Files: 5
Total size: 43.2 KB

Directory structure:
gitextract_6owm_f3k/
├── README.md
├── create_data.py
├── gpt3.py
├── templates.py
└── test_gpt3.py

================================================
FILE CONTENTS
================================================

================================================
FILE: README.md
================================================
# Rethinking the Role of Demonstrations: What Makes In-Context Learning Work?

This includes an original implementation of "[Rethinking the Role of Demonstrations: What Makes In-Context Learning Work?][paper]" by [Sewon Min][sewon], [Xinxi Lyu][xinxi], [Ari Holtzman][ari], [Mikel Artetxe][mikel], [Mike Lewis][mike], [Hannaneh Hajishirzi][hanna], and [Luke Zettlemoyer][luke].

This code provides:
- Code for creating the variants of the demonstrations used in the experiments.
- Commands to run the models and get the numbers reported in the paper, based on the [MetaICL][metaicl] codebase.

Please leave issues for any questions about the paper or the code.

If you find our code or paper useful, please cite the paper:
```
@inproceedings{ min2022rethinking,
    title={ Rethinking the Role of Demonstrations: What makes In-context Learning Work? },
    author={ Min, Sewon and Lyu, Xinxi and Holtzman, Ari and Artetxe, Mikel and Lewis, Mike and Hajishirzi, Hannaneh and Zettlemoyer, Luke },
    booktitle={ EMNLP },
    year={ 2022 }
}
```

### Announcements
* 07/25/2022: The code also supports running GPT-3 now.
* 02/25/2022: The code supports running GPT-2, MetaICL and GPT-J for now. Please contact the authors for running other models.

## Content

1. [Preparation](#preparation)
2. [Reproducing Main Experiments](#reproducing-main-experiments) (Section 4.1 of the paper)
    * [No Demonstrations](#no-demonstrations)
    * [Demonstrations with gold labels](#demonstrations-with-gold-labels)
    * [Demonstrations with random labels](#demonstrations-with-random-labels)
3. [Reproducing Ablations](#reproducing-ablations) (Section 4.2 of the paper)
    * [Number of correct labels](#number-of-correct-labels)
    * [Number of input-label pairs in the demonstrations](#number-of-input-label-pairs-in-the-demonstrations)
    * [Using manual templates](#using-manual-templates)
4. [Reproducing Analysis](#reproducing-analysis) (Section 5 of the paper)
    * [Demonstrations with OOD input text](#demonstrations-with-ood-input-text)
    * [Demonstrations with random english words](#demonstrations-with-random-english-words)
    * [Demonstrations with random labels only (no inputs)](#demonstrations-with-random-labels-only-no-inputs)
    * [Demonstrations with no labels (inputs only)](#demonstrations-with-no-labels-inputs-only)

## Preparation

The code is tested with Python 3.8. The data and the code are based on the MetaICL codebase.

```bash
git remote add metaicl https://github.com/facebookresearch/MetaICL.git
git pull metaicl main --allow-unrelated-histories -X ours
```

Install the data dependencies and download the data.

```bash
conda create --name metaicl-data python=3.8
conda activate metaicl-data
pip install datasets==1.4.0 wget
cd preprocess
python _build_gym.py --build --n_proc=40 --do_test
```

This uses `k=16` by default. If you want to run ablations with varying `k`, please also run the following.

```bash
python _build_gym.py --build --n_proc=40 --do_test --test_k {4|8|32}
```

After preprocessing is done, come back to the main directory.

```bash
cd ../
conda deactivate
```

Now, install the model dependencies to run the model.
Please note that this Transformers version is not compatible with the `datasets` library used to download the data, so make sure to use a separate environment.
```
conda create --name metaicl python=3.8
conda activate metaicl
pip install torch==1.9.0
pip install git+https://github.com/huggingface/transformers.git@c37573806ab3526dd805c49cbe2489ad4d68a9d7
```

(Optional) Install the OpenAI Python library for running GPT-3.
```
pip install openai
```

## Reproducing Main Experiments

This is for reproducing experiments in Section 4.1 of the paper.

Evaluation datasets are:
* Classification (16 datasets): `financial_phrasebank`,`poem_sentiment`,`glue-wnli`,`climate_fever`,`glue-rte`,`superglue-cb`,`sick`,`medical_questions_pairs`,`glue-mrpc`,`hate_speech18`,`ethos-national_origin`,`ethos-race`,`ethos-religion`,`tweet_eval-hate`,`tweet_eval-stance_atheism`,`tweet_eval-stance_feminist`
* Multi-choice (10 datasets): `quarel`,`openbookqa`,`qasc`,`commonsense_qa`,`ai2_arc`,`codah`,`superglue-copa`,`dream`,`quartz-with_knowledge`,`quartz-no_knowledge`

#### No Demonstrations

To run the evaluation of No-Demonstrations:

```bash
# Direct GPT-2 Large
python test.py --dataset {dataset} --gpt2 gpt2-large --method direct --out_dir out/gpt2-large --do_zeroshot

# Channel GPT-2 Large
python test.py --dataset {dataset} --gpt2 gpt2-large --method channel --out_dir out/gpt2-large --do_zeroshot

# Direct MetaICL
python test.py --dataset {dataset} --gpt2 metaicl --method direct --out_dir out/metaicl --do_zeroshot

# Channel MetaICL
python test.py --dataset {dataset} --gpt2 channel-metaicl --method channel --out_dir out/channel-metaicl --do_zeroshot

# Direct GPT-J
python test.py --dataset {dataset} --gpt2 gpt-j-6B --method direct --out_dir out/gpt-j --do_zeroshot

# Channel GPT-J
python test.py --dataset {dataset} --gpt2 gpt-j-6B --method channel --out_dir out/gpt-j --do_zeroshot

# GPT-3
python test_gpt3.py --dataset {dataset} --gpt3 {ada|babbage|curie|davinci} --method {direct|channel} --out_dir out/gpt3 --do_zeroshot --api {API key}
```

Note that `test.py` and `test_gpt3.py` do not support multi-GPU inference.

Other useful flags:
* `--test_batch_size`: can be adjusted based on your GPU memory availability. With a 32GB GPU, you can use 64 for GPT-2 Large & MetaICL, and 16 for GPT-J **with no demonstrations**. Later, when you run the code **with demonstrations**, decreasing the batch size by a factor of 4 typically works, e.g., 16 (GPT-2 Large & MetaICL) and 4 (GPT-J) with a 32GB GPU.
* `--log_file`: if you want to save logs in a file, you can specify the path to the log file.

Notes for running GPT-3:
* You can create/check your OpenAI API keys by visiting [this link](https://beta.openai.com/account/api-keys).
* Running with GPT-3 can be expensive, and different GPT-3 models come with different costs. Please check [this link](https://openai.com/api/pricing/) to estimate the cost before running each experiment.
* The responses from the GPT-3 API are cached in the `out_dir`.

From now on, we will use the above commands as defaults and only tell you which flags to add.

#### Demonstrations with gold labels

Run the same commands as the [default commands](#no-demonstrations) but add `--use_demonstrations --k 16 --seed 100,13,21,42,87`.
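For concreteness, here is one fully assembled gold-label run. Direct GPT-2 Large on `poem_sentiment` is chosen purely for illustration; any of the model commands and evaluation datasets listed above work the same way.

```bash
# Direct GPT-2 Large with k=16 gold-label demonstrations, evaluated over 5 seeds
python test.py --dataset poem_sentiment --gpt2 gpt2-large --method direct \
    --out_dir out/gpt2-large --do_zeroshot \
    --use_demonstrations --k 16 --seed 100,13,21,42,87
```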
#### Demonstrations with random labels

Create the demonstrations with random labels via:

```bash
python create_data.py --variant random --dataset {dataset}
```

Then, run the same commands as the [default commands](#no-demonstrations) but add `--use_demonstrations --k 16 --seed 100,13,21,42,87 --dataset {dataset}_random`.

## Reproducing Ablations

This is for reproducing experiments in Section 4.2 of the paper.

Evaluation datasets are:
* Classification (5 datasets): `poem_sentiment`,`glue-rte`,`sick`,`glue-mrpc`,`tweet_eval-hate`
* Multi-choice (4 datasets): `openbookqa`,`commonsense_qa`,`ai2_arc`,`superglue-copa`

#### Number of correct labels

Create the demonstrations with a varying number of correct labels via:

```bash
python create_data.py --variant {75|50|25|0}_correct --dataset {dataset}
```

Then, run the same commands as the [default commands](#no-demonstrations) but add `--use_demonstrations --k 16 --seed 100,13,21,42,87 --dataset {dataset}_{75|50|25|0}_correct`.

#### Number of input-label pairs in the demonstrations

(Note that you should have run preprocessing with varying `k` to run this ablation. If you have not done this, please revisit the [Preparation](#preparation) section.)

Create the demonstrations with varying `k` via:

```bash
python create_data.py --variant random --dataset {dataset} --k {4|8|16|32}
```

Then, run the same commands as the [default commands](#no-demonstrations) but add `--use_demonstrations --k {4|8|16|32} --seed 100,13,21,42,87 --dataset {dataset}_random`.

#### Using manual templates

Create the demonstrations with a given label type and inference method via:

```bash
python create_data.py --variant {gold|random}_w_template --dataset {dataset} --method {direct|channel}
```

Then, run the same commands as the [default commands](#no-demonstrations) but add `--use_demonstrations --k 16 --seed 100,13,21,42,87 --dataset {dataset}_{gold|random}_w_template_{direct|channel}`.

## Reproducing Analysis

This is for reproducing experiments in Section 5 of the paper.

Evaluation datasets are:
* Classification (5 datasets): `poem_sentiment`,`glue-rte`,`sick`,`glue-mrpc`,`tweet_eval-hate`
* Multi-choice (4 datasets): `openbookqa`,`commonsense_qa`,`ai2_arc`,`superglue-copa`

#### Demonstrations with OOD input text

First, you need a corpus file in .txt format, where each line is a sentence (in plain text). In the paper, we used samples from the English portion of CC News, which we are unable to release here. Please visit [this link](https://commoncrawl.org/2016/10/news-dataset-available/) to learn more about how to download the CC News corpus.

Create the demonstrations with OOD input text via:

```bash
python create_data.py --variant ood_inputs --dataset {dataset} --corpus_path {corpus_path}
```

Then, run the same commands as the [default commands](#no-demonstrations) but add `--use_demonstrations --k 16 --seed 100,13,21,42,87 --dataset {dataset}_ood_inputs`.

#### Demonstrations with random english words

Create the demonstrations with random English words as labels via:

```bash
python create_data.py --variant random_english_words --dataset {dataset}
```

Then, run the same commands as the [default commands](#no-demonstrations) but add `--use_demonstrations --k 16 --seed {seed} --dataset {dataset}_random_english_words_seed={seed}`, where `seed` can be one of 100, 13, 21, 42, and 87.
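Since this variant creates a separate dataset copy per seed, the five seeds are passed one at a time rather than as a single comma-separated list. A small loop such as the sketch below covers all of them; Direct GPT-2 Large on `poem_sentiment` is again just an illustrative choice.

```bash
# One evaluation per seed: each seed has its own random-English-word label mapping
for seed in 100 13 21 42 87; do
    python test.py --dataset poem_sentiment_random_english_words_seed=${seed} \
        --gpt2 gpt2-large --method direct --out_dir out/gpt2-large --do_zeroshot \
        --use_demonstrations --k 16 --seed ${seed}
done
```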
#### Demonstrations with random labels only (no inputs)

Create the demonstrations with random labels only via:

```bash
python create_data.py --variant random_labels_only --dataset {dataset}
```

Then, run the same commands as the [default commands](#no-demonstrations) but add `--use_demonstrations --k 16 --seed 100,13,21,42,87 --dataset {dataset}_random_labels_only`.

#### Demonstrations with no labels (inputs only)

Create the demonstrations with no labels via:

```bash
python create_data.py --variant no_labels --dataset {dataset}
```

Then, run the same commands as the [default commands](#no-demonstrations) but add `--use_demonstrations --k 16 --seed 100,13,21,42,87 --dataset {dataset}_no_labels`.

[paper]: https://arxiv.org/abs/2202.12837
[sewon]: http://shmsw25.github.io/
[xinxi]: https://alrope123.github.io/
[ari]: https://ari-holtzman.github.io/
[mikel]: https://scholar.google.com/citations?user=N5InzP8AAAAJ&hl=en
[mike]: https://ai.facebook.com/people/mike-lewis/
[hanna]: https://homes.cs.washington.edu/~hannaneh/index.html
[luke]: https://www.cs.washington.edu/people/faculty/lsz
[metaicl]: https://github.com/facebookresearch/MetaICL

================================================
FILE: create_data.py
================================================
import os import argparse import random import json import numpy as np from collections import defaultdict, Counter from templates import apply_template def main(args): assert args.variant in [ "gold", "random", # main experiments in Section 4 "75_correct", "50_correct", "25_correct", "0_correct", # ablations in Section 4 "gold_w_template", "random_w_template", # ablations in Section 4 "ood_inputs", "random_english_words", "random_labels_only", "no_labels", # Section 5 "random_english_words_gold_labels", "permutated_labels", "random_true_distribution" ] if args.variant in ["gold_w_template", "random_w_template"]: assert args.method is not None, "Please specify `--method` with the inference method (`direct` or `channel`) for using the template." assert args.method in ["direct", "channel"], "Please make sure to use either `direct` or `channel`." if args.variant=="gold": print ("No need to run `create_data.py` --- you can use the original data as it is.") return if args.variant=="ood_inputs": # load sources for OOD inputs assert args.corpus_path is not None, \ """ Please note that you need to specify the path to the corpus from which the OOD inputs will be sampled. It should be a .txt file where each line contains a sentence (plain text).
""" grouped_samples = defaultdict(list) with open(args.corpus_path, "r") as f: random_texts = [] random_text_lens = [] for line in f: line = line.strip() random_texts.append(line) random_text_lens.append(len(line.split())) random_text_lens = np.array(random_text_lens) elif args.variant in ["random_english_words", "random_english_words_gold_labels"]: from english_words import english_words_set english_words_set = sorted(english_words_set) datasets = args.dataset.split(',') new_datasets = [dataset + "_" + args.variant + (("_" + args.method) if args.method is not None else "") for dataset in datasets] assert len(datasets) == len(new_datasets) ################################################################################################################ seeds = args.seed.split(',') perfs = [] for dataset_idx, (dataset, new_dataset) in enumerate(zip(datasets, new_datasets)): # contruct and save a new config file and data directory config_file = os.path.join(args.config_dir, "tasks") assert os.path.exists(config_file), config_file with open(os.path.join(config_file, "{}.json".format(dataset)), "r") as f: config = json.load(f) # in case of random English words, we will create a config file and data directory # for each random seed later on (since the data is different across seeds) if args.variant not in ["random_english_words", "random_english_words_gold_labels"]: with open(os.path.join(config_file, "{}.json".format(new_dataset)), "w") as f: json.dump(config, f) new_dataset_dir = os.path.join(args.data_dir, new_dataset) if not os.path.exists(new_dataset_dir): os.mkdir(new_dataset_dir) # load full training data to get the distribution of the labels if args.variant=="random_true_distribution": full_train_data_path = os.path.join(args.data_dir, dataset, "{}_16384_100_train.jsonl".format(dataset)) assert os.path.exists(full_train_data_path), "Please generate full training data first by running _build_gym.py with k=16384." 
full_train_data_labels = [] with open(full_train_data_path, "r") as f: for line in f: dp = json.loads(line) assert dp["task"]==dataset full_train_data_labels.append(dp["output"]) train_label_counter = Counter(full_train_data_labels) train_label_distribution = {label : train_label_counter[label] / len(full_train_data_labels) for label in train_label_counter} for seed in seeds: # random seed np.random.seed(int(seed)) if args.variant in ["random_english_words", "random_english_words_gold_labels"]: new_dataset = new_datasets[dataset_idx] + "_seed={}".format(seed) # read the original training and test data # note that we are modifying the training data only, # and the test data will always be the same # (we are creating duplicates only for convenience) train_data = [] train_data_path = os.path.join(args.data_dir, dataset, "{}_{}_{}_{}.jsonl".format(dataset, args.k, seed, "train")) with open(train_data_path, "r") as f: for line in f: dp = json.loads(line) assert dp["task"]==dataset dp["task"] = new_dataset train_data.append(dp) test_data = [] test_data_path = os.path.join(args.data_dir, dataset, "{}_{}_{}_{}.jsonl".format(dataset, args.k, seed, "test")) with open(test_data_path, "r") as f: for line in f: dp = json.loads(line) assert dp["task"]==dataset dp["task"] = new_dataset test_data.append(dp) # apply templates to inputs and labels if args.variant in ["gold_w_template", "random_w_template"]: for dp in train_data: apply_template(dp, dataset, args.method) for dp in test_data: apply_template(dp, dataset, args.method) # now, for random_english_words, create a config file and data directory if args.variant in ["random_english_words", "random_english_words_gold_labels"]: new_dataset_dir = os.path.join(args.data_dir, new_dataset) if not os.path.exists(new_dataset_dir): os.mkdir(new_dataset_dir) if config["task_type"]=="classification": new_options = list(np.random.choice(english_words_set, size=len(config["options"]), replace=False)) new_mapping = {option: new_option for option, new_option in zip(config["options"], new_options)} new_config = config.copy() new_config["options"] = new_options with open(os.path.join(config_file, "{}.json".format(new_dataset)), "w") as f: json.dump(new_config, f) for i, dp in enumerate(train_data): train_data[i]["output"] = new_mapping[dp["output"]] train_data[i]["options"] = [new_mapping[option] for option in dp["options"]] if args.variant == "random_english_words_gold_labels": # also modify the test data for classification tasks for i, dp in enumerate(test_data): test_data[i]["output"] = new_mapping[dp["output"]] test_data[i]["options"] = [new_mapping[option] for option in dp["options"]] elif config["task_type"]=="multi-choice": with open(os.path.join(config_file, "{}.json".format(new_dataset)), "w") as f: json.dump(config, f) shuffled_indices = np.random.permutation(range(len(english_words_set))) shuffled_options = [english_words_set[i] for i in shuffled_indices] offset = 0 for i, dp in enumerate(train_data): new_options = shuffled_options[offset:offset+len(dp["options"])] offset += len(dp["options"]) train_data[i]["output"] = new_options[dp["options"].index(dp["output"])] train_data[i]["options"] = new_options else: raise NotImplementedError() # modify both train input and test input for permutated_labels with classification tasks if args.variant == "permutated_labels" and config["task_type"]=="classification": old_options = config["options"] new_options = [old_options[(i+1)%len(old_options)] for i in range(len(old_options))] new_mapping = {old_option: new_option 
for old_option, new_option in zip(old_options, new_options)} for i, dp in enumerate(train_data): train_data[i]["output"] = new_mapping[dp["output"]] for i, dp in enumerate(test_data): test_data[i]["output"] = new_mapping[dp["output"]] ## modify labels in the training data if args.variant in ["75_correct", "50_correct", "25_correct"]: num_correct = args.k * int(args.variant.split("_")[0]) // 100 indices_correct = np.random.permutation(range(args.k))[:num_correct] for dp_idx, dp in enumerate(train_data): if args.variant in ["gold", "gold_w_template", "permutated_labels", "random_english_words_gold_labels"] or \ (args.variant in ["75_correct", "50_correct", "25_correct"] and dp_idx in indices_correct): # assign correct label pass elif args.variant.endswith("_correct"): # assign incorrect label dp["output"] = dp["options"][np.random.choice([i for i in range(len(dp["options"])) if dp["options"][i] != dp["output"]])] elif args.variant=="no_labels": # assign empty label dp["output"] = "" dp["options"] = [""] elif args.variant=="random_true_distribution": # assign random labels according to the distribution in the training data dp["output"] = np.random.choice(list(train_label_distribution.keys()), p=list(train_label_distribution.values())) else: # assign random label dp["output"] = np.random.choice(dp["options"]) ## modify inputs in the training data if args.variant=="random_labels_only": for dp in train_data: dp["input"] = "" elif args.variant=="ood_inputs": new_train_data = [] for dp in test_data: l = len(dp["input"].split()) prob = np.exp(-np.power(random_text_lens-l, 2)/50) prob /= np.sum(prob) samples = np.random.choice(random_texts, size=args.k, replace=False, p=prob) assert len(samples)==len(train_data) new_train_data.append([]) for train_dp, sample in zip(train_data, samples): new_train_data[-1].append({"task": train_dp["task"], "input": sample, "output": train_dp["output"], "options": train_dp["options"]}) train_data = new_train_data # write the modified data with open(os.path.join(new_dataset_dir, "{}_{}_{}_{}.jsonl".format(new_dataset, args.k, seed, "train")), "w") as f: for dp in train_data: f.write(json.dumps(dp)) f.write("\n") with open(os.path.join(new_dataset_dir, "{}_{}_{}_{}.jsonl".format(new_dataset, args.k, seed, "test")), "w") as f: for dp in test_data: f.write(json.dumps(dp)) f.write("\n") print ("Done for %s seed=%s" % (new_dataset, seed)) if __name__=='__main__': parser = argparse.ArgumentParser() parser.add_argument("--dataset", type=str, default=None) parser.add_argument("--k", type=int, default=16) parser.add_argument("--seed", type=str, default="100,13,21,42,87") parser.add_argument("--variant", type=str, default="random", required=True) parser.add_argument("--method", type=str, default=None) parser.add_argument("--data_dir", type=str, default="data") parser.add_argument("--config_dir", type=str, default="config") parser.add_argument("--corpus_path", type=str, default=None) args = parser.parse_args() main(args) ================================================ FILE: gpt3.py ================================================ import time import sys import numpy as np import torch import json import openai from torch.utils.data import TensorDataset, DataLoader, SequentialSampler from transformers import GPT2Tokenizer class GPT3Model(object): def __init__(self, model_name, api_key, logger=None): self.model_name = model_name try: openai.api_key = api_key except Exception: pass self.tokenizer = GPT2Tokenizer.from_pretrained("gpt2-xl") self.logger=logger def prepare_data(self, 
train_data, test_data, method, batch_size=10, dp_sep="\n", max_length=1024): # format demonstrations demonstrations = "" for dp in train_data: if method=="direct": demonstrations += "{}{}{}\n\n\n".format(dp["input"], dp_sep, dp["output"]) elif method=="channel": demonstrations += "{}{}{}\n\n\n".format(dp["output"], dp_sep, dp["input"]) else: raise NotImplementedError() # append demonstrations and separate options inputs = [] outputs = [] metadata = [] for dp in test_data: prompt = dp["input"] options = dp["options"] indices = [i for i in range(len(inputs), len(inputs) + len(options))] metadata.append({"indices": indices, "options": options}) if method=="direct": inputs += [demonstrations + prompt + dp_sep for option in options] outputs += [option for option in options] elif method=="channel": inputs += [demonstrations + option + dp_sep for option in options] outputs += [prompt for option in options] else: raise NotImplementedError() # truncate inputs for i, (inp, out) in enumerate(zip(inputs, outputs)): input_ids = self.tokenizer.encode(inp) output_ids = self.tokenizer.encode(out) if (len(input_ids) + len(output_ids) > max_length): input_ids = input_ids[len(input_ids)+len(output_ids) - max_length:] assert len(input_ids)+len(output_ids) == max_length inputs[i] = self.tokenizer.decode(input_ids) if self.logger is not None: self.logger.info("Checking the first example...") self.logger.info(inputs[0] + "" + outputs[0]) # construct a dataloader dataset = zip(inputs, outputs) input_chunks = [inputs[i : i + batch_size] for i in range(0, len(inputs), batch_size)] output_chunks = [outputs[i : i + batch_size] for i in range(0, len(outputs), batch_size)] dataloader = [(input_chunks[i], output_chunks[i]) for i in range(0, len(input_chunks))] return dataloader, metadata def do_inference(self, dataloader): losses = [] cache = [] cost = 0 for inputs, outputs in dataloader: data = [inp + out for inp, out in zip(inputs, outputs)] response = self.gpt3(data) for choice in response["choices"]: cost += len(choice["logprobs"]["tokens"]) * 0.00006 # print("current cost = " + str(cost)) cache.append((data, response)) # get the beginning of the target from the response (based on tokenization) for inp, outp, out in zip(inputs, outputs, response["choices"]): assert inp+outp==out["text"] i = 0 while out['logprobs']['text_offset'][i] < len(inp): i += 1 loss = -sum(out['logprobs']["token_logprobs"][i:]) losses.append(loss / (len(out['logprobs']['text_offset']) - i)) return losses, cache def do_predict(self, losses, metadata): predictions = [] for dp in metadata: curr_label_losses = [losses[index] for index in dp["indices"]] prediction_idx = sorted(enumerate(curr_label_losses), key=lambda x: x[1])[0][0] prediction = dp["options"][prediction_idx] predictions.append(prediction.strip()) return predictions def gpt3(self, prompt, max_len=0, temp=0, num_log_probs=0, echo=True, n=None): # call GPT-3 API until result is provided and then return it response = None received = False while not received: try: response = openai.Completion.create(engine=self.model_name, prompt=prompt, max_tokens=max_len, temperature=temp, logprobs=num_log_probs, echo=echo, stop='\n', n=n) received = True except: error = sys.exc_info()[0] if error == openai.error.InvalidRequestError: # something is wrong: e.g. 
prompt too long print(f"InvalidRequestError\nPrompt passed in:\n\n{prompt}\n\n") assert False print("API error:", error) time.sleep(1) return response ================================================ FILE: templates.py ================================================ import string TEMPLATES = { "financial_phrasebank": { "direct" : ("{}", "The sentiment is: {}"), "channel": ("{}", "The sentiment is: {}") }, "poem_sentiment": { "direct" : ("{}", "The sentiment is: {}"), "channel": ("{}", "The sentiment is: {}") }, "glue-mrpc": { "direct" : ("{}\nThe question is: {} True or False?", "The answer is: {}"), "channel": ("The question is: {} True or False?\n{}", "The answer is: {}") }, "glue-rte": { "direct" : ("{}\nThe question is: {} True or False?", "The answer is: {}"), "channel": ("The question is: {} True or False?\n{}", "The answer is: {}") }, "sick": { "direct" : ("{}\nThe question is: {} True or False?", "The answer is: {}"), "channel": ("The question is: {} True or False?\n{}", "The answer is: {}") }, "tweet_eval-hate": { "direct" : ("Tweet: {}", "Sentiment: {}"), "channel": ("Tweet: {}", "Sentiment: {}"), }, "openbookqa": { "direct" : ("The question is: {}", "The answer is: {}"), "channel": ("The question is: {}", "The answer is: {}") }, "ai2_arc": { "direct" : ("The question is: {}", "The answer is: {}"), "channel": ("The question is: {}", "The answer is: {}") }, "codah": { "direct" : ("The question is: {}", "The answer is: {}"), "channel": ("The question is: {}", "The answer is: {}") }, "commonsense_qa": { "direct" : ("The question is: {}", "The answer is: {}"), "channel": ("The question is: {}", "The answer is: {}") } } def apply_template(dp, dataset, method): if dataset.startswith("superglue-copa"): if method == "direct": if dp["input"].startswith("Cause: "): dp["input"] = dp["input"][7:-1] + " so" dp["output"] = dp["output"][8].lower() + dp["output"][9:] for i, options in enumerate(dp["options"]): dp["options"][i] = dp["options"][i][8].lower() + dp["options"][i][9:] elif dp["input"].startswith("Effect: "): dp["input"] = dp["input"][8:-1] + " because" dp["output"] = dp["output"][7].lower() + dp["output"][8:] for i, options in enumerate(dp["options"]): dp["options"][i] = dp["options"][i][7].lower() + dp["options"][i][8:] else: raise NotImplementedError() elif method == "channel": if dp["output"].startswith("Cause: "): dp["output"] = dp["output"][7:-1] + " so" dp["input"] = dp["input"][8].lower() + dp["input"][9:] for i, options in enumerate(dp["options"]): dp["options"][i] = dp["options"][i][7:-1] + " so" elif dp["output"].startswith("Effect: "): dp["output"] = dp["output"][8:-1] + " because" dp["input"] = dp["input"][7].lower() + dp["input"][8:] for i, options in enumerate(dp["options"]): dp["options"][i] = dp["options"][i][8:-1] + " because" else: raise NotImplementedError() elif dataset.startswith("glue") or dataset.startswith("sick"): def map_option(option): if option in ["equivalent", "entailment"]: return "True" if option in ["not_equivalent", "not_entailment", "contradiction"]: return "False" if option in ["neutral"]: return "Not sure" raise NotImplementedError(option) dp["input"] = dp["input"].replace("sentence 1: ", "").replace("sentence 2: ", "") splits = dp["input"].split(" [SEP] ") if method=="channel": splits = [splits[1], splits[0]] splits = [split if split[-1] in string.punctuation else split+"."
for split in splits] dp["input"] = TEMPLATES[dataset][method][0].format(splits[0], splits[1]) dp["output"] = TEMPLATES[dataset][method][1].format(map_option(dp["output"])) for i, options in enumerate(dp["options"]): dp["options"][i] =TEMPLATES[dataset][method][1].format(map_option(dp["options"][i])) else: def map_option(option): if dataset=="tweet_eval-hate": return {"hate": "against", "non-hate": "favor"}[option] return option dp["input"] = TEMPLATES[dataset][method][0].format(dp["input"]) dp["output"] = TEMPLATES[dataset][method][1].format(map_option(dp["output"])) for i, options in enumerate(dp["options"]): dp["options"][i] =TEMPLATES[dataset][method][1].format(map_option(dp["options"][i])) ================================================ FILE: test_gpt3.py ================================================ # Copyright (c) Facebook, Inc. and its affiliates. # All rights reserved. # # This source code is licensed under the license found in the # LICENSE file in the root directory of this source tree. import os import argparse import pickle as pkl import random import torch import math import json import string import logging import numpy as np from tqdm import tqdm from collections import Counter, defaultdict from torch.utils.data import TensorDataset, DataLoader, SequentialSampler from transformers import GPT2Tokenizer, AutoTokenizer from metaicl.data import MetaICLData from metaicl.model import MetaICLModel from utils.data import load_data from gpt3 import GPT3Model def main(logger, args): assert (args.dataset is not None and args.task is None) or (args.dataset is None and args.task is not None) tokenizer = AutoTokenizer.from_pretrained("gpt2") add_newlines = True ### checkpoint ... if not args.do_zeroshot: if args.checkpoint is not None: checkpoint = args.checkpoint assert args.global_step is None else: assert args.global_step is not None checkpoint = os.path.join(args.out_dir, "model-{}.pt".format(args.global_step)) assert os.path.exists(checkpoint) else: add_newlines = False checkpoint = None metaicl_model = GPT3Model(args.gpt3, args.api, logger) if not os.path.exists(args.out_dir): os.makedirs(args.out_dir) # setup hyperparams for data max_length_per_example = 256 max_length = 256 if args.use_demonstrations: orig_max_length = max_length if args.do_zeroshot: max_length = min(max_length * args.k, 1024) else: max_length = min(max_length * args.k, 1024) logger.info("batch_size=%d\tmax_length=%d\tmax_length_per_example=%d" % ( args.test_batch_size, max_length, max_length_per_example)) metaicl_data = MetaICLData(logger, tokenizer, args.method,args.use_demonstrations, args.k, max_length, max_length_per_example) results = [] errors = [] seeds = args.seed.split(",") config_split = "unseen_domain_test" if args.unseen_domain_only else "test" for seed in seeds: ### data ... 
train_data = load_data(args.task, "train", args.k, seed=seed, config_split=config_split, datasets=None if args.dataset is None else args.dataset.split(",")) dev_data = load_data(args.task, args.split, args.k, seed=seed, config_split=config_split, datasets=None if args.dataset is None else args.dataset.split(","), is_null=args.is_null) train_counter = Counter() dev_counter = Counter() for dp in train_data: train_counter[dp["task"]] += 1 for dp in dev_data: dev_counter[dp["task"]] += 1 for k, v in train_counter.items(): logger.info("[Train] %s\t%d" % (k, v)) for k, v in dev_counter.items(): logger.info("[Dev] %s\t%d" % (k, v)) logger.info("%s on %s (%d train, %d dev)" % (args.method, args.task, len(train_counter), len(dev_counter))) for test_task in dev_counter: curr_dev_data = [dp for dp in dev_data if dp["task"]==test_task] curr_train_data = [dp for dp in train_data if dp["task"]==test_task] assert len(curr_dev_data)>0 assert not args.use_demonstrations or len(curr_train_data)==args.k, \ (args.use_demonstrations, len(curr_train_data), args.k) config_file = "config/tasks/{}.json".format(test_task) assert os.path.exists(config_file), config_file with open(config_file, "r") as f: config = json.load(f) is_classification = config["task_type"]=="classification" if is_classification: options = curr_dev_data[0]["options"] assert np.all([d["options"]==options for d in curr_dev_data+curr_train_data]) result = run(logger, test_task, metaicl_data, metaicl_model, curr_train_data, curr_dev_data, seed, checkpoint, is_classification, add_newlines) if result is None: errors.append("%s/%s" % (test_task, seed)) else: results.append(result) if args.is_null: return logger.info("Macro-F1 of %s over %d target tasks: %.1f" % (args.task, len(results) // len(seeds), 100*np.mean(results))) if len(errors)>0: logger.info("You had errors with datasets: %s", ",".join(errors)) logger.info("Please see the error messages") def run(logger, task, metaicl_data, metaicl_model, train_data, dev_data, seed, checkpoint, is_classification, add_newlines): if args.do_zeroshot: split_name = args.split if args.is_null: split_name += "-null" cache_path = os.path.join(args.out_dir, "{}-{}-{}{}{}{}.pkl".format( task, split_name, metaicl_data.method, "-k={}".format(args.k) if args.use_demonstrations else "", "-s={}".format(seed) if args.use_demonstrations else "", "" if add_newlines else "-no-newlines")) gpt3_cache_path = os.path.join(args.out_dir, "{}-{}-{}{}{}{}.json".format( task, split_name, metaicl_data.method, "-k={}".format(args.k) if args.use_demonstrations else "", "-s={}".format(seed) if args.use_demonstrations else "", "" if add_newlines else "-no-newlines")) else: assert add_newlines cache_path = os.path.join(args.out_dir, "{}-{}-{}{}{}.pkl".format( task, args.split, metaicl_data.method, "-k={}".format(args.k) if args.use_demonstrations else "", "-s={}".format(seed) if args.use_demonstrations else "" )) gpt3_cache_path = os.path.join(args.out_dir, "{}-{}-{}{}{}.json".format( task, args.split, metaicl_data.method, "-k={}".format(args.k) if args.use_demonstrations else "", "-s={}".format(seed) if args.use_demonstrations else "" )) metaicl_data.tensorize(train_data, dev_data, add_newlines=add_newlines) gpt3_dataloader, gpt3_metadata = metaicl_model.prepare_data(train_data if args.use_demonstrations else [], dev_data, args.method, batch_size=args.test_batch_size) # metaicl_data.print_tensorized_example() logger.info(cache_path) if os.path.exists(cache_path): with open(cache_path, "rb") as f: losses = pkl.load(f) else: losses, gpt3cache
= metaicl_model.do_inference(gpt3_dataloader) with open(cache_path, "wb") as f: pkl.dump(losses, f) with open(gpt3_cache_path, "w") as f: json.dump(gpt3cache, f) if args.is_null: return None if args.use_calibration: assert args.do_zeroshot bias_path = cache_path.replace("/"+task+"-"+args.split, "/"+task+"-"+args.split+"-null") assert os.path.exists(bias_path), bias_path with open(bias_path, "rb") as f: bias_losses = pkl.load(f) losses = np.array(losses) bias_losses = np.array(bias_losses) assert losses.shape == bias_losses.shape losses -= bias_losses predictions = metaicl_model.do_predict(losses=losses, metadata=gpt3_metadata) groundtruths = [dp["output"] for dp in dev_data] perf = metaicl_data.evaluate(predictions, groundtruths, is_classification) logger.info("Accuracy=%s" % perf) prediction_path = cache_path.replace(".pkl", ".txt") if args.use_calibration: prediction_path = prediction_path.replace(".txt", "-calibrated.txt") with open(prediction_path, "w") as f: for prediction in predictions: f.write(prediction) f.write("\n") return perf if __name__=='__main__': parser = argparse.ArgumentParser() parser.add_argument("--do_zeroshot", default=False, action="store_true") parser.add_argument("--use_demonstrations", default=False, action="store_true") parser.add_argument("--use_calibration", default=False, action="store_true") parser.add_argument("--unseen_domain_only", default=False, action="store_true") parser.add_argument("--log_file", default=None, type=str) parser.add_argument("--task", type=str, default=None) parser.add_argument("--dataset", type=str, default=None) parser.add_argument("--k", type=int, default=16) parser.add_argument("--seed", type=str, default="100") parser.add_argument("--test_batch_size", type=int, default=64) parser.add_argument("--global_step", type=str, default=None) parser.add_argument("--checkpoint", type=str, default=None) parser.add_argument("--out_dir", type=str, required=True) parser.add_argument("--split", type=str, default="test") parser.add_argument("--is_null", default=False, action="store_true") parser.add_argument("--method", type=str, default="direct", choices=["direct", "channel"]) parser.add_argument("--gpt3", type=str, default="davinci", choices=["ada", "babbage", "curie", "davinci"]) parser.add_argument("--api", type=str, required=True) args = parser.parse_args() handlers = [logging.StreamHandler()] if args.log_file is not None: handlers.append(logging.FileHandler(args.log_file)) logging.basicConfig(format='%(asctime)s - %(levelname)s - %(name)s - %(message)s', datefmt='%m/%d/%Y %H:%M:%S', level=logging.INFO, handlers=handlers) logger = logging.getLogger(__name__) logger.info(args) main(logger, args)