Repository: Alrope123/rethinking-demonstrations
Branch: main
Commit: d8267faf528b
Files: 5
Total size: 43.2 KB

Directory structure:
gitextract_6owm_f3k/
├── README.md
├── create_data.py
├── gpt3.py
├── templates.py
└── test_gpt3.py

================================================
FILE CONTENTS
================================================

================================================
FILE: README.md
================================================
# Rethinking the Role of Demonstrations: What Makes In-Context Learning Work?

This includes an original implementation of "[Rethinking the Role of Demonstrations: What Makes In-Context Learning Work?][paper]" by [Sewon Min][sewon], [Xinxi Lyu][xinxi], [Ari Holtzman][ari], [Mikel Artetxe][mikel], [Mike Lewis][mike], [Hannaneh Hajishirzi][hanna], and [Luke Zettlemoyer][luke].

This code provides:
- Code for creating the variants of the demonstrations used in the experiments.
- Commands to run the models and get the numbers reported in the paper, based on the [MetaICL][metaicl] codebase.

Please leave issues for any questions about the paper or the code.

If you find our code or paper useful, please cite the paper:
```
@inproceedings{ min2022rethinking,
    title={ Rethinking the Role of Demonstrations: What makes In-context Learning Work? },
    author={ Min, Sewon and Lyu, Xinxi and Holtzman, Ari and Artetxe, Mikel and Lewis, Mike and Hajishirzi, Hannaneh and Zettlemoyer, Luke },
    booktitle={ EMNLP },
    year={ 2022 }
}
```

### Announcements
* 07/25/2022: The code also supports running GPT-3 now.
* 02/25/2022: The code supports running GPT-2, MetaICL and GPT-J for now. Please contact the authors for running other models.

## Content

1. [Preparation](#preparation)
2. [Reproducing Main Experiments](#reproducing-main-experiments) (Section 4.1 of the paper)
    * [No Demonstrations](#no-demonstrations)
    * [Demonstrations with gold labels](#demonstrations-with-gold-labels)
    * [Demonstrations with random labels](#demonstrations-with-random-labels)
3. [Reproducing Ablations](#reproducing-ablations) (Section 4.2 of the paper)
    * [Number of correct labels](#number-of-correct-labels)
    * [Number of input-label pairs in the demonstrations](#number-of-input-label-pairs-in-the-demonstrations)
    * [Using manual templates](#using-manual-templates)
4. [Reproducing Analysis](#reproducing-analysis) (Section 5 of the paper)
    * [Demonstrations with OOD input text](#demonstrations-with-ood-input-text)
    * [Demonstrations with random english words](#demonstrations-with-random-english-words)
    * [Demonstrations with random labels only (no inputs)](#demonstrations-with-random-labels-only-no-inputs)
    * [Demonstrations with no labels (inputs only)](#demonstrations-with-no-labels-inputs-only)

## Preparation

The code is tested with Python 3.8. The data and the code are based on the MetaICL codebase.

```bash
git remote add metaicl https://github.com/facebookresearch/MetaICL.git
git pull metaicl main --allow-unrelated-histories -X ours
```

Install the data dependencies and download the data.

```bash
conda create --name metaicl-data python=3.8
conda activate metaicl-data
pip install datasets==1.4.0 wget
cd preprocess
python _build_gym.py --build --n_proc=40 --do_test
```

This uses `k=16` by default. If you want to run ablations with varying `k`, please also run the following.

```bash
python _build_gym.py --build --n_proc=40 --do_test --test_k {4|8|32}
```

After preprocessing is done, come back to the main directory.

```bash
cd ../
conda deactivate
```

Now, install the model dependencies to run the model.
Please note that this Transformers version is not compatible with the `datasets` library used to download the data, so make sure to use a separate environment.
```
conda create --name metaicl python=3.8
conda activate metaicl
pip install torch==1.9.0
pip install git+https://github.com/huggingface/transformers.git@c37573806ab3526dd805c49cbe2489ad4d68a9d7
```

(Optional) Install the OpenAI Python library for running GPT-3.
```
pip install openai
```

## Reproducing Main Experiments

This is for reproducing experiments in Section 4.1 of the paper.

Evaluation datasets are:
* Classification (16 datasets): `financial_phrasebank`,`poem_sentiment`,`glue-wnli`,`climate_fever`,`glue-rte`,`superglue-cb`,`sick`,`medical_questions_pairs`,`glue-mrpc`,`hate_speech18`,`ethos-national_origin`,`ethos-race`,`ethos-religion`,`tweet_eval-hate`,`tweet_eval-stance_atheism`,`tweet_eval-stance_feminist`
* Multi-choice (10 datasets): `quarel`,`openbookqa`,`qasc`,`commonsense_qa`,`ai2_arc`,`codah`,`superglue-copa`,`dream`,`quartz-with_knowledge`,`quartz-no_knowledge`

#### No Demonstrations

To run the evaluation of No-Demonstrations:

```bash
# Direct GPT-2 Large
python test.py --dataset {dataset} --gpt2 gpt2-large --method direct --out_dir out/gpt2-large --do_zeroshot

# Channel GPT-2 Large
python test.py --dataset {dataset} --gpt2 gpt2-large --method channel --out_dir out/gpt2-large --do_zeroshot

# Direct MetaICL
python test.py --dataset {dataset} --gpt2 metaicl --method direct --out_dir out/metaicl --do_zeroshot

# Channel MetaICL
python test.py --dataset {dataset} --gpt2 channel-metaicl --method channel --out_dir out/channel-metaicl --do_zeroshot

# Direct GPT-J
python test.py --dataset {dataset} --gpt2 gpt-j-6B --method direct --out_dir out/gpt-j --do_zeroshot

# Channel GPT-J
python test.py --dataset {dataset} --gpt2 gpt-j-6B --method channel --out_dir out/gpt-j --do_zeroshot

# GPT-3
python test_gpt3.py --dataset {dataset} --gpt3 {ada|babbage|curie|davinci} --method {direct|channel} --out_dir out/gpt3 --do_zeroshot --api {API key}
```

Note that `test.py` and `test_gpt3.py` do not support multi-GPU inference.

Other useful flags:
* `--test_batch_size`: can be adjusted based on your GPU memory availability. With a 32GB GPU, you can use 64 for GPT-2 Large & MetaICL, and 16 for GPT-J **with no demonstrations**. Later, when you run the code **with demonstrations**, decreasing the batch size by a factor of 4 typically works, e.g., 16 (GPT-2 Large & MetaICL) and 4 (GPT-J) with a 32GB GPU.
* `--log_file`: if you want to save logs in a file, you can specify the path to the log file.

Notes for running GPT-3:
* You can create/check your OpenAI API keys by visiting [this link](https://beta.openai.com/account/api-keys).
* Running with GPT-3 can be expensive, and different GPT-3 models come with different costs. Please check [this link](https://openai.com/api/pricing/) to estimate the cost before running each experiment.
* The responses from the GPT-3 API are cached in the `out_dir`.

From now on, we will use the above commands as defaults and only tell you which flags to add.

#### Demonstrations with gold labels

Run the same commands as the [default commands](#no-demonstrations) but add `--use_demonstrations --k 16 --seed 100,13,21,42,87`.
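For concreteness, here is one fully assembled gold-label run. Direct GPT-2 Large on `poem_sentiment` is chosen purely for illustration; any of the model commands and evaluation datasets listed above work the same way.

```bash
# Direct GPT-2 Large with k=16 gold-label demonstrations, evaluated over 5 seeds
python test.py --dataset poem_sentiment --gpt2 gpt2-large --method direct \
    --out_dir out/gpt2-large --do_zeroshot \
    --use_demonstrations --k 16 --seed 100,13,21,42,87
```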
#### Demonstrations with random labels

Create the demonstrations with random labels via:

```bash
python create_data.py --variant random --dataset {dataset}
```

Then, run the same commands as the [default commands](#no-demonstrations) but add `--use_demonstrations --k 16 --seed 100,13,21,42,87 --dataset {dataset}_random`.

## Reproducing Ablations

This is for reproducing experiments in Section 4.2 of the paper.

Evaluation datasets are:
* Classification (5 datasets): `poem_sentiment`,`glue-rte`,`sick`,`glue-mrpc`,`tweet_eval-hate`
* Multi-choice (4 datasets): `openbookqa`,`commonsense_qa`,`ai2_arc`,`superglue-copa`

#### Number of correct labels

Create the demonstrations with a varying number of correct labels via:

```bash
python create_data.py --variant {75|50|25|0}_correct --dataset {dataset}
```

Then, run the same commands as the [default commands](#no-demonstrations) but add `--use_demonstrations --k 16 --seed 100,13,21,42,87 --dataset {dataset}_{75|50|25|0}_correct`.

#### Number of input-label pairs in the demonstrations

(Note that you should have run preprocessing with varying `k` to run this ablation. If you have not done this, please revisit the [Preparation](#preparation) section.)

Create the demonstrations with varying `k` via:

```bash
python create_data.py --variant random --dataset {dataset} --k {4|8|16|32}
```

Then, run the same commands as the [default commands](#no-demonstrations) but add `--use_demonstrations --k {4|8|16|32} --seed 100,13,21,42,87 --dataset {dataset}_random`.

#### Using manual templates

Create the demonstrations with a given label type and inference method via:

```bash
python create_data.py --variant {gold|random}_w_template --dataset {dataset} --method {direct|channel}
```

Then, run the same commands as the [default commands](#no-demonstrations) but add `--use_demonstrations --k 16 --seed 100,13,21,42,87 --dataset {dataset}_{gold|random}_w_template_{direct|channel}`.

## Reproducing Analysis

This is for reproducing experiments in Section 5 of the paper.

Evaluation datasets are:
* Classification (5 datasets): `poem_sentiment`,`glue-rte`,`sick`,`glue-mrpc`,`tweet_eval-hate`
* Multi-choice (4 datasets): `openbookqa`,`commonsense_qa`,`ai2_arc`,`superglue-copa`

#### Demonstrations with OOD input text

First, you need a corpus file in .txt format, where each line is a sentence (in plain text). In the paper, we used samples from the English portion of CC News, which we are unable to release here. Please visit [this link](https://commoncrawl.org/2016/10/news-dataset-available/) to learn more about how to download the CC News corpus.

Create the demonstrations with OOD input text via:

```bash
python create_data.py --variant ood_inputs --dataset {dataset} --corpus_path {corpus_path}
```

Then, run the same commands as the [default commands](#no-demonstrations) but add `--use_demonstrations --k 16 --seed 100,13,21,42,87 --dataset {dataset}_ood_inputs`.

#### Demonstrations with random english words

Create the demonstrations with random English words as labels via:

```bash
python create_data.py --variant random_english_words --dataset {dataset}
```

Then, run the same commands as the [default commands](#no-demonstrations) but add `--use_demonstrations --k 16 --seed {seed} --dataset {dataset}_random_english_words_seed={seed}`, where `seed` can be one of 100, 13, 21, 42, and 87.
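Since this variant creates a separate dataset copy per seed, the five seeds are passed one at a time rather than as a single comma-separated list. A small loop such as the sketch below covers all of them; Direct GPT-2 Large on `poem_sentiment` is again just an illustrative choice.

```bash
# One evaluation per seed: each seed has its own random-English-word label mapping
for seed in 100 13 21 42 87; do
    python test.py --dataset poem_sentiment_random_english_words_seed=${seed} \
        --gpt2 gpt2-large --method direct --out_dir out/gpt2-large --do_zeroshot \
        --use_demonstrations --k 16 --seed ${seed}
done
```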
#### Demonstrations with random labels only (no inputs)

Create the demonstrations with random labels only via:

```bash
python create_data.py --variant random_labels_only --dataset {dataset}
```

Then, run the same commands as the [default commands](#no-demonstrations) but add `--use_demonstrations --k 16 --seed 100,13,21,42,87 --dataset {dataset}_random_labels_only`.

#### Demonstrations with no labels (inputs only)

Create the demonstrations with no labels via:

```bash
python create_data.py --variant no_labels --dataset {dataset}
```

Then, run the same commands as the [default commands](#no-demonstrations) but add `--use_demonstrations --k 16 --seed 100,13,21,42,87 --dataset {dataset}_no_labels`.

[paper]: https://arxiv.org/abs/2202.12837
[sewon]: http://shmsw25.github.io/
[xinxi]: https://alrope123.github.io/
[ari]: https://ari-holtzman.github.io/
[mikel]: https://scholar.google.com/citations?user=N5InzP8AAAAJ&hl=en
[mike]: https://ai.facebook.com/people/mike-lewis/
[hanna]: https://homes.cs.washington.edu/~hannaneh/index.html
[luke]: https://www.cs.washington.edu/people/faculty/lsz
[metaicl]: https://github.com/facebookresearch/MetaICL

================================================
FILE: create_data.py
================================================
import os import argparse import random import json import numpy as np from collections import defaultdict, Counter from templates import apply_template def main(args): assert args.variant in [ "gold", "random", # main experiments in Section 4 "75_correct", "50_correct", "25_correct", "0_correct", # ablations in Section 4 "gold_w_template", "random_w_template", # ablations in Section 4 "ood_inputs", "random_english_words", "random_labels_only", "no_labels", # Section 5 "random_english_words_gold_labels", "permutated_labels", "random_true_distribution" ] if args.variant in ["gold_w_template", "random_w_template"]: assert args.method is not None, "Please specify `--method` with the inference method (`direct` or `channel`) for using the template." assert args.method in ["direct", "channel"], "Please make sure to use either `direct` or `channel`." if args.variant=="gold": print ("No need to run `create_data.py` --- you can use the original data as it is.") return if args.variant=="ood_inputs": # load sources for OOD inputs assert args.corpus_path is not None, \ """ Please note that you need to specify the path to the corpus from which the OOD inputs will be sampled. It should be a .txt file where each line contains a sentence (plain text).
""" grouped_samples = defaultdict(list) with open(args.corpus_path, "r") as f: random_texts = [] random_text_lens = [] for line in f: line = line.strip() random_texts.append(line) random_text_lens.append(len(line.split())) random_text_lens = np.array(random_text_lens) elif args.variant in ["random_english_words", "random_english_words_gold_labels"]: from english_words import english_words_set english_words_set = sorted(english_words_set) datasets = args.dataset.split(',') new_datasets = [dataset + "_" + args.variant + (("_" + args.method) if args.method is not None else "") for dataset in datasets] assert len(datasets) == len(new_datasets) ################################################################################################################ seeds = args.seed.split(',') perfs = [] for dataset_idx, (dataset, new_dataset) in enumerate(zip(datasets, new_datasets)): # contruct and save a new config file and data directory config_file = os.path.join(args.config_dir, "tasks") assert os.path.exists(config_file), config_file with open(os.path.join(config_file, "{}.json".format(dataset)), "r") as f: config = json.load(f) # in case of random English words, we will create a config file and data directory # for each random seed later on (since the data is different across seeds) if args.variant not in ["random_english_words", "random_english_words_gold_labels"]: with open(os.path.join(config_file, "{}.json".format(new_dataset)), "w") as f: json.dump(config, f) new_dataset_dir = os.path.join(args.data_dir, new_dataset) if not os.path.exists(new_dataset_dir): os.mkdir(new_dataset_dir) # load full training data to get the distribution of the labels if args.variant=="random_true_distribution": full_train_data_path = os.path.join(args.data_dir, dataset, "{}_16384_100_train.jsonl".format(dataset)) assert os.path.exists(full_train_data_path), "Please generate full training data first by running _build_gym.py with k=16384." 
full_train_data_labels = [] with open(full_train_data_path, "r") as f: for line in f: dp = json.loads(line) assert dp["task"]==dataset full_train_data_labels.append(dp["output"]) train_label_counter = Counter(full_train_data_labels) train_label_distribution = {label : train_label_counter[label] / len(full_train_data_labels) for label in train_label_counter} for seed in seeds: # random seed np.random.seed(int(seed)) if args.variant in ["random_english_words", "random_english_words_gold_labels"]: new_dataset = new_datasets[dataset_idx] + "_seed={}".format(seed) # read the original training and test data # note that we are modifying the training data only, # and the test data will always be the same # (we are creating duplicates only for convenience) train_data = [] train_data_path = os.path.join(args.data_dir, dataset, "{}_{}_{}_{}.jsonl".format(dataset, args.k, seed, "train")) with open(train_data_path, "r") as f: for line in f: dp = json.loads(line) assert dp["task"]==dataset dp["task"] = new_dataset train_data.append(dp) test_data = [] test_data_path = os.path.join(args.data_dir, dataset, "{}_{}_{}_{}.jsonl".format(dataset, args.k, seed, "test")) with open(test_data_path, "r") as f: for line in f: dp = json.loads(line) assert dp["task"]==dataset dp["task"] = new_dataset test_data.append(dp) # apply templates to inputs and labels if args.variant in ["gold_w_template", "random_w_template"]: for dp in train_data: apply_template(dp, dataset, args.method) for dp in test_data: apply_template(dp, dataset, args.method) # now, for random_english_words, create a config file and data directory if args.variant in ["random_english_words", "random_english_words_gold_labels"]: new_dataset_dir = os.path.join(args.data_dir, new_dataset) if not os.path.exists(new_dataset_dir): os.mkdir(new_dataset_dir) if config["task_type"]=="classification": new_options = list(np.random.choice(english_words_set, size=len(config["options"]), replace=False)) new_mapping = {option: new_option for option, new_option in zip(config["options"], new_options)} new_config = config.copy() new_config["options"] = new_options with open(os.path.join(config_file, "{}.json".format(new_dataset)), "w") as f: json.dump(new_config, f) for i, dp in enumerate(train_data): train_data[i]["output"] = new_mapping[dp["output"]] train_data[i]["options"] = [new_mapping[option] for option in dp["options"]] if args.variant == "random_english_words_gold_labels": # also modify the test data for classification tasks for i, dp in enumerate(test_data): test_data[i]["output"] = new_mapping[dp["output"]] test_data[i]["options"] = [new_mapping[option] for option in dp["options"]] elif config["task_type"]=="multi-choice": with open(os.path.join(config_file, "{}.json".format(new_dataset)), "w") as f: json.dump(config, f) shuffled_indices = np.random.permutation(range(len(english_words_set))) shuffled_options = [english_words_set[i] for i in shuffled_indices] offset = 0 for i, dp in enumerate(train_data): new_options = shuffled_options[offset:offset+len(dp["options"])] offset += len(dp["options"]) train_data[i]["output"] = new_options[dp["options"].index(dp["output"])] train_data[i]["options"] = new_options else: raise NotImplementedError() # modify both train input and test input for permutated_labels with classification tasks if args.variant == "permutated_labels" and config["task_type"]=="classification": old_options = config["options"] new_options = [old_options[(i+1)%len(old_options)] for i in range(len(old_options))] new_mapping = {old_option: new_option 
for old_option, new_option in zip(old_options, new_options)} for i, dp in enumerate(train_data): train_data[i]["output"] = new_mapping[dp["output"]] for i, dp in enumerate(test_data): test_data[i]["output"] = new_mapping[dp["output"]] ## modify labels in the training data if args.variant in ["75_correct", "50_correct", "25_correct"]: num_correct = args.k * int(args.variant.split("_")[0]) // 100 indices_correct = np.random.permutation(range(args.k))[:num_correct] for dp_idx, dp in enumerate(train_data): if args.variant in ["gold", "gold_w_template", "permutated_labels", "random_english_words_gold_labels"] or \ (args.variant in ["75_correct", "50_correct", "25_correct"] and dp_idx in indices_correct): # assign correct label pass elif args.variant.endswith("_correct"): # assign incorrect label dp["output"] = dp["options"][np.random.choice([i for i in range(len(dp["options"])) if dp["options"][i] != dp["output"]])] elif args.variant=="no_labels": # assign empty label dp["output"] = "" dp["options"] = [""] elif args.variant=="random_true_distribution": # assign random labels according to the distribution in the training data dp["output"] = np.random.choice(list(train_label_distribution.keys()), p=list(train_label_distribution.values())) else: # assign random label dp["output"] = np.random.choice(dp["options"]) ## modify inputs in the training data if args.variant=="random_labels_only": for dp in train_data: dp["input"] = "" elif args.variant=="ood_inputs": new_train_data = [] for dp in test_data: l = len(dp["input"].split()) prob = np.exp(-np.power(random_text_lens-l, 2)/50) prob /= np.sum(prob) samples = np.random.choice(random_texts, size=args.k, replace=False, p=prob) assert len(samples)==len(train_data) new_train_data.append([]) for train_dp, sample in zip(train_data, samples): new_train_data[-1].append({"task": train_dp["task"], "input": sample, "output": train_dp["output"], "options": train_dp["options"]}) train_data = new_train_data # write the modified data with open(os.path.join(new_dataset_dir, "{}_{}_{}_{}.jsonl".format(new_dataset, args.k, seed, "train")), "w") as f: for dp in train_data: f.write(json.dumps(dp)) f.write("\n") with open(os.path.join(new_dataset_dir, "{}_{}_{}_{}.jsonl".format(new_dataset, args.k, seed, "test")), "w") as f: for dp in test_data: f.write(json.dumps(dp)) f.write("\n") print ("Done for %s seed=%s" % (new_dataset, seed)) if __name__=='__main__': parser = argparse.ArgumentParser() parser.add_argument("--dataset", type=str, default=None) parser.add_argument("--k", type=int, default=16) parser.add_argument("--seed", type=str, default="100,13,21,42,87") parser.add_argument("--variant", type=str, default="random", required=True) parser.add_argument("--method", type=str, default=None) parser.add_argument("--data_dir", type=str, default="data") parser.add_argument("--config_dir", type=str, default="config") parser.add_argument("--corpus_path", type=str, default=None) args = parser.parse_args() main(args) ================================================ FILE: gpt3.py ================================================ import time import sys import numpy as np import torch import json import openai from torch.utils.data import TensorDataset, DataLoader, SequentialSampler from transformers import GPT2Tokenizer class GPT3Model(object): def __init__(self, model_name, api_key, logger=None): self.model_name = model_name try: openai.api_key = api_key except Exception: pass self.tokenizer = GPT2Tokenizer.from_pretrained("gpt2-xl") self.logger=logger def prepare_data(self, 
train_data, test_data, method, batch_size=10, dp_sep="\n", max_length=1024): # format demonstrations demonstrations = "" for dp in train_data: if method=="direct": demonstrations += "{}{}{}\n\n\n".format(dp["input"], dp_sep, dp["output"]) elif method=="channel": demonstrations += "{}{}{}\n\n\n".format(dp["output"], dp_sep, dp["input"]) else: raise NotImplementedError() # append demonstrations and separate options inputs = [] outputs = [] metadata = [] for dp in test_data: prompt = dp["input"] options = dp["options"] indices = [i for i in range(len(inputs), len(inputs) + len(options))] metadata.append({"indices": indices, "options": options}) if method=="direct": inputs += [demonstrations + prompt + dp_sep for option in options] outputs += [option for option in options] elif method=="channel": inputs += [demonstrations + option + dp_sep for option in options] outputs += [prompt for option in options] else: raise NotImplementedError() # truncate inputs for i, (inp, out) in enumerate(zip(inputs, outputs)): input_ids = self.tokenizer.encode(inp) output_ids = self.tokenizer.encode(out) if (len(input_ids) + len(output_ids) > max_length): input_ids = input_ids[len(input_ids)+len(output_ids) - max_length:] assert len(input_ids)+len(output_ids) == max_length inputs[i] = self.tokenizer.decode(input_ids) if self.logger is not None: self.logger.info("Checking the first example...") self.logger.info(inputs[0] + "" + outputs[0]) # construct a dataloader dataset = zip(inputs, outputs) input_chunks = [inputs[i : i + batch_size] for i in range(0, len(inputs), batch_size)] output_chunks = [outputs[i : i + batch_size] for i in range(0, len(outputs), batch_size)] dataloader = [(input_chunks[i], output_chunks[i]) for i in range(0, len(input_chunks))] return dataloader, metadata def do_inference(self, dataloader): losses = [] cache = [] cost = 0 for inputs, outputs in dataloader: data = [inp + out for inp, out in zip(inputs, outputs)] response = self.gpt3(data) for choice in response["choices"]: cost += len(choice["logprobs"]["tokens"]) * 0.00006 # print("current cost = " + str(cost)) cache.append((data, response)) # get the beginning of the target from the response (based on tokenization) for inp, outp, out in zip(inputs, outputs, response["choices"]): assert inp+outp==out["text"] i = 0 while out['logprobs']['text_offset'][i] < len(inp): i += 1 loss = -sum(out['logprobs']["token_logprobs"][i:]) losses.append(loss / (len(out['logprobs']['text_offset']) - i)) return losses, cache def do_predict(self, losses, metadata): predictions = [] for dp in metadata: curr_label_losses = [losses[index] for index in dp["indices"]] prediction_idx = sorted(enumerate(curr_label_losses), key=lambda x: x[1])[0][0] prediction = dp["options"][prediction_idx] predictions.append(prediction.strip()) return predictions def gpt3(self, prompt, max_len=0, temp=0, num_log_probs=0, echo=True, n=None): # call GPT-3 API until result is provided and then return it response = None received = False while not received: try: response = openai.Completion.create(engine=self.model_name, prompt=prompt, max_tokens=max_len, temperature=temp, logprobs=num_log_probs, echo=echo, stop='\n', n=n) received = True except: error = sys.exc_info()[0] if error == openai.error.InvalidRequestError: # something is wrong: e.g. 
prompt too long print(f"InvalidRequestError\nPrompt passed in:\n\n{prompt}\n\n") assert False print("API error:", error) time.sleep(1) return response ================================================ FILE: templates.py ================================================ import string TEMPLATES = { "financial_phrasebank": { "direct" : ("{}", "The sentiment is: {}"), "channel": ("{}", "The sentiment is: {}") }, "poem_sentiment": { "direct" : ("{}", "The sentiment is: {}"), "channel": ("{}", "The sentiment is: {}") }, "glue-mrpc": { "direct" : ("{}\nThe question is: {} True or False?", "The answer is: {}"), "channel": ("The question is: {} True or False?\n{}", "The answer is: {}") }, "glue-rte": { "direct" : ("{}\nThe question is: {} True or False?", "The answer is: {}"), "channel": ("The question is: {} True or False?\n{}", "The answer is: {}") }, "sick": { "direct" : ("{}\nThe question is: {} True or False?", "The answer is: {}"), "channel": ("The question is: {} True or False?\n{}", "The answer is: {}") }, "tweet_eval-hate": { "direct" : ("Tweet: {}", "Sentiment: {}"), "channel": ("Tweet: {}", "Sentiment: {}"), }, "openbookqa": { "direct" : ("The question is: {}", "The answer is: {}"), "channel": ("The question is: {}", "The answer is: {}") }, "ai2_arc": { "direct" : ("The question is: {}", "The answer is: {}"), "channel": ("The question is: {}", "The answer is: {}") }, "codah": { "direct" : ("The question is: {}", "The answer is: {}"), "channel": ("The question is: {}", "The answer is: {}") }, "commonsense_qa": { "direct" : ("The question is: {}", "The answer is: {}"), "channel": ("The question is: {}", "The answer is: {}") } } def apply_template(dp, dataset, method): if dataset.startswith("superglue-copa"): if method == "direct": if dp["input"].startswith("Cause: "): dp["input"] = dp["input"][7:-1] + " so" dp["output"] = dp["output"][8].lower() + dp["output"][9:] for i, options in enumerate(dp["options"]): dp["options"][i] = dp["options"][i][8].lower() + dp["options"][i][9:] elif dp["input"].startswith("Effect: "): dp["input"] = dp["input"][8:-1] + " because" dp["output"] = dp["output"][7].lower() + dp["output"][8:] for i, options in enumerate(dp["options"]): dp["options"][i] = dp["options"][i][7].lower() + dp["options"][i][8:] else: raise NotImplementedError() elif method == "channel": if dp["output"].startswith("Cause: "): dp["output"] = dp["output"][7:-1] + " so" dp["input"] = dp["input"][8].lower() + dp["input"][9:] for i, options in enumerate(dp["options"]): dp["options"][i] = dp["options"][i][7:-1] + " so" elif dp["output"].startswith("Effect: "): dp["output"] = dp["output"][8:-1] + " because" dp["input"] = dp["input"][7].lower() + dp["input"][8:] for i, options in enumerate(dp["options"]): dp["options"][i] = dp["options"][i][8:-1] + " because" else: raise NotImplementedError() elif dataset.startswith("glue") or dataset.startswith("sick"): def map_option(option): if option in ["equivalent", "entailment"]: return "True" if option in ["not_equivalent", "not_entailment", "contradiction"]: return "False" if option in ["neutral"]: return "Not sure" raise NotImplementedError(option) dp["input"] = dp["input"].replace("sentence 1: ", "").replace("sentence 2: ", "") splits = dp["input"].split(" [SEP] ") if method=="channel": splits = [splits[1], splits[0]] splits = [split if split[-1] in string.punctuation else split+"."
for split in splits] dp["input"] = TEMPLATES[dataset][method][0].format(splits[0], splits[1]) dp["output"] = TEMPLATES[dataset][method][1].format(map_option(dp["output"])) for i, options in enumerate(dp["options"]): dp["options"][i] =TEMPLATES[dataset][method][1].format(map_option(dp["options"][i])) else: def map_option(option): if dataset=="tweet_eval-hate": return {"hate": "against", "non-hate": "favor"}[option] return option dp["input"] = TEMPLATES[dataset][method][0].format(dp["input"]) dp["output"] = TEMPLATES[dataset][method][1].format(map_option(dp["output"])) for i, options in enumerate(dp["options"]): dp["options"][i] =TEMPLATES[dataset][method][1].format(map_option(dp["options"][i])) ================================================ FILE: test_gpt3.py ================================================ # Copyright (c) Facebook, Inc. and its affiliates. # All rights reserved. # # This source code is licensed under the license found in the # LICENSE file in the root directory of this source tree. import os import argparse import pickle as pkl import random import torch import math import json import string import logging import numpy as np from tqdm import tqdm from collections import Counter, defaultdict from torch.utils.data import TensorDataset, DataLoader, SequentialSampler from transformers import GPT2Tokenizer, AutoTokenizer from metaicl.data import MetaICLData from metaicl.model import MetaICLModel from utils.data import load_data from gpt3 import GPT3Model def main(logger, args): assert (args.dataset is not None and args.task is None) or (args.dataset is None and args.task is not None) tokenizer = AutoTokenizer.from_pretrained("gpt2") add_newlines = True ### checkpoint ... if not args.do_zeroshot: if args.checkpoint is not None: checkpoint = args.checkpoint assert args.global_step is None else: assert args.global_step is not None checkpoint = os.path.join(args.out_dir, "model-{}.pt".format(args.global_step)) assert os.path.exists(checkpoint) else: add_newlines = False checkpoint = None metaicl_model = GPT3Model(args.gpt3, args.api, logger) if not os.path.exists(args.out_dir): os.makedirs(args.out_dir) # setup hyperparams for data max_length_per_example = 256 max_length = 256 if args.use_demonstrations: orig_max_length = max_length if args.do_zeroshot: max_length = min(max_length * args.k, 1024) else: max_length = min(max_length * args.k, 1024) logger.info("batch_size=%d\tmax_length=%d\tmax_length_per_example=%d" % ( args.test_batch_size, max_length, max_length_per_example)) metaicl_data = MetaICLData(logger, tokenizer, args.method,args.use_demonstrations, args.k, max_length, max_length_per_example) results = [] errors = [] seeds = args.seed.split(",") config_split = "unseen_domain_test" if args.unseen_domain_only else "test" for seed in seeds: ### data ... 
train_data = load_data(args.task, "train", args.k, seed=seed, config_split=config_split, datasets=None if args.dataset is None else args.dataset.split(",")) dev_data = load_data(args.task, args.split, args.k, seed=seed, config_split=config_split, datasets=None if args.dataset is None else args.dataset.split(","), is_null=args.is_null) train_counter = Counter() dev_counter = Counter() for dp in train_data: train_counter[dp["task"]] += 1 for dp in dev_data: dev_counter[dp["task"]] += 1 for k, v in train_counter.items(): logger.info("[Train] %s\t%d" % (k, v)) for k, v in dev_counter.items(): logger.info("[Dev] %s\t%d" % (k, v)) logger.info("%s on %s (%d train, %d dev)" % (args.method, args.task, len(train_counter), len(dev_counter))) for test_task in dev_counter: curr_dev_data = [dp for dp in dev_data if dp["task"]==test_task] curr_train_data = [dp for dp in train_data if dp["task"]==test_task] assert len(curr_dev_data)>0 assert not args.use_demonstrations or len(curr_train_data)==args.k, \ (args.use_demonstrations, len(curr_train_data), args.k) config_file = "config/tasks/{}.json".format(test_task) assert os.path.exists(config_file), config_file with open(config_file, "r") as f: config = json.load(f) is_classification = config["task_type"]=="classification" if is_classification: options = curr_dev_data[0]["options"] assert np.all([d["options"]==options for d in curr_dev_data+curr_train_data]) result = run(logger, test_task, metaicl_data, metaicl_model, curr_train_data, curr_dev_data, seed, checkpoint, is_classification, add_newlines) if result is None: errors.append("%s/%s" % (test_task, seed)) else: results.append(result) if args.is_null: return logger.info("Macro-F1 of %s over %d target tasks: %.1f" % (args.task, len(results) // len(seeds), 100*np.mean(results))) if len(errors)>0: logger.info("You had errors with datasets: %s", ",".join(errors)) logger.info("Please see the error messages") def run(logger, task, metaicl_data, metaicl_model, train_data, dev_data, seed, checkpoint, is_classification, add_newlines): if args.do_zeroshot: split_name = args.split if args.is_null: split_name += "-null" cache_path = os.path.join(args.out_dir, "{}-{}-{}{}{}{}.pkl".format( task, split_name, metaicl_data.method, "-k={}".format(args.k) if args.use_demonstrations else "", "-s={}".format(seed) if args.use_demonstrations else "", "" if add_newlines else "-no-newlines")) gpt3_cache_path = os.path.join(args.out_dir, "{}-{}-{}{}{}{}.json".format( task, split_name, metaicl_data.method, "-k={}".format(args.k) if args.use_demonstrations else "", "-s={}".format(seed) if args.use_demonstrations else "", "" if add_newlines else "-no-newlines")) else: assert add_newlines cache_path = os.path.join(args.out_dir, "{}-{}-{}{}{}.pkl".format( task, args.split, metaicl_data.method, "-k={}".format(args.k) if args.use_demonstrations else "", "-s={}".format(seed) if args.use_demonstrations else "" )) gpt3_cache_path = os.path.join(args.out_dir, "{}-{}-{}{}{}.json".format( task, args.split, metaicl_data.method, "-k={}".format(args.k) if args.use_demonstrations else "", "-s={}".format(seed) if args.use_demonstrations else "" )) metaicl_data.tensorize(train_data, dev_data, add_newlines=add_newlines) gpt3_dataloader, gpt3_metadata = metaicl_model.prepare_data(train_data if args.use_demonstrations else [], dev_data, args.method, batch_size=args.test_batch_size) # metaicl_data.print_tensorized_example() logger.info(cache_path) if os.path.exists(cache_path): with open(cache_path, "rb") as f: losses = pkl.load(f) else: losses, gpt3cache
= metaicl_model.do_inference(gpt3_dataloader) with open(cache_path, "wb") as f: pkl.dump(losses, f) with open(gpt3_cache_path, "w") as f: json.dump(gpt3cache, f) if args.is_null: return None if args.use_calibration: assert args.do_zeroshot bias_path = cache_path.replace("/"+task+"-"+args.split, "/"+task+"-"+args.split+"-null") assert os.path.exists(bias_path), bias_path with open(bias_path, "rb") as f: bias_losses = pkl.load(f) losses = np.array(losses) bias_losses = np.array(bias_losses) assert losses.shape == bias_losses.shape losses -= bias_losses predictions = metaicl_model.do_predict(losses=losses, metadata=gpt3_metadata) groundtruths = [dp["output"] for dp in dev_data] perf = metaicl_data.evaluate(predictions, groundtruths, is_classification) logger.info("Accuracy=%s" % perf) prediction_path = cache_path.replace(".pkl", ".txt") if args.use_calibration: prediction_path = prediction_path.replace(".txt", "-calibrated.txt") with open(prediction_path, "w") as f: for prediction in predictions: f.write(prediction) f.write("\n") return perf if __name__=='__main__': parser = argparse.ArgumentParser() parser.add_argument("--do_zeroshot", default=False, action="store_true") parser.add_argument("--use_demonstrations", default=False, action="store_true") parser.add_argument("--use_calibration", default=False, action="store_true") parser.add_argument("--unseen_domain_only", default=False, action="store_true") parser.add_argument("--log_file", default=None, type=str) parser.add_argument("--task", type=str, default=None) parser.add_argument("--dataset", type=str, default=None) parser.add_argument("--k", type=int, default=16) parser.add_argument("--seed", type=str, default="100") parser.add_argument("--test_batch_size", type=int, default=64) parser.add_argument("--global_step", type=str, default=None) parser.add_argument("--checkpoint", type=str, default=None) parser.add_argument("--out_dir", type=str, required=True) parser.add_argument("--split", type=str, default="test") parser.add_argument("--is_null", default=False, action="store_true") parser.add_argument("--method", type=str, default="direct", choices=["direct", "channel"]) parser.add_argument("--gpt3", type=str, default="davinci", choices=["ada", "babbage", "curie", "davinci"]) parser.add_argument("--api", type=str, required=True) args = parser.parse_args() handlers = [logging.StreamHandler()] if args.log_file is not None: handlers.append(logging.FileHandler(args.log_file)) logging.basicConfig(format='%(asctime)s - %(levelname)s - %(name)s - %(message)s', datefmt='%m/%d/%Y %H:%M:%S', level=logging.INFO, handlers=handlers) logger = logging.getLogger(__name__) logger.info(args) main(logger, args)