Repository: Alrope123/rethinking-demonstrations
Branch: main
Commit: d8267faf528b
Files: 5
Total size: 43.2 KB
Directory structure:
gitextract_6owm_f3k/
├── README.md
├── create_data.py
├── gpt3.py
├── templates.py
└── test_gpt3.py
================================================
FILE CONTENTS
================================================
================================================
FILE: README.md
================================================
# Rethinking the Role of Demonstrations: What Makes In-Context Learning Work?
This includes an original implementation of "[Rethinking the Role of Demonstrations: What Makes In-Context Learning Work?][paper]" by [Sewon Min][sewon], [Xinxi Lyu][xinxi], [Ari Holtzman][ari], [Mikel Artetxe][mikel], [Mike Lewis][mike], [Hannaneh Hajishirzi][hanna], and [Luke Zettlemoyer][luke].
This code provides:
- Code for creating the demonstration variants used in the experiments.
- Commands to run the models and get numbers reported in the paper, based on the [MetaICL][metaicl] codebase.
Please leave issues for any questions about the paper or the code.
If you find our code or paper useful, please cite the paper:
```
@inproceedings{ min2022rethinking,
title={ Rethinking the Role of Demonstrations: What makes In-context Learning Work? },
author={ Min, Sewon and Lyu, Xinxi and Holtzman, Ari and Artetxe, Mikel and Lewis, Mike and Hajishirzi, Hannaneh and Zettlemoyer, Luke },
booktitle={ EMNLP },
year={ 2022 }
}
```
### Announcements
* 07/25/2022: The code also supports running GPT-3 now.
* 02/25/2022: The code supports running GPT-2, MetaICL and GPT-J for now. Please contact authors for running other models.
## Content
1. [Preparation](#preparation)
2. [Reproducing Main Experiments](#reproducing-main-experiments) (Section 4.1 of the paper)
* [No Demonstrations](#no-demonstrations)
* [Demonstrations with gold labels](#demonstrations-with-gold-labels)
* [Demonstrations with random labels](#demonstrations-with-random-labels)
3. [Reproducing Ablations](#reproducing-ablations) (Section 4.2 of the paper)
* [Number of correct labels](#number-of-correct-labels)
* [Number of input-label pairs in the demonstrations](#number-of-input-label-pairs-in-the-demonstrations)
* [Using manual templates](#using-manual-templates)
4. [Reproducing Analysis](#reproducing-analysis) (Section 5 of the paper)
* [Demonstrations with OOD input text](#demonstrations-with-ood-input-text)
* [Demonstrations with random english words](#demonstrations-with-random-english-words)
* [Demonstrations with random labels only (no inputs)](#demonstrations-with-random-labels-only-no-inputs)
* [Demonstrations with no labels (inputs only)](#demonstrations-with-no-labels-inputs-only)
## Preparation
The code is tested with python 3.8.
The data and the code are based on the MetaICL codebase.
```bash
git remote add metaicl https://github.com/facebookresearch/MetaICL.git
git pull metaicl main --allow-unrelated-histories -X ours
```
Install the data dependencies and download the data.
```bash
conda create --name metaicl-data python=3.8
conda activate metaicl-data
pip install datasets==1.4.0 wget
cd preprocess
python _build_gym.py --build --n_proc=40 --do_test
```
This uses `k=16` by default. If you want to run ablations with varying `k`, please also run the following.
```bash
python _build_gym.py --build --n_proc=40 --do_test --test_k {4|8|32}
```
After preprocessing is done, return to the main directory.
```bash
cd ../
conda deactivate
```
Now, install the model dependencies to run the model. Please note that the Transformers version is not compatible with the `datasets` library used to download the data, so make sure to use a separate environment.
```
conda create --name metaicl python=3.8
conda activate metaicl
pip install torch==1.9.0
pip install git+https://github.com/huggingface/transformers.git@c37573806ab3526dd805c49cbe2489ad4d68a9d7
```
(Optional) Install the OpenAI Python library for running GPT-3:
```
pip install openai
```
## Reproducing Main Experiments
This is for reproducing experiments in Section 4.1 of the paper.
Evaluation datasets are:
* Classification (16 datasets): `financial_phrasebank`,`poem_sentiment`,`glue-wnli`,`climate_fever`,`glue-rte`,`superglue-cb`,`sick`,`medical_questions_pairs`,`glue-mrpc`,`hate_speech18`,`ethos-national_origin`,`ethos-race`,`ethos-religion`,`tweet_eval-hate`,`tweet_eval-stance_atheism`,`tweet_eval-stance_feminist`
* Multi-choice (10 datasets): `quarel`,`openbookqa`,`qasc`,`commonsense_qa`,`ai2_arc`,`codah`,`superglue-copa`,`dream`,`quartz-with_knowledge`,`quartz-no_knowledge`
#### No Demonstrations
To run the evaluation of No-Demonstrations:
```bash
# Direct GPT-2 Large
python test.py --dataset {dataset} --gpt2 gpt2-large --method direct --out_dir out/gpt2-large --do_zeroshot
# Channel GPT-2 Large
python test.py --dataset {dataset} --gpt2 gpt2-large --method channel --out_dir out/gpt2-large --do_zeroshot
# Direct MetaICL
python test.py --dataset {dataset} --gpt2 metaicl --method direct --out_dir out/metaicl --do_zeroshot
# Channel MetaICL
python test.py --dataset {dataset} --gpt2 channel-metaicl --method channel --out_dir out/channel-metaicl --do_zeroshot
# Direct GPT-J
python test.py --dataset {dataset} --gpt2 gpt-j-6B --method direct --out_dir out/gpt-j --do_zeroshot
# Channel GPT-J
python test.py --dataset {dataset} --gpt2 gpt-j-6B --method channel --out_dir out/gpt-j --do_zeroshot
# GPT-3
python test_gpt3.py --dataset {dataset} --gpt3 {ada|babbage|curie|davinci} --method {direct|channel} --out_dir out/gpt3 --do_zeroshot --api {API key}
```
Note that `test.py` and `test_gpt3.py` do not support multi-GPU inference.
Other useful flags:
* `--test_batch_size`: adjust this based on your available GPU memory. With a 32GB GPU, you can use 64 for GPT-2 Large & MetaICL, and 16 for GPT-J **with no demonstrations**. Later, when you run the code **with demonstrations**, reducing the batch size by a factor of 4 typically works, e.g., 16 (GPT-2 Large & MetaICL) and 4 (GPT-J) with a 32GB GPU.
* `--log_file`: if you want to save logs in a file, you can specify the path to the log file.
Notes for running GPT-3:
* You can create/check your OpenAI API keys by visiting [this link](https://beta.openai.com/account/api-keys).
* Running with GPT-3 can be expensive, and different GPT-3 models come with different costs. Please check [this link](https://openai.com/api/pricing/) to estimate the cost before running each experiment.
* The responses from the GPT-3 API are cached in the `out_dir`.
From now on, we will use the above commands as a default and tell you which flags you need to add.
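For orientation, the `direct` and `channel` methods differ in which side of each example is scored. The sketch below mirrors the prompt layout used by `prepare_data` and the selection rule used by `do_predict` in `gpt3.py`; the function names `build_prompts` and `pick_option` are illustrative and not part of the codebase, and the layout used by `test.py` for the other models may differ.

```python
def build_prompts(demos, test_input, options, method, sep="\n"):
    """Return one (context, target) pair per candidate option.

    direct : context ends with the test input, the option is scored.
    channel: context ends with the option, the test input is scored.
    """
    if method == "direct":
        demo_text = "".join(
            "{}{}{}\n\n\n".format(d["input"], sep, d["output"]) for d in demos
        )
        return [(demo_text + test_input + sep, option) for option in options]
    if method == "channel":
        demo_text = "".join(
            "{}{}{}\n\n\n".format(d["output"], sep, d["input"]) for d in demos
        )
        return [(demo_text + option + sep, test_input) for option in options]
    raise ValueError("unknown method: {}".format(method))


def pick_option(options, losses):
    """Choose the option whose target has the lowest average token loss."""
    best = min(range(len(options)), key=lambda i: losses[i])
    return options[best]


demos = [{"input": "great movie", "output": "positive"}]
direct_pairs = build_prompts(demos, "dull plot", ["positive", "negative"], "direct")
channel_pairs = build_prompts(demos, "dull plot", ["positive", "negative"], "channel")
```

The losses passed to `pick_option` stand in for the per-option average negative log-probabilities that the model returns.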
#### Demonstrations with gold labels
Run the same commands as the [default commands](#no-demonstrations), but add `--use_demonstrations --k 16 --seed 100,13,21,42,87`.
#### Demonstrations with random labels
Create the demonstrations with random labels via:
```bash
python create_data.py --variant random --dataset {dataset}
```
Then, run the same commands as the [default commands](#no-demonstrations), but add `--use_demonstrations --k 16 --seed 100,13,21,42,87 --dataset {dataset}_random`.
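As a rough illustration of what the `random` variant does, the sketch below replaces each demonstration's label with a uniformly sampled option while leaving the inputs untouched. It is a minimal sketch, assuming demonstrations are dictionaries with `input`, `output`, and `options` keys as in the data files; `randomize_labels` is a hypothetical helper, and it uses the standard-library `random` module rather than NumPy, so it will not reproduce the exact draws made by `create_data.py`.

```python
import random


def randomize_labels(demos, seed):
    """Return copies of the demonstrations with labels drawn uniformly from the options."""
    rng = random.Random(seed)  # one generator per seed keeps runs repeatable
    return [{**dp, "output": rng.choice(dp["options"])} for dp in demos]


demos = [
    {"input": "the film was a delight", "output": "positive", "options": ["positive", "negative"]},
    {"input": "a tedious, joyless slog", "output": "negative", "options": ["positive", "negative"]},
]
random_demos = randomize_labels(demos, seed=100)
```

Because labels are sampled uniformly, some demonstrations may keep their gold label by chance.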
## Reproducing Ablations
This is for reproducing experiments in Section 4.2 of the paper.
Evaluation datasets are:
* Classification (5 datasets): `poem_sentiment`,`glue-rte`,`sick`,`glue-mrpc`,`tweet_eval-hate`
* Multi-choice (4 datasets): `openbookqa`,`commonsense_qa`,`ai2_arc`,`superglue-copa`
#### Number of correct labels
Create the demonstrations with a varying number of correct labels via:
```bash
python create_data.py --variant {75|50|25|0}_correct --dataset {dataset}
```
Then, run the same commands as the [default commands](#no-demonstrations), but add `--use_demonstrations --k 16 --seed 100,13,21,42,87 --dataset {dataset}_{75|50|25|0}_correct`.
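The arithmetic behind these variants can be sketched as follows: with `k` demonstrations and a target percentage, `k * percent // 100` demonstrations keep their gold label and every other demonstration receives a label that is guaranteed to be wrong. This is a minimal sketch assuming the same dictionary layout as the data files; `corrupt_labels` is a hypothetical name, and the standard-library `random` module stands in for the NumPy calls in `create_data.py`.

```python
import random


def corrupt_labels(demos, percent_correct, seed):
    """Keep `percent_correct`% of gold labels; give the rest an incorrect option."""
    rng = random.Random(seed)
    k = len(demos)
    num_correct = k * percent_correct // 100
    keep = set(rng.sample(range(k), num_correct))  # indices that stay correct
    result = []
    for idx, dp in enumerate(demos):
        if idx in keep:
            result.append(dict(dp))
        else:
            wrong = [option for option in dp["options"] if option != dp["output"]]
            result.append({**dp, "output": rng.choice(wrong)})
    return result


demos = [
    {"input": "example {}".format(i), "output": "yes", "options": ["yes", "no"]}
    for i in range(16)
]
mostly_correct = corrupt_labels(demos, 75, seed=100)
all_wrong = corrupt_labels(demos, 0, seed=100)
```

With `k=16`, the 75%, 50%, and 25% settings keep 12, 8, and 4 gold labels respectively.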
#### Number of input-label pairs in the demonstrations
(Note that you should have run preprocessing with varying `k` to run this ablation. If you have not done this, please re-visit the [Preparation](#preparation) section.)
Create the demonstrations with varying `k` via:
```bash
python create_data.py --variant random --dataset {dataset} --k {4|8|16|32}
```
Then, run the same commands as the [default commands](#no-demonstrations), but add `--use_demonstrations --k {4|8|16|32} --seed 100,13,21,42,87 --dataset {dataset}_random`.
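If this step fails with a missing-file error, it can help to check which preprocessed files exist for a given `k`. The sketch below follows the file-naming pattern that `create_data.py` reads from, `{data_dir}/{dataset}/{dataset}_{k}_{seed}_{split}.jsonl`; the helpers `split_path` and `missing_files` are hypothetical and only illustrate the convention.

```python
import os


def split_path(data_dir, dataset, k, seed, split):
    """Build the path of one preprocessed split, following the naming in create_data.py."""
    filename = "{}_{}_{}_{}.jsonl".format(dataset, k, seed, split)
    return os.path.join(data_dir, dataset, filename)


def missing_files(data_dir, dataset, k, seeds):
    """List the train/test files that are expected for this k but not found on disk."""
    expected = [
        split_path(data_dir, dataset, k, seed, split)
        for seed in seeds
        for split in ("train", "test")
    ]
    return [path for path in expected if not os.path.exists(path)]


example = split_path("data", "glue-rte", 4, 100, "train")
```

An empty list from `missing_files` suggests preprocessing was run with the matching `--test_k` value.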
#### Using manual templates
Create the demonstrations with varying label types and inference methods via:
```bash
python create_data.py --variant {gold|random}_w_template --dataset {dataset} --method {direct|channel}
```
Then, run the same commands as the [default commands](#no-demonstrations), but add `--use_demonstrations --k 16 --seed 100,13,21,42,87 --dataset {dataset}_{gold|random}_w_template_{direct|channel}`.
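To give a sense of what a manual template produces, the sketch below applies the `direct` template that `templates.py` defines for `glue-rte` to a single sentence pair. It is a simplified illustration: `apply_rte_template` is a hypothetical helper, and the real `apply_template` additionally strips the `sentence 1:` / `sentence 2:` prefixes, adds missing final punctuation, and rewrites every option.

```python
# Template strings and label words copied from templates.py (glue-rte, direct).
INPUT_TEMPLATE = "{}\nThe question is: {} True or False?"
OUTPUT_TEMPLATE = "The answer is: {}"
LABEL_WORDS = {"entailment": "True", "not_entailment": "False"}


def apply_rte_template(premise, hypothesis, label):
    """Format one premise/hypothesis pair and its label with the manual template."""
    templated_input = INPUT_TEMPLATE.format(premise, hypothesis)
    templated_output = OUTPUT_TEMPLATE.format(LABEL_WORDS[label])
    return templated_input, templated_output


templated = apply_rte_template("A dog runs.", "A dog moves.", "entailment")
```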
## Reproducing Analysis
This is for reproducing experiments in Section 5 of the paper.
Evaluation datasets are:
* Classification (5 datasets): `poem_sentiment`,`glue-rte`,`sick`,`glue-mrpc`,`tweet_eval-hate`
* Multi-choice (4 datasets): `openbookqa`,`commonsense_qa`,`ai2_arc`,`superglue-copa`
#### Demonstrations with OOD input text
First, you need a corpus file in `.txt` format, where each line is a plain-text sentence.
In the paper, we used samples from the English portion of CC News, which we are unable to release here.
Please visit [this link](https://commoncrawl.org/2016/10/news-dataset-available/) to learn more about how to download the CC News corpus.
Create the demonstrations with OOD input text via:
```bash
python create_data.py --variant ood_inputs --dataset {dataset} --corpus_path {corpus_path}
```
Then, run the same commands as the [default commands](#no-demonstrations), but add `--use_demonstrations --k 16 --seed 100,13,21,42,87 --dataset {dataset}_ood_inputs`.
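In `create_data.py`, corpus sentences are sampled so that their length is close to that of each test input: a sentence with `n` words gets weight `exp(-(n - l)^2 / 50)`, where `l` is the word count of the test input. The sketch below illustrates that weighting with the standard library only. `sample_length_matched` is a hypothetical helper, and its sequential draw without replacement only approximates the NumPy call used in the script.

```python
import math
import random


def sample_length_matched(corpus, target_len, k, seed):
    """Draw k distinct sentences, favouring those whose word count is near target_len."""
    rng = random.Random(seed)
    pool = list(corpus)
    # Gaussian-shaped weight on the difference in word counts (same form as create_data.py).
    weights = [math.exp(-((len(s.split()) - target_len) ** 2) / 50) for s in pool]
    picked = []
    for _ in range(k):
        # random.choices raises ValueError if every remaining weight underflows to zero.
        idx = rng.choices(range(len(pool)), weights=weights)[0]
        picked.append(pool.pop(idx))
        weights.pop(idx)
    return picked


corpus = [
    "markets rallied",
    "the council approved the budget",
    "heavy rain is expected across the region tomorrow",
    "officials said talks would resume",
    "shares fell",
]
samples = sample_length_matched(corpus, target_len=5, k=3, seed=100)
```

The corpus must contain at least `k` sentences, since sampling is without replacement.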
#### Demonstrations with random english words
Create the demonstrations with random English words as labels via:
```bash
python create_data.py --variant random_english_words --dataset {dataset}
```
Then, run the same commands as the [default commands](#no-demonstrations), but add `--use_demonstrations --k 16 --seed {seed} --dataset {dataset}_random_english_words_seed={seed}`, where `seed` is one of 100, 13, 21, 42, and 87.
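The seed appears in the dataset name because the sampled words differ across seeds. For classification tasks, the idea can be sketched as drawing one distinct word per label and rewriting the demonstrations through that mapping. This is a minimal sketch with a toy vocabulary: `remap_labels` is a hypothetical helper, whereas `create_data.py` draws from the `english_words` package with NumPy and handles multi-choice tasks separately.

```python
import random


def remap_labels(demos, labels, vocabulary, seed):
    """Map each label to a distinct random word and rewrite outputs and options."""
    rng = random.Random(seed)
    new_words = rng.sample(sorted(vocabulary), len(labels))  # distinct words
    mapping = dict(zip(labels, new_words))
    remapped = [
        {
            **dp,
            "output": mapping[dp["output"]],
            "options": [mapping[option] for option in dp["options"]],
        }
        for dp in demos
    ]
    return mapping, remapped


labels = ["positive", "negative"]
demos = [
    {"input": "the film was a delight", "output": "positive", "options": labels},
    {"input": "a tedious, joyless slog", "output": "negative", "options": labels},
]
vocabulary = {"lantern", "gravel", "orchid", "timber", "velvet"}
mapping, remapped = remap_labels(demos, labels, vocabulary, seed=100)
```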
#### Demonstrations with random labels only (no inputs)
Create the demonstrations with random labels only via:
```bash
python create_data.py --variant random_labels_only --dataset {dataset}
```
Then, run the same commands as the [default commands](#no-demonstrations), but add `--use_demonstrations --k 16 --seed 100,13,21,42,87 --dataset {dataset}_random_labels_only`.
#### Demonstrations with no labels (inputs only)
Create the demonstrations with no labels via:
```bash
python create_data.py --variant no_labels --dataset {dataset}
```
Then, run the same commands as the [default commands](#no-demonstrations), but add `--use_demonstrations --k 16 --seed 100,13,21,42,87 --dataset {dataset}_no_labels`.
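This variant and the preceding labels-only variant are mirror images: one blanks the label side of each demonstration, the other blanks the input side. The sketch below shows both transformations as `create_data.py` applies them to a demonstration record; `strip_labels` and `strip_inputs` are hypothetical names, and the labels-only variant in the script also randomizes the labels, which is omitted here.

```python
def strip_labels(demos):
    """Inputs only: empty the label and collapse the options to a single empty string."""
    return [{**dp, "output": "", "options": [""]} for dp in demos]


def strip_inputs(demos):
    """Labels only: empty the input text and keep the label fields."""
    return [{**dp, "input": ""} for dp in demos]


demos = [
    {"input": "the film was a delight", "output": "positive", "options": ["positive", "negative"]},
]
inputs_only = strip_labels(demos)
labels_only = strip_inputs(demos)
```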
[paper]: https://arxiv.org/abs/2202.12837
[sewon]: http://shmsw25.github.io/
[xinxi]: https://alrope123.github.io/
[ari]: https://ari-holtzman.github.io/
[mikel]: https://scholar.google.com/citations?user=N5InzP8AAAAJ&hl=en
[mike]: https://ai.facebook.com/people/mike-lewis/
[hanna]: https://homes.cs.washington.edu/~hannaneh/index.html
[luke]: https://www.cs.washington.edu/people/faculty/lsz
[metaicl]: https://github.com/facebookresearch/MetaICL
================================================
FILE: create_data.py
================================================
import os
import argparse
import random
import json
import numpy as np
from collections import defaultdict, Counter
from templates import apply_template
def main(args):
    assert args.variant in [
        "gold", "random", # main experiments in Section 4
        "75_correct", "50_correct", "25_correct", "0_correct", # ablations in Section 4
        "gold_w_template", "random_w_template", # ablations in Section 4
        "ood_inputs", "random_english_words", "random_labels_only", "no_labels", # Section 5
        "random_english_words_gold_labels", "permutated_labels", "random_true_distribution"
    ]

    if args.variant in ["gold_w_template", "random_w_template"]:
        assert args.method is not None, "Please specify `--method` with the inference method (`direct` or `channel`) for using the template."
        assert args.method in ["direct", "channel"], "Please make sure to use either `direct` or `channel`."

    if args.variant=="gold":
        print ("No need to run `create_data.py` --- you can use the original data as it is.")
        return

    if args.variant=="ood_inputs":
        # load sources for OOD inputs
        assert args.corpus_path is not None, \
            """
            Please note that you need to specify the path to the corpus from which the OOD inputs will be sampled.
            It should be a .txt file where each line contains a sentence (plain text).
            """
        grouped_samples = defaultdict(list)
        with open(args.corpus_path, "r") as f:
            random_texts = []
            random_text_lens = []
            for line in f:
                line = line.strip()
                random_texts.append(line)
                random_text_lens.append(len(line.split()))
            random_text_lens = np.array(random_text_lens)

    elif args.variant in ["random_english_words", "random_english_words_gold_labels"]:
        from english_words import english_words_set
        english_words_set = sorted(english_words_set)

    datasets = args.dataset.split(',')
    new_datasets = [dataset + "_" + args.variant + (("_" + args.method) if args.method is not None else "") for dataset in datasets]
    assert len(datasets) == len(new_datasets)

    ################################################################################################################

    seeds = args.seed.split(',')
    perfs = []
    for dataset_idx, (dataset, new_dataset) in enumerate(zip(datasets, new_datasets)):
        # construct and save a new config file and data directory
        config_file = os.path.join(args.config_dir, "tasks")
        assert os.path.exists(config_file), config_file
        with open(os.path.join(config_file, "{}.json".format(dataset)), "r") as f:
            config = json.load(f)

        # in case of random English words, we will create a config file and data directory
        # for each random seed later on (since the data is different across seeds)
        if args.variant not in ["random_english_words", "random_english_words_gold_labels"]:
            with open(os.path.join(config_file, "{}.json".format(new_dataset)), "w") as f:
                json.dump(config, f)
            new_dataset_dir = os.path.join(args.data_dir, new_dataset)
            if not os.path.exists(new_dataset_dir):
                os.mkdir(new_dataset_dir)

        # load full training data to get the distribution of the labels
        if args.variant=="random_true_distribution":
            full_train_data_path = os.path.join(args.data_dir, dataset, "{}_16384_100_train.jsonl".format(dataset))
            assert os.path.exists(full_train_data_path), "Please generate full training data first by running _build_gym.py with k=16384."
            full_train_data_labels = []
            with open(full_train_data_path, "r") as f:
                for line in f:
                    dp = json.loads(line)
                    assert dp["task"]==dataset
                    full_train_data_labels.append(dp["output"])
            train_label_counter = Counter(full_train_data_labels)
            train_label_distribution = {label : train_label_counter[label] / len(full_train_data_labels) for label in train_label_counter}

        for seed in seeds:
            # random seed
            np.random.seed(int(seed))

            if args.variant in ["random_english_words", "random_english_words_gold_labels"]:
                new_dataset = new_datasets[dataset_idx] + "_seed={}".format(seed)

            # read the original training and test data
            # note that we are modifying the training data only,
            # and the test data will always be the same
            # (we are creating duplicates only for convenience)
            train_data = []
            train_data_path = os.path.join(args.data_dir, dataset, "{}_{}_{}_{}.jsonl".format(dataset, args.k, seed, "train"))
            with open(train_data_path, "r") as f:
                for line in f:
                    dp = json.loads(line)
                    assert dp["task"]==dataset
                    dp["task"] = new_dataset
                    train_data.append(dp)

            test_data = []
            test_data_path = os.path.join(args.data_dir, dataset, "{}_{}_{}_{}.jsonl".format(dataset, args.k, seed, "test"))
            with open(test_data_path, "r") as f:
                for line in f:
                    dp = json.loads(line)
                    assert dp["task"]==dataset
                    dp["task"] = new_dataset
                    test_data.append(dp)

            # apply templates to inputs and labels
            if args.variant in ["gold_w_template", "random_w_template"]:
                for dp in train_data:
                    apply_template(dp, dataset, args.method)
                for dp in test_data:
                    apply_template(dp, dataset, args.method)

            # now, for random_english_words, create a config file and data directory
            if args.variant in ["random_english_words", "random_english_words_gold_labels"]:
                new_dataset_dir = os.path.join(args.data_dir, new_dataset)
                if not os.path.exists(new_dataset_dir):
                    os.mkdir(new_dataset_dir)

                if config["task_type"]=="classification":
                    new_options = list(np.random.choice(english_words_set, size=len(config["options"]), replace=False))
                    new_mapping = {option: new_option for option, new_option in zip(config["options"], new_options)}
                    new_config = config.copy()
                    new_config["options"] = new_options
                    with open(os.path.join(config_file, "{}.json".format(new_dataset)), "w") as f:
                        json.dump(new_config, f)
                    for i, dp in enumerate(train_data):
                        train_data[i]["output"] = new_mapping[dp["output"]]
                        train_data[i]["options"] = [new_mapping[option] for option in dp["options"]]
                    if args.variant == "random_english_words_gold_labels":
                        # also modify the test data for classification tasks
                        for i, dp in enumerate(test_data):
                            test_data[i]["output"] = new_mapping[dp["output"]]
                            test_data[i]["options"] = [new_mapping[option] for option in dp["options"]]

                elif config["task_type"]=="multi-choice":
                    with open(os.path.join(config_file, "{}.json".format(new_dataset)), "w") as f:
                        json.dump(config, f)
                    shuffled_indices = np.random.permutation(range(len(english_words_set)))
                    shuffled_options = [english_words_set[i] for i in shuffled_indices]
                    offset = 0
                    for i, dp in enumerate(train_data):
                        new_options = shuffled_options[offset:offset+len(dp["options"])]
                        offset += len(dp["options"])
                        train_data[i]["output"] = new_options[dp["options"].index(dp["output"])]
                        train_data[i]["options"] = new_options

                else:
                    raise NotImplementedError()

            # modify both train input and test input for permutated_labels with classification tasks
            if args.variant == "permutated_labels" and config["task_type"]=="classification":
                old_options = config["options"]
                new_options = [old_options[(i+1)%len(old_options)] for i in range(len(old_options))]
                new_mapping = {old_option: new_option for old_option, new_option in zip(old_options, new_options)}
                for i, dp in enumerate(train_data):
                    train_data[i]["output"] = new_mapping[dp["output"]]
                for i, dp in enumerate(test_data):
                    test_data[i]["output"] = new_mapping[dp["output"]]

            ## modify labels in the training data
            if args.variant in ["75_correct", "50_correct", "25_correct"]:
                num_correct = args.k * int(args.variant.split("_")[0]) // 100
                indices_correct = np.random.permutation(range(args.k))[:num_correct]

            for dp_idx, dp in enumerate(train_data):
                if args.variant in ["gold", "gold_w_template", "permutated_labels", "random_english_words_gold_labels"] or \
                        (args.variant in ["75_correct", "50_correct", "25_correct"] and dp_idx in indices_correct):
                    # assign correct label
                    pass
                elif args.variant.endswith("_correct"):
                    # assign incorrect label
                    dp["output"] = dp["options"][np.random.choice([i for i in range(len(dp["options"])) if dp["options"][i] != dp["output"]])]
                elif args.variant=="no_labels":
                    # assign empty label
                    dp["output"] = ""
                    dp["options"] = [""]
                elif args.variant=="random_true_distribution":
                    # assign random labels according to the distribution in the training data
                    dp["output"] = np.random.choice(list(train_label_distribution.keys()), p=list(train_label_distribution.values()))
                else:
                    # assign random label
                    dp["output"] = np.random.choice(dp["options"])

            ## modify inputs in the training data
            if args.variant=="random_labels_only":
                for dp in train_data:
                    dp["input"] = ""
            elif args.variant=="ood_inputs":
                # one list of k demonstrations per test example, sampled to match its length
                new_train_data = []
                for dp in test_data:
                    l = len(dp["input"].split())
                    prob = np.exp(-np.power(random_text_lens-l, 2)/50)
                    prob /= np.sum(prob)
                    samples = np.random.choice(random_texts, size=args.k, replace=False, p=prob)
                    assert len(samples)==len(train_data)
                    new_train_data.append([])
                    for train_dp, sample in zip(train_data, samples):
                        new_train_data[-1].append({"task": train_dp["task"],
                                                   "input": sample,
                                                   "output": train_dp["output"],
                                                   "options": train_dp["options"]})
                train_data = new_train_data

            # write the modified data
            with open(os.path.join(new_dataset_dir, "{}_{}_{}_{}.jsonl".format(new_dataset, args.k, seed, "train")), "w") as f:
                for dp in train_data:
                    f.write(json.dumps(dp))
                    f.write("\n")

            with open(os.path.join(new_dataset_dir, "{}_{}_{}_{}.jsonl".format(new_dataset, args.k, seed, "test")), "w") as f:
                for dp in test_data:
                    f.write(json.dumps(dp))
                    f.write("\n")

            print ("Done for %s seed=%s" % (new_dataset, seed))

if __name__=='__main__':
    parser = argparse.ArgumentParser()
    parser.add_argument("--dataset", type=str, default=None)
    parser.add_argument("--k", type=int, default=16)
    parser.add_argument("--seed", type=str, default="100,13,21,42,87")
    parser.add_argument("--variant", type=str, default="random", required=True)
    parser.add_argument("--method", type=str, default=None)
    parser.add_argument("--data_dir", type=str, default="data")
    parser.add_argument("--config_dir", type=str, default="config")
    parser.add_argument("--corpus_path", type=str, default=None)
    args = parser.parse_args()
    main(args)
================================================
FILE: gpt3.py
================================================
import time
import sys
import numpy as np
import torch
import json
import openai
from torch.utils.data import TensorDataset, DataLoader, SequentialSampler
from transformers import GPT2Tokenizer
class GPT3Model(object):

    def __init__(self, model_name, api_key, logger=None):
        self.model_name = model_name
        try:
            openai.api_key = api_key
        except Exception:
            pass
        self.tokenizer = GPT2Tokenizer.from_pretrained("gpt2-xl")
        self.logger=logger

    def prepare_data(self, train_data, test_data, method, batch_size=10, dp_sep="\n", max_length=1024):
        # format demonstrations
        demonstrations = ""
        for dp in train_data:
            if method=="direct":
                demonstrations += "{}{}{}\n\n\n".format(dp["input"], dp_sep, dp["output"])
            elif method=="channel":
                demonstrations += "{}{}{}\n\n\n".format(dp["output"], dp_sep, dp["input"])
            else:
                raise NotImplementedError()

        # append demonstrations and separate options
        inputs = []
        outputs = []
        metadata = []
        for dp in test_data:
            prompt = dp["input"]
            options = dp["options"]
            indices = [i for i in range(len(inputs), len(inputs) + len(options))]
            metadata.append({"indices": indices, "options": options})
            if method=="direct":
                inputs += [demonstrations + prompt + dp_sep for option in options]
                outputs += [option for option in options]
            elif method=="channel":
                inputs += [demonstrations + option + dp_sep for option in options]
                outputs += [prompt for option in options]
            else:
                raise NotImplementedError()

        # truncate inputs
        for i, (inp, out) in enumerate(zip(inputs, outputs)):
            input_ids = self.tokenizer.encode(inp)
            output_ids = self.tokenizer.encode(out)
            if (len(input_ids) + len(output_ids) > max_length):
                input_ids = input_ids[len(input_ids)+len(output_ids) - max_length:]
                assert len(input_ids)+len(output_ids) == max_length
            inputs[i] = self.tokenizer.decode(input_ids)

        if self.logger is not None:
            self.logger.info("Checking the first example...")
            self.logger.info(inputs[0] + "" + outputs[0])

        # construct a dataloader
        dataset = zip(inputs, outputs)
        input_chunks = [inputs[i : i + batch_size] for i in range(0, len(inputs), batch_size)]
        output_chunks = [outputs[i : i + batch_size] for i in range(0, len(outputs), batch_size)]
        dataloader = [(input_chunks[i], output_chunks[i]) for i in range(0, len(input_chunks))]

        return dataloader, metadata

    def do_inference(self, dataloader):
        losses = []
        cache = []
        cost = 0
        for inputs, outputs in dataloader:
            data = [inp + out for inp, out in zip(inputs, outputs)]
            response = self.gpt3(data)
            for choice in response["choices"]:
                cost += len(choice["logprobs"]["tokens"]) * 0.00006
            # print("current cost = " + str(cost))
            cache.append((data, response))
            # get the beginning of the target from the response (based on tokenization)
            for inp, outp, out in zip(inputs, outputs, response["choices"]):
                assert inp+outp==out["text"]
                i = 0
                while out['logprobs']['text_offset'][i] < len(inp):
                    i += 1
                loss = -sum(out['logprobs']["token_logprobs"][i:])
                losses.append(loss / (len(out['logprobs']['text_offset']) - i))
        return losses, cache

    def do_predict(self, losses, metadata):
        predictions = []
        for dp in metadata:
            curr_label_losses = [losses[index] for index in dp["indices"]]
            prediction_idx = sorted(enumerate(curr_label_losses), key=lambda x: x[1])[0][0]
            prediction = dp["options"][prediction_idx]
            predictions.append(prediction.strip())
        return predictions

    def gpt3(self, prompt, max_len=0, temp=0, num_log_probs=0, echo=True, n=None):
        # call GPT-3 API until result is provided and then return it
        response = None
        received = False
        while not received:
            try:
                response = openai.Completion.create(engine=self.model_name,
                                                    prompt=prompt,
                                                    max_tokens=max_len,
                                                    temperature=temp,
                                                    logprobs=num_log_probs,
                                                    echo=echo,
                                                    stop='\n',
                                                    n=n)
                received = True
            except:
                error = sys.exc_info()[0]
                if error == openai.error.InvalidRequestError:
                    # something is wrong: e.g. prompt too long
                    print(f"InvalidRequestError\nPrompt passed in:\n\n{prompt}\n\n")
                    assert False
                print("API error:", error)
                time.sleep(1)
        return response

================================================
FILE: templates.py
================================================
import string
TEMPLATES = {
"financial_phrasebank": {
"direct" : ("{}", "The sentiment is: {}"),
"channel": ("{}", "The sentiment is: {}")
},
"poem_sentiment": {
"direct" : ("{}", "The sentiment is: {}"),
"channel": ("{}", "The sentiment is: {}")
},
"glue-mrpc": {
"direct" : ("{}\nThe question is: {} True or False?", "The answer is: {}"),
"channel": ("The question is: {} True or False?\n{}", "The answer is: {}")
},
"glue-rte": {
"direct" : ("{}\nThe question is: {} True or False?", "The answer is: {}"),
"channel": ("The question is: {} True or False?\n{}", "The answer is: {}")
},
"sick": {
"direct" : ("{}\nThe question is: {} True or False?", "The answer is: {}"),
"channel": ("The question is: {} True or False?\n{}", "The answer is: {}")
},
"tweet_eval-hate": {
"direct" : ("Tweet: {}", "Sentiment: {}"),
"channel": ("Tweet: {}", "Sentiment: {}"),
},
"openbookqa": {
"direct" : ("The question is: {}", "The answer is: {}"),
"channel": ("The question is: {}", "The answer is: {}")
},
"ai2_arc": {
"direct" : ("The question is: {}", "The answer is: {}"),
"channel": ("The question is: {}", "The answer is: {}")
},
"codah": {
"direct" : ("The question is: {}", "The answer is: {}"),
"channel": ("The question is: {}", "The answer is: {}")
},
"commonsense_qa": {
"direct" : ("The question is: {}", "The answer is: {}"),
"channel": ("The question is: {}", "The answer is: {}")
}
}
def apply_template(dp, dataset, method):
    if dataset.startswith("superglue-copa"):
        if method == "direct":
            if dp["input"].startswith("Cause: "):
                dp["input"] = dp["input"][7:-1] + " so"
                dp["output"] = dp["output"][8].lower() + dp["output"][9:]
                for i, options in enumerate(dp["options"]):
                    dp["options"][i] = dp["options"][i][8].lower() + dp["options"][i][9:]
            elif dp["input"].startswith("Effect: "):
                dp["input"] = dp["input"][8:-1] + " because"
                dp["output"] = dp["output"][7].lower() + dp["output"][8:]
                for i, options in enumerate(dp["options"]):
                    dp["options"][i] = dp["options"][i][7].lower() + dp["options"][i][8:]
            else:
                raise NotImplementedError()
        elif method == "channel":
            if dp["output"].startswith("Cause: "):
                dp["output"] = dp["output"][7:-1] + " so"
                dp["input"] = dp["input"][8].lower() + dp["input"][9:]
                for i, options in enumerate(dp["options"]):
                    dp["options"][i] = dp["options"][i][7:-1] + " so"
            elif dp["output"].startswith("Effect: "):
                dp["output"] = dp["output"][8:-1] + " because"
                dp["input"] = dp["input"][7].lower() + dp["input"][8:]
                for i, options in enumerate(dp["options"]):
                    dp["options"][i] = dp["options"][i][8:-1] + " because"
            else:
                # the original passed an undefined name `o` here, which would raise NameError
                raise NotImplementedError()

    elif dataset.startswith("glue") or dataset.startswith("sick"):
        def map_option(option):
            if option in ["equivalent", "entailment"]:
                return "True"
            if option in ["not_equivalent", "not_entailment", "contradiction"]:
                return "False"
            if option in ["neutral"]:
                return "Not sure"
            raise NotImplementedError(option)

        dp["input"] = dp["input"].replace("sentence 1: ", "").replace("sentence 2: ", "")
        splits = dp["input"].split(" [SEP] ")
        if method=="channel":
            splits = [splits[1], splits[0]]
        splits = [split if split[-1] in string.punctuation else split+"." for split in splits]
        dp["input"] = TEMPLATES[dataset][method][0].format(splits[0], splits[1])
        dp["output"] = TEMPLATES[dataset][method][1].format(map_option(dp["output"]))
        for i, options in enumerate(dp["options"]):
            dp["options"][i] = TEMPLATES[dataset][method][1].format(map_option(dp["options"][i]))

    else:
        def map_option(option):
            if dataset=="tweet_eval-hate":
                return {"hate": "against", "non-hate": "favor"}[option]
            return option

        dp["input"] = TEMPLATES[dataset][method][0].format(dp["input"])
        dp["output"] = TEMPLATES[dataset][method][1].format(map_option(dp["output"]))
        for i, options in enumerate(dp["options"]):
            dp["options"][i] = TEMPLATES[dataset][method][1].format(map_option(dp["options"][i]))

================================================
FILE: test_gpt3.py
================================================
# Copyright (c) Facebook, Inc. and its affiliates.
# All rights reserved.
#
# This source code is licensed under the license found in the
# LICENSE file in the root directory of this source tree.
import os
import argparse
import pickle as pkl
import random
import torch
import math
import json
import string
import logging
import numpy as np
from tqdm import tqdm
from collections import Counter, defaultdict
from torch.utils.data import TensorDataset, DataLoader, SequentialSampler
from transformers import GPT2Tokenizer, AutoTokenizer
from metaicl.data import MetaICLData
from metaicl.model import MetaICLModel
from utils.data import load_data
from gpt3 import GPT3Model
def main(logger, args):
assert (args.dataset is not None and args.task is None) or (args.dataset is None and args.task is not None)
tokenizer = AutoTokenizer.from_pretrained("gpt2")
add_newlines = True
### checkpoint ...
if not args.do_zeroshot:
if args.checkpoint is not None:
checkpoint = args.checkpoint
assert args.global_step is None
else:
assert args.global_step is not None
checkpoint = os.path.join(args.out_dir, "model-{}.pt".format(args.global_step))
assert os.path.exists(checkpoint)
else:
add_newlines = False
checkpoint = None
metaicl_model = GPT3Model(args.gpt3, args.api, logger)
if not os.path.exists(args.out_dir):
os.makedirs(args.out_dir)
# setup hyperparams for data
max_length_per_example = 256
max_length = 256
if args.use_demonstrations:
orig_max_length = max_length
if args.do_zeroshot:
max_length = min(max_length * args.k, 1024)
else:
max_length = min(max_length * args.k, 1024)
logger.info("batch_size=%d\tmax_length=%d\tmax_length_per_example=%d" % (
args.test_batch_size, max_length, max_length_per_example))
metaicl_data = MetaICLData(logger, tokenizer, args.method,args.use_demonstrations, args.k,
max_length, max_length_per_example)
results = []
errors = []
seeds = args.seed.split(",")
config_split = "unseen_domain_test" if args.unseen_domain_only else "test"
for seed in seeds:
### data ...
train_data = load_data(args.task, "train", args.k, seed=seed, config_split=config_split,
datasets=None if args.dataset is None else args.dataset.split(","))
dev_data = load_data(args.task, args.split, args.k, seed=seed, config_split=config_split,
datasets=None if args.dataset is None else args.dataset.split(","), is_null=args.is_null)
train_counter = Counter()
dev_counter = Counter()
for dp in train_data:
train_counter[dp["task"]] += 1
for dp in dev_data:
dev_counter[dp["task"]] += 1
for k, v in train_counter.items():
logger.info("[Train] %s\t%d" % (k, v))
for k, v in dev_counter.items():
logger.info("[Dev] %s\t%d" % (k, v))
logger.info("%s on %s (%d train, %d dev)" % (args.method, args.task, len(train_counter), len(dev_counter)))
for test_task in dev_counter:
curr_dev_data = [dp for dp in dev_data if dp["task"]==test_task]
curr_train_data = [dp for dp in train_data if dp["task"]==test_task]
assert len(curr_dev_data)>0
assert not args.use_demonstrations or len(curr_train_data)==args.k, \
(args.use_demonstrations, len(curr_train_data), args.k)
config_file = "config/tasks/{}.json".format(test_task)
assert os.path.exists(config_file), config_file
with open(config_file, "r") as f:
config = json.load(f)
is_classification = config["task_type"]=="classification"
if is_classification:
options = curr_dev_data[0]["options"]
assert np.all([d["options"]==options for d in curr_dev_data+curr_train_data])
result = run(logger, test_task, metaicl_data, metaicl_model,
curr_train_data, curr_dev_data, seed, checkpoint, is_classification, add_newlines)
if result is None:
errors.append("%s/%s" % (test_task, seed))
else:
results.append(result)
if args.is_null:
return
logger.info("Macro-F1 of %s over %d target tasks: %.1f" % (args.task, len(results) // len(seeds), 100*np.mean(results)))
if len(errors)>0:
logger.info("You had errors with datasets: %s", ",".join(errors))
logger.info("Please see the error messages")
def run(logger, task, metaicl_data, metaicl_model, train_data, dev_data, seed,
checkpoint, is_classification, add_newlines):
if args.do_zeroshot:
split_name = args.split
if args.is_null:
split_name += "-null"
cache_path = os.path.join(args.out_dir,
"{}-{}-{}{}{}{}.pkl".format(
task,
split_name,
metaicl_data.method,
"-k={}".format(args.k) if args.use_demonstrations else "",
"-s={}".format(seed) if args.use_demonstrations else "",
"" if add_newlines else "-no-newlines"))
gpt3_cache_path = os.path.join(args.out_dir,
"{}-{}-{}{}{}{}.json".format(
task,
split_name,
metaicl_data.method,
"-k={}".format(args.k) if args.use_demonstrations else "",
"-s={}".format(seed) if args.use_demonstrations else "",
"" if add_newlines else "-no-newlines"))
else:
assert add_newlines
cache_path = os.path.join(args.out_dir, "{}-{}-{}{}{}.pkl".format(
task,
args.split,
metaicl_data.method,
"-k={}".format(args.k) if args.use_demonstrations else "",
"-s={}".format(seed) if args.use_demonstrations else ""
))
gpt3_cache_path = os.path.join(args.out_dir, "{}-{}-{}{}{}.json".format(
task,
args.split,
metaicl_data.method,
"-k={}".format(args.k) if args.use_demonstrations else "",
"-s={}".format(seed) if args.use_demonstrations else ""
))
metaicl_data.tensorize(train_data, dev_data, add_newlines=add_newlines)
gpt3_dataloader, gpt3_metadata = metaicl_model.prepare_data(train_data if args.use_demonstrations else [],
dev_data, args.method, batch_size=args.test_batch_size)
# metaicl_data.print_tensorized_example()
logger.info(cache_path)
if os.path.exists(cache_path):
with open(cache_path, "rb") as f:
losses = pkl.load(f)
else:
losses, gpt3cache = metaicl_model.do_inference(gpt3_dataloader)
with open(cache_path, "wb") as f:
pkl.dump(losses, f)
with open(gpt3_cache_path, "w") as f:
json.dump(gpt3cache, f)
if args.is_null:
return None
if args.use_calibration:
assert args.do_zeroshot
bias_path = cache_path.replace("/"+task+"-"+args.split, "/"+task+"-"+args.split+"-null")
assert os.path.exists(bias_path), bias_path
with open(bias_path, "rb") as f:
bias_losses = pkl.load(f)
losses = np.array(losses)
bias_losses = np.array(bias_losses)
assert losses.shape == bias_losses.shape
losses -= bias_losses
predictions = metaicl_model.do_predict(losses=losses, metadata=gpt3_metadata)
groundtruths = [dp["output"] for dp in dev_data]
perf = metaicl_data.evaluate(predictions, groundtruths, is_classification)
logger.info("Accuracy=%s" % perf)
prediction_path = cache_path.replace(".pkl", ".txt")
if args.use_calibration:
prediction_path = prediction_path.replace(".txt", "-calibrated.txt")
with open(prediction_path, "w") as f:
for prediction in predictions:
f.write(prediction)
f.write("\n")
return perf
if __name__=='__main__':
parser = argparse.ArgumentParser()
parser.add_argument("--do_zeroshot", default=False, action="store_true")
parser.add_argument("--use_demonstrations", default=False, action="store_true")
parser.add_argument("--use_calibration", default=False, action="store_true")
parser.add_argument("--unseen_domain_only", default=False, action="store_true")
parser.add_argument("--log_file", default=None, type=str)
parser.add_argument("--task", type=str, default=None)
parser.add_argument("--dataset", type=str, default=None)
parser.add_argument("--k", type=int, default=16)
parser.add_argument("--seed", type=str, default="100")
parser.add_argument("--test_batch_size", type=int, default=64)
parser.add_argument("--global_step", type=str, default=None)
parser.add_argument("--checkpoint", type=str, default=None)
parser.add_argument("--out_dir", type=str, required=True)
parser.add_argument("--split", type=str, default="test")
parser.add_argument("--is_null", default=False, action="store_true")
parser.add_argument("--method", type=str, default="direct", choices=["direct", "channel"])
parser.add_argument("--gpt3", type=str, default="davinci", choices=["ada", "babbage", "curie", "davinci"])
parser.add_argument("--api", type=str, required=True)
args = parser.parse_args()
handlers = [logging.StreamHandler()]
if args.log_file is not None:
handlers.append(logging.FileHandler(args.log_file))
logging.basicConfig(format='%(asctime)s - %(levelname)s - %(name)s - %(message)s',
datefmt='%m/%d/%Y %H:%M:%S',
level=logging.INFO,
handlers=handlers)
logger = logging.getLogger(__name__)
logger.info(args)
main(logger, args)
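
The calibration step in `run` subtracts the losses obtained with a null input from the regular losses before predictions are made. The sketch below illustrates that idea in isolation. It assumes that each test example owns a group of candidate losses and that the candidate with the lowest calibrated loss is chosen; `calibrate_and_predict`, `option_groups`, and `candidates` are hypothetical names used only for illustration, and the actual `GPT3Model.do_predict` in `gpt3.py` may organize its metadata differently.

```python
import numpy as np


def calibrate_and_predict(losses, bias_losses, option_groups, candidates):
    """Pick the lowest-loss candidate per example after null-input calibration.

    losses / bias_losses: flat sequences of per-candidate losses (same shape).
    option_groups: one list of flat indices per test example (assumed layout).
    candidates: candidate label strings aligned with the flat loss arrays.
    """
    losses = np.asarray(losses, dtype=float)
    bias_losses = np.asarray(bias_losses, dtype=float)
    assert losses.shape == bias_losses.shape

    # Same operation as `losses -= bias_losses` in run(), without mutating inputs.
    calibrated = losses - bias_losses

    predictions = []
    for indices in option_groups:
        # Lower loss means the model finds the candidate more likely.
        best = min(indices, key=lambda i: calibrated[i])
        predictions.append(candidates[best])
    return predictions


if __name__ == "__main__":
    losses = [1.0, 2.0, 3.0, 2.5]
    bias_losses = [0.5, 2.0, 1.0, 2.0]
    option_groups = [[0, 1], [2, 3]]
    candidates = ["positive", "negative", "positive", "negative"]
    print(calibrate_and_predict(losses, bias_losses, option_groups, candidates))
```

In this toy input, the first example would be labeled "positive" from the raw losses alone, but the calibrated losses favor "negative", which is the kind of shift calibration is meant to produce.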