[
  {
    "path": ".gitignore",
    "content": "# Byte-compiled / optimized / DLL files\n__pycache__/\n*.py[cod]\n*$py.class\n\n# C extensions\n*.so\n\n# Distribution / packaging\n.Python\nbuild/\ndevelop-eggs/\ndist/\ndownloads/\neggs/\n.eggs/\nlib/\nlib64/\nparts/\nsdist/\nvar/\nwheels/\nshare/python-wheels/\n*.egg-info/\n.installed.cfg\n*.egg\nMANIFEST\n\n# PyInstaller\n#  Usually these files are written by a python script from a template\n#  before PyInstaller builds the exe, so as to inject date/other infos into it.\n*.manifest\n*.spec\n\n# Installer logs\npip-log.txt\npip-delete-this-directory.txt\n\n# Unit test / coverage reports\nhtmlcov/\n.tox/\n.nox/\n.coverage\n.coverage.*\n.cache\nnosetests.xml\ncoverage.xml\n*.cover\n*.py,cover\n.hypothesis/\n.pytest_cache/\ncover/\n\n# Translations\n*.mo\n*.pot\n\n# Django stuff:\n*.log\nlocal_settings.py\ndb.sqlite3\ndb.sqlite3-journal\n\n# Flask stuff:\ninstance/\n.webassets-cache\n\n# Scrapy stuff:\n.scrapy\n\n# Sphinx documentation\ndocs/_build/\n\n# PyBuilder\n.pybuilder/\ntarget/\n\n# Jupyter Notebook\n.ipynb_checkpoints\n\n# IPython\nprofile_default/\nipython_config.py\n\n# pyenv\n#   For a library or package, you might want to ignore these files since the code is\n#   intended to run in multiple environments; otherwise, check them in:\n# .python-version\n\n# pipenv\n#   According to pypa/pipenv#598, it is recommended to include Pipfile.lock in version control.\n#   However, in case of collaboration, if having platform-specific dependencies or dependencies\n#   having no cross-platform support, pipenv may install dependencies that don't work, or not\n#   install all needed dependencies.\n#Pipfile.lock\n\n# poetry\n#   Similar to Pipfile.lock, it is generally recommended to include poetry.lock in version control.\n#   This is especially recommended for binary packages to ensure reproducibility, and is more\n#   commonly ignored for libraries.\n#   
https://python-poetry.org/docs/basic-usage/#commit-your-poetrylock-file-to-version-control\n#poetry.lock\n\n# pdm\n#   Similar to Pipfile.lock, it is generally recommended to include pdm.lock in version control.\n#pdm.lock\n#   pdm stores project-wide configurations in .pdm.toml, but it is recommended to not include it\n#   in version control.\n#   https://pdm.fming.dev/#use-with-ide\n.pdm.toml\n\n# PEP 582; used by e.g. github.com/David-OConnor/pyflow and github.com/pdm-project/pdm\n__pypackages__/\n\n# Celery stuff\ncelerybeat-schedule\ncelerybeat.pid\n\n# SageMath parsed files\n*.sage.py\n\n# Environments\n.env\n.venv\nenv/\nvenv/\nENV/\nenv.bak/\nvenv.bak/\n\n# Spyder project settings\n.spyderproject\n.spyproject\n\n# Rope project settings\n.ropeproject\n\n# mkdocs documentation\n/site\n\n# mypy\n.mypy_cache/\n.dmypy.json\ndmypy.json\n\n# Pyre type checker\n.pyre/\n\n# pytype static type analyzer\n.pytype/\n\n# Cython debug symbols\ncython_debug/\n\n# PyCharm\n#  JetBrains specific template is maintained in a separate JetBrains.gitignore that can\n#  be found at https://github.com/github/gitignore/blob/main/Global/JetBrains.gitignore\n#  and can be added to the global gitignore or merged into this file.  For a more nuclear\n#  option (not recommended) you can uncomment the following to ignore the entire idea folder.\n#.idea/\n"
  },
  {
    "path": "README.md",
    "content": "<h1 align=\"center\">The Practical Guides for Large Language Models </h1>\n\n\n<p align=\"center\">\n\t<img src=\"https://camo.githubusercontent.com/64f8905651212a80869afbecbf0a9c52a5d1e70beab750dea40a994fa9a9f3c6/68747470733a2f2f617765736f6d652e72652f62616467652e737667\" alt=\"Awesome\" data-canonical-src=\"https://awesome.re/badge.svg\" style=\"max-width: 100%;\">\t     \n</p>\n\nA curated (and still actively updated) list of practical guide resources for LLMs. It is based on our survey paper, [Harnessing the Power of LLMs in Practice: A Survey on ChatGPT and Beyond](https://arxiv.org/abs/2304.13712), and efforts from @[xinyadu](https://github.com/xinyadu). The survey is partially based on the second half of this [Blog](https://jingfengyang.github.io/gpt). We also build an evolutionary tree of modern Large Language Models (LLMs) to trace the development of language models in recent years and highlight some of the most well-known models. \n\nThese sources aim to help practitioners navigate the vast landscape of large language models (LLMs) and their applications in natural language processing (NLP). We also include their usage restrictions based on the model and data licensing information.\nIf you find any resources in our repository helpful, please feel free to use them (don't forget to cite our paper! 😃). We welcome pull requests to refine this figure! 
\n\n<p align=\"center\">\n<img width=\"600\" src=\"./imgs/tree.jpg\"/>\n</p>\n\n\n```bibtex\n    @article{yang2023harnessing,\n        title={Harnessing the Power of LLMs in Practice: A Survey on ChatGPT and Beyond}, \n        author={Jingfeng Yang and Hongye Jin and Ruixiang Tang and Xiaotian Han and Qizhang Feng and Haoming Jiang and Bing Yin and Xia Hu},\n        year={2023},\n        eprint={2304.13712},\n        archivePrefix={arXiv},\n        primaryClass={cs.CL}\n    }\n```\n\n## Latest News💥\n- We added usage and restrictions section.\n- We used PowerPoint to plot the figure and released the source file [pptx](./source/figure_gif.pptx) for our GIF figure. [4/27/2023]\n- We released the source file for the still version [pptx](./source/figure_still.pptx), and replaced the figure in this repo with the still version. [4/29/2023]\n- Add AlexaTM, UniLM, UniLMv2 to the figure, and correct the logo for Tk. [4/29/2023]\n- Add usage and Restrictions (for commercial and research purposes) section. Credits to [Dr. Du](https://github.com/xinyadu).  [5/8/2023]\n\n\n\n\n## Other Practical Guides for LLMs\n\n- **Why did all of the public reproduction of GPT-3 fail? 
In which tasks should we use GPT-3.5/ChatGPT?** 2023, [Blog](https://jingfengyang.github.io/gpt) \n- **Building LLM applications for production**, 2023, [Blog](https://huyenchip.com/2023/04/11/llm-engineering.html)\n- **Data-centric Artificial Intelligence**, 2023, [Repo](https://github.com/daochenzha/data-centric-AI)/[Blog](https://towardsdatascience.com/what-are-the-data-centric-ai-concepts-behind-gpt-models-a590071bb727)/[Paper](https://arxiv.org/abs/2303.10158)\n\n\n## Catalog\n* [The Practical Guides for Large Language Models ](#the-practical-guides-for-large-language-models-)\n   * [Practical Guide for Models](#practical-guide-for-models)\n      * [BERT-style Language Models: Encoder-Decoder or Encoder-only](#bert-style-language-models-encoder-decoder-or-encoder-only)\n      * [GPT-style Language Models: Decoder-only](#gpt-style-language-models-decoder-only)\n   * [Practical Guide for Data](#practical-guide-for-data)\n      * [Pretraining data](#pretraining-data)\n      * [Finetuning data](#finetuning-data)\n      * [Test data/user data](#test-datauser-data)\n   * [Practical Guide for NLP Tasks](#practical-guide-for-nlp-tasks)\n      * [Traditional NLU tasks](#traditional-nlu-tasks)\n      * [Generation tasks](#generation-tasks)\n      * [Knowledge-intensive tasks](#knowledge-intensive-tasks)\n      * [Abilities with Scaling](#abilities-with-scaling)\n      * [Specific tasks](#specific-tasks)\n      * [Real-World ''Tasks''](#real-world-tasks)\n      * [Efficiency](#efficiency)\n      * [Trustworthiness](#trustworthiness)\n      * [Benchmark Instruction Tuning](#benchmark-instruction-tuning)\n      * [Alignment](#alignment)\n         * [Safety Alignment (Harmless)](#safety-alignment-harmless)\n         * [Truthfulness Alignment (Honest)](#truthfulness-alignment-honest)\n         * [Practical Guides for Prompting (Helpful)](#practical-guides-for-prompting-helpful)\n         * [Alignment Efforts of Open-source 
Community](#alignment-efforts-of-open-source-communtity)\n   * [Usage and Restrictions (Models and Data)](#Usage-and-Restrictions)\n\n## Practical Guide for Models\n\n### BERT-style Language Models: Encoder-Decoder or Encoder-only\n\n- BERT **BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding**, 2018, [Paper](https://aclanthology.org/N19-1423.pdf)\n- RoBERTa **RoBERTa: A Robustly Optimized BERT Pretraining Approach**, 2019, [Paper](https://arxiv.org/abs/1907.11692)\n- DistilBERT **DistilBERT, a distilled version of BERT: smaller, faster, cheaper and lighter**, 2019, [Paper](https://arxiv.org/abs/1910.01108)\n- ALBERT **ALBERT: A Lite BERT for Self-supervised Learning of Language Representations**, 2019, [Paper](https://arxiv.org/abs/1909.11942)\n- UniLM **Unified Language Model Pre-training for Natural Language Understanding and Generation**, 2019, [Paper](https://arxiv.org/abs/1905.03197)\n- ELECTRA **ELECTRA: PRE-TRAINING TEXT ENCODERS AS DISCRIMINATORS RATHER THAN GENERATORS**, 2020, [Paper](https://openreview.net/pdf?id=r1xMH1BtvB)\n- T5 **\"Exploring the Limits of Transfer Learning with a Unified Text-to-Text Transformer\"**. *Colin Raffel et al.* JMLR 2019. [Paper](https://arxiv.org/abs/1910.10683)\n- GLM **\"GLM-130B: An Open Bilingual Pre-trained Model\"**. 2022. [Paper](https://arxiv.org/abs/2210.02414)\n- AlexaTM **\"AlexaTM 20B: Few-Shot Learning Using a Large-Scale Multilingual Seq2Seq Model\"**. *Saleh Soltan et al.* arXiv 2022. [Paper](https://arxiv.org/abs/2208.01448)\n- ST-MoE **ST-MoE: Designing Stable and Transferable Sparse Expert Models**. 2022 [Paper](https://arxiv.org/abs/2202.08906)\n\n\n### GPT-style Language Models: Decoder-only\n\n- GPT **Improving Language Understanding by Generative Pre-Training**. 2018. [Paper](https://s3-us-west-2.amazonaws.com/openai-assets/research-covers/language-unsupervised/language_understanding_paper.pdf)\n- GPT-2 **Language Models are Unsupervised Multitask Learners**. 2019. 
[Paper](https://d4mucfpksywv.cloudfront.net/better-language-models/language_models_are_unsupervised_multitask_learners.pdf)\n- GPT-3 **\"Language Models are Few-Shot Learners\"**. NeurIPS 2020. [Paper](https://arxiv.org/abs/2005.14165)\n- OPT **\"OPT: Open Pre-trained Transformer Language Models\"**. 2022. [Paper](https://arxiv.org/abs/2205.01068)\n- PaLM **\"PaLM: Scaling Language Modeling with Pathways\"**. *Aakanksha Chowdhery et al.* arXiv 2022. [Paper](https://arxiv.org/abs/2204.02311)\n- BLOOM  **\"BLOOM: A 176B-Parameter Open-Access Multilingual Language Model\"**. 2022. [Paper](https://arxiv.org/abs/2211.05100)\n- MT-NLG **\"Using DeepSpeed and Megatron to Train Megatron-Turing NLG 530B, A Large-Scale Generative Language Model\"**. 2021. [Paper](https://arxiv.org/abs/2201.11990)\n- GLaM **\"GLaM: Efficient Scaling of Language Models with Mixture-of-Experts\"**. ICML 2022. [Paper](https://arxiv.org/abs/2112.06905)\n- Gopher **\"Scaling Language Models: Methods, Analysis & Insights from Training Gopher\"**. 2021. [Paper](http://arxiv.org/abs/2112.11446v2)\n- Chinchilla **\"Training Compute-Optimal Large Language Models\"**. 2022. [Paper](https://arxiv.org/abs/2203.15556)\n- LaMDA **\"LaMDA: Language Models for Dialog Applications\"**. 2021. [Paper](https://arxiv.org/abs/2201.08239)\n- LLaMA **\"LLaMA: Open and Efficient Foundation Language Models\"**. 2023. [Paper](https://arxiv.org/abs/2302.13971v1)\n- GPT-4 **\"GPT-4 Technical Report\"**. 2023. [Paper](http://arxiv.org/abs/2303.08774v2)\n- BloombergGPT **BloombergGPT: A Large Language Model for Finance**, 2023, [Paper](https://arxiv.org/abs/2303.17564)\n- GPT-NeoX-20B: **\"GPT-NeoX-20B: An Open-Source Autoregressive Language Model\"**. 2022. [Paper](https://arxiv.org/abs/2204.06745)\n- PaLM 2: **\"PaLM 2 Technical Report\"**. 2023. [Tech.Report](https://arxiv.org/abs/2305.10403)\n- LLaMA 2: **\"Llama 2: Open foundation and fine-tuned chat models\"**. 2023. 
[Paper](https://arxiv.org/pdf/2307.09288)\n- Claude 2: **\"Model Card and Evaluations for Claude Models\"**. 2023. [Model Card](https://www-files.anthropic.com/production/images/Model-Card-Claude-2.pdf)\n\n\n\n## Practical Guide for Data\n\n\n### Pretraining data\n- **RedPajama**, 2023. [Repo](https://github.com/togethercomputer/RedPajama-Data)\n- **The Pile: An 800GB Dataset of Diverse Text for Language Modeling**, Arxiv 2020. [Paper](https://arxiv.org/abs/2101.00027)\n- **How does the pre-training objective affect what large language models learn about linguistic properties?**, ACL 2022. [Paper](https://aclanthology.org/2022.acl-short.16/)\n- **Scaling laws for neural language models**, 2020. [Paper](https://arxiv.org/abs/2001.08361)\n- **Data-centric artificial intelligence: A survey**, 2023. [Paper](https://arxiv.org/abs/2303.10158)\n- **How does GPT Obtain its Ability? Tracing Emergent Abilities of Language Models to their Sources**, 2022. [Blog](https://yaofu.notion.site/How-does-GPT-Obtain-its-Ability-Tracing-Emergent-Abilities-of-Language-Models-to-their-Sources-b9a57ac0fcf74f30a1ab9e3e36fa1dc1)\n### Finetuning data\n- **Benchmarking zero-shot text classification: Datasets, evaluation and entailment approach**, EMNLP 2019. [Paper](https://arxiv.org/abs/1909.00161)\n- **Language Models are Few-Shot Learners**, NIPS 2020. [Paper](https://proceedings.neurips.cc/paper/2020/hash/1457c0d6bfcb4967418bfb8ac142f64a-Abstract.html)\n- **Does Synthetic Data Generation of LLMs Help Clinical Text Mining?** Arxiv 2023 [Paper](https://arxiv.org/abs/2303.04360)\n### Test data/user data\n- **Shortcut learning of large language models in natural language understanding: A survey**, Arxiv 2023. [Paper](https://arxiv.org/abs/2208.11857)\n- **On the Robustness of ChatGPT: An Adversarial and Out-of-distribution Perspective** Arxiv, 2023. [Paper](https://arxiv.org/abs/2302.12095)\n- **SuperGLUE: A Stickier Benchmark for General-Purpose Language Understanding Systems** Arxiv 2019. 
[Paper](https://arxiv.org/abs/1905.00537)\n\n\n\n\n\n## Practical Guide for NLP Tasks\nWe build a decision flow for choosing between LLMs and fine-tuned models for users' NLP applications. The decision flow helps users assess whether their downstream NLP applications at hand meet specific conditions and, based on that evaluation, determine whether LLMs or fine-tuned models are the more suitable choice for their applications.\n<p align=\"center\">\n<img width=\"500\" src=\"./imgs/decision.png\"/>  \n</p>\n\n### Traditional NLU tasks\n\n- **A benchmark for toxic comment classification on civil comments dataset** Arxiv 2023 [Paper](https://arxiv.org/abs/2301.11125)\n- **Is chatgpt a general-purpose natural language processing task solver?** Arxiv 2023 [Paper](https://arxiv.org/abs/2302.06476)\n- **Benchmarking large language models for news summarization** Arxiv 2022 [Paper](https://arxiv.org/abs/2301.13848)\n### Generation tasks\n- **News summarization and evaluation in the era of gpt-3** Arxiv 2022 [Paper](https://arxiv.org/abs/2209.12356)\n- **Is chatgpt a good translator? yes with gpt-4 as the engine** Arxiv 2023 [Paper](https://arxiv.org/abs/2301.08745)\n- **Multilingual machine translation systems from Microsoft for WMT21 shared task**, WMT2021 [Paper](https://aclanthology.org/2021.wmt-1.54/)\n- **Can ChatGPT understand too? 
a comparative study on chatgpt and fine-tuned bert**, Arxiv 2023, [Paper](https://arxiv.org/pdf/2302.10198.pdf)\n\n\n\n\n### Knowledge-intensive tasks\n- **Measuring massive multitask language understanding**, ICLR 2021 [Paper](https://arxiv.org/abs/2009.03300)\n- **Beyond the imitation game: Quantifying and extrapolating the capabilities of language models**, Arxiv 2022 [Paper](https://arxiv.org/abs/2206.04615)\n- **Inverse scaling prize**, 2022 [Link](https://github.com/inverse-scaling/prize)\n- **Atlas: Few-shot Learning with Retrieval Augmented Language Models**, Arxiv 2022 [Paper](https://arxiv.org/abs/2208.03299)\n- **Large Language Models Encode Clinical Knowledge**, Arxiv 2022 [Paper](https://arxiv.org/abs/2212.13138)\n\n\n### Abilities with Scaling\n\n- **Training Compute-Optimal Large Language Models**, NeurIPS 2022 [Paper](https://openreview.net/pdf?id=iBBcRUlOAPR)\n- **Scaling Laws for Neural Language Models**, Arxiv 2020 [Paper](https://arxiv.org/abs/2001.08361)\n- **Solving math word problems with process- and outcome-based feedback**, Arxiv 2022 [Paper](https://arxiv.org/abs/2211.14275)\n- **Chain of thought prompting elicits reasoning in large language models**, NeurIPS 2022 [Paper](https://arxiv.org/abs/2201.11903)\n- **Emergent abilities of large language models**, TMLR 2022 [Paper](https://arxiv.org/abs/2206.07682)\n- **Inverse scaling can become U-shaped**, Arxiv 2022 [Paper](https://arxiv.org/abs/2211.02011)\n- **Towards Reasoning in Large Language Models: A Survey**, Arxiv 2022 [Paper](https://arxiv.org/abs/2212.10403)\n\n\n### Specific tasks\n- **Image as a Foreign Language: BEiT Pretraining for All Vision and Vision-Language Tasks**, Arxiv 2022 [Paper](https://arxiv.org/abs/2208.10442)\n- **PaLI: A Jointly-Scaled Multilingual Language-Image Model**, Arxiv 2022 [Paper](https://arxiv.org/abs/2209.06794)\n- **AugGPT: Leveraging ChatGPT for Text Data Augmentation**, Arxiv 2023 [Paper](https://arxiv.org/abs/2302.13007)\n- **Is gpt-3 a good data 
annotator?**, Arxiv 2022 [Paper](https://arxiv.org/abs/2212.10450)\n- **Want To Reduce Labeling Cost? GPT-3 Can Help**, EMNLP findings 2021 [Paper](https://aclanthology.org/2021.findings-emnlp.354/)\n- **GPT3Mix: Leveraging Large-scale Language Models for Text Augmentation**, EMNLP findings 2021 [Paper](https://aclanthology.org/2021.findings-emnlp.192/)\n- **LLM for Patient-Trial Matching: Privacy-Aware Data Augmentation Towards Better Performance and Generalizability**, Arxiv 2023 [Paper](https://arxiv.org/abs/2303.16756)\n- **ChatGPT Outperforms Crowd-Workers for Text-Annotation Tasks**, Arxiv 2023 [Paper](https://arxiv.org/abs/2303.15056)\n- **G-Eval: NLG Evaluation using GPT-4 with Better Human Alignment**, Arxiv 2023 [Paper](https://arxiv.org/abs/2303.16634)\n- **GPTScore: Evaluate as You Desire**, Arxiv 2023 [Paper](https://arxiv.org/abs/2302.04166)\n- **Large Language Models Are State-of-the-Art Evaluators of Translation Quality**, Arxiv 2023 [Paper](https://arxiv.org/abs/2302.14520)\n- **Is ChatGPT a Good NLG Evaluator? A Preliminary Study**, Arxiv 2023 [Paper](https://arxiv.org/abs/2303.04048)\n\n### Real-World ''Tasks''\n- **Sparks of Artificial General Intelligence: Early experiments with GPT-4**, Arxiv 2023 [Paper](https://arxiv.org/abs/2303.12712)\n\n### Efficiency\n1. Cost\n- **Openai’s gpt-3 language model: A technical overview**, 2020. [Blog Post](https://lambdalabs.com/blog/demystifying-gpt-3)\n- **Measuring the carbon intensity of ai in cloud instances**, FaccT 2022. [Paper](https://dl.acm.org/doi/abs/10.1145/3531146.3533234)\n- **In AI, is bigger always better?**, Nature Article 2023. [Article](https://www.nature.com/articles/d41586-023-00641-w)\n- **Language Models are Few-Shot Learners**, NeurIPS 2020. [Paper](https://proceedings.neurips.cc/paper_files/paper/2020/file/1457c0d6bfcb4967418bfb8ac142f64a-Paper.pdf)\n- **Pricing**, OpenAI. [Blog Post](https://openai.com/pricing)\n2. 
Latency\n- HELM: **Holistic evaluation of language models**, Arxiv 2022. [Paper](https://arxiv.org/abs/2211.09110)\n3. Parameter-Efficient Fine-Tuning\n- **LoRA: Low-Rank Adaptation of Large Language Models**, Arxiv 2021. [Paper](https://arxiv.org/abs/2106.09685)\n- **Prefix-Tuning: Optimizing Continuous Prompts for Generation**, ACL 2021. [Paper](https://aclanthology.org/2021.acl-long.353/)\n- **P-Tuning: Prompt Tuning Can Be Comparable to Fine-tuning Across Scales and Tasks**, ACL 2022. [Paper](https://aclanthology.org/2022.acl-short.8/)\n- **P-Tuning v2: Prompt Tuning Can Be Comparable to Fine-tuning Universally Across Scales and Tasks**, Arxiv 2022. [Paper](https://arxiv.org/abs/2110.07602)\n4. Pretraining System\n- **ZeRO: Memory Optimizations Toward Training Trillion Parameter Models**, Arxiv 2019. [Paper](https://arxiv.org/abs/1910.02054)\n- **Megatron-LM: Training Multi-Billion Parameter Language Models Using Model Parallelism**, Arxiv 2019. [Paper](https://arxiv.org/abs/1909.08053)\n- **Efficient Large-Scale Language Model Training on GPU Clusters Using Megatron-LM**, Arxiv 2021. [Paper](https://arxiv.org/abs/2104.04473)\n- **Reducing Activation Recomputation in Large Transformer Models**, Arxiv 2022. [Paper](https://arxiv.org/abs/2205.05198)\n\n\n### Trustworthiness\n1. Robustness and Calibration\n- **Calibrate before use: Improving few-shot performance of language models**, ICML 2021. [Paper](http://proceedings.mlr.press/v139/zhao21c.html)\n- **SPeC: A Soft Prompt-Based Calibration on Mitigating Performance Variability in Clinical Notes Summarization**, Arxiv 2023. [Paper](https://arxiv.org/abs/2303.13035)\n  \n2. 
Spurious biases\n- **Large Language Models Can be Lazy Learners: Analyze Shortcuts in In-Context Learning**, Findings of ACL 2023 [Paper](https://aclanthology.org/2023.findings-acl.284/)\n- **Shortcut learning of large language models in natural language understanding: A survey**, 2023 [Paper](https://arxiv.org/abs/2208.11857)\n- **Mitigating gender bias in captioning system**, WWW 2020 [Paper](https://dl.acm.org/doi/abs/10.1145/3442381.3449950)\n- **Calibrate Before Use: Improving Few-Shot Performance of Language Models**, ICML 2021 [Paper](https://arxiv.org/abs/2102.09690)\n- **Shortcut Learning in Deep Neural Networks**, Nature Machine Intelligence 2020 [Paper](https://www.nature.com/articles/s42256-020-00257-z)\n- **Do Prompt-Based Models Really Understand the Meaning of Their Prompts?**, NAACL 2022 [Paper](https://aclanthology.org/2022.naacl-main.167/)\n  \n3. Safety issues\n- **GPT-4 System Card**, 2023 [Paper](https://cdn.openai.com/papers/gpt-4-system-card.pdf)\n- **The science of detecting llm-generated texts**, Arxiv 2023 [Paper](https://arxiv.org/pdf/2303.07205.pdf)\n- **How stereotypes are shared through language: a review and introduction of the social categories and stereotypes communication (scsc) framework**, Review of Communication Research, 2019 [Paper](https://research.vu.nl/en/publications/how-stereotypes-are-shared-through-language-a-review-and-introduc)\n- **Gender shades: Intersectional accuracy disparities in commercial gender classification**, FaccT 2018 [Paper](https://proceedings.mlr.press/v81/buolamwini18a/buolamwini18a.pdf)\n\n\n### Benchmark Instruction Tuning\n\n- FLAN: **Finetuned Language Models Are Zero-Shot Learners**, Arxiv 2021 [Paper](https://arxiv.org/abs/2109.01652)\n- T0: **Multitask Prompted Training Enables Zero-Shot Task Generalization**, Arxiv 2021 [Paper](https://arxiv.org/abs/2110.08207)\n- **Cross-task generalization via natural language crowdsourcing instructions**, ACL 2022 
[Paper](https://aclanthology.org/2022.acl-long.244.pdf)\n- Tk-INSTRUCT: **Super-NaturalInstructions: Generalization via Declarative Instructions on 1600+ NLP Tasks**, EMNLP 2022 [Paper](https://aclanthology.org/2022.emnlp-main.340/)\n- FLAN-T5/PaLM: **Scaling Instruction-Finetuned Language Models**, Arxiv 2022 [Paper](https://arxiv.org/abs/2210.11416)\n- **The Flan Collection: Designing Data and Methods for Effective Instruction Tuning**, Arxiv 2023 [Paper](https://arxiv.org/abs/2301.13688)\n- **OPT-IML: Scaling Language Model Instruction Meta Learning through the Lens of Generalization**, Arxiv 2023 [Paper](https://arxiv.org/abs/2212.12017)\n\n### Alignment\n\n- **Deep Reinforcement Learning from Human Preferences**, NIPS 2017 [Paper](https://arxiv.org/abs/1706.03741)\n- **Learning to summarize from human feedback**, Arxiv 2020 [Paper](https://arxiv.org/abs/2009.01325)\n- **A General Language Assistant as a Laboratory for Alignment**, Arxiv 2021 [Paper](https://arxiv.org/abs/2112.00861)\n- **Training a Helpful and Harmless Assistant with Reinforcement Learning from Human Feedback**, Arxiv 2022 [Paper](https://arxiv.org/abs/2204.05862)\n- **Teaching language models to support answers with verified quotes**, Arxiv 2022 [Paper](https://arxiv.org/abs/2203.11147)\n- InstructGPT: **Training language models to follow instructions with human feedback**, Arxiv 2022 [Paper](https://arxiv.org/abs/2203.02155)\n- **Improving alignment of dialogue agents via targeted human judgements**, Arxiv 2022 [Paper](https://arxiv.org/abs/2209.14375)\n- **Scaling Laws for Reward Model Overoptimization**, Arxiv 2022 [Paper](https://arxiv.org/abs/2210.10760)\n- Scalable Oversight: **Measuring Progress on Scalable Oversight for Large Language Models**, Arxiv 2022 [Paper](https://arxiv.org/pdf/2211.03540.pdf)\n\n#### Safety Alignment (Harmless)\n\n- **Red Teaming Language Models with Language Models**, Arxiv 2022 [Paper](https://arxiv.org/abs/2202.03286)\n- **Constitutional ai: Harmlessness 
from ai feedback**, Arxiv 2022 [Paper](https://arxiv.org/abs/2212.08073)\n- **The Capacity for Moral Self-Correction in Large Language Models**, Arxiv 2023 [Paper](https://arxiv.org/abs/2302.07459)\n- **OpenAI: Our approach to AI safety**, 2023 [Blog](https://openai.com/blog/our-approach-to-ai-safety)\n\n#### Truthfulness Alignment (Honest)\n\n- **Reinforcement Learning for Language Models**, 2023 [Blog](https://gist.github.com/yoavg/6bff0fecd65950898eba1bb321cfbd81)\n\n#### Practical Guides for Prompting (Helpful)\n\n- **OpenAI Cookbook**. [Blog](https://github.com/openai/openai-cookbook/blob/main/techniques_to_improve_reliability.md)\n- **Prompt Engineering**. [Blog](https://lilianweng.github.io/posts/2023-03-15-prompt-engineering/)\n- **ChatGPT Prompt Engineering for Developers!** [Course](https://www.deeplearning.ai/short-courses/chatgpt-prompt-engineering-for-developers/)\n\n#### Alignment Efforts of Open-source Communtity\n\n- **Self-Instruct: Aligning Language Model with Self Generated Instructions**, Arxiv 2022 [Paper](https://arxiv.org/abs/2212.10560)\n- **Alpaca**. [Repo](https://github.com/tatsu-lab/stanford_alpaca)\n- **Vicuna**. [Repo](https://github.com/lm-sys/FastChat)\n- **Dolly**. [Blog](https://www.databricks.com/blog/2023/04/12/dolly-first-open-commercially-viable-instruction-tuned-llm)\n- **DeepSpeed-Chat**. [Blog](https://github.com/microsoft/DeepSpeedExamples/tree/master/applications/DeepSpeed-Chat)\n- **GPT4All**. [Repo](https://github.com/nomic-ai/gpt4all)\n- **OpenAssistant**. [Repo](https://github.com/LAION-AI/Open-Assistant)\n- **ChatGLM**. [Repo](https://github.com/THUDM/ChatGLM-6B)\n- **MOSS**. [Repo](https://github.com/OpenLMLab/MOSS)\n- **Lamini**. [Repo](https://github.com/lamini-ai/lamini/)/[Blog](https://lamini.ai/blog/introducing-lamini)\n\n## Usage and Restrictions\n\n<!-- We build a decision flow for choosing LLMs or fine-tuned models~\\protect\\footnotemark for user's NLP applications.  
-->\n<!-- The decision flow helps users assess whether their downstream NLP applications at hand meet specific conditions and, based on that evaluation, determine whether LLMs or fine-tuned models are the most suitable choice for their applications. -->\n\nWe build a table summarizing the usage restrictions of LLMs (e.g., for commercial and research purposes). In particular, we provide this information from the perspective of both the models and their pretraining data.\nWe urge the users in the community to refer to the licensing information for public models and data and use them in a responsible manner.\nWe urge developers to pay special attention to licensing and to make licenses transparent and comprehensive, to prevent any unwanted or unforeseen usage.\n\n<table class=\"table table-bordered table-hover table-condensed\">\n    <thead><tr><th title=\"Field #1\">LLMs</th>\n    <th title=\"Field #2\" colspan=\"3\" align=\"center\">Model</th>\n    <!-- <th title=\"Field #3\"></th> -->\n    <!-- <th title=\"Field #4\"></th> -->\n    <th title=\"Field #5\" colspan=\"2\" align=\"center\">Data</th>\n    <!-- <th title=\"Field #6\"></th> -->\n    </tr></thead>\n    <tbody><tr>\n    <td> </td>\n    <td><b>License</b></td>\n    <td><b>Commercial Use</b></td>\n    <td><b>Other notable restrictions</b></td>\n    <td><b>License</b></td>\n    <td><b>Corpus</b></td>\n    </tr>\n    <tr>\n        <td colspan=\"6\" align=\"left\"><b>Encoder-only</b></td>\n    </tr>\n    <tr>\n    <td>BERT series of models (general domain)</td>\n    <td>Apache 2.0</td>\n    <td>✅</td>\n    <td> </td>\n    <td>Public</td>\n    <td>BooksCorpus, English Wikipedia</td>\n    </tr>\n    <tr>\n    <td>RoBERTa</td>\n    <td>MIT license</td>\n    <td>✅</td>\n    <td> </td>\n    <td>Public</td>\n    <td>BookCorpus, CC-News, OpenWebText, STORIES</td>\n    </tr>\n    <tr>\n    <td>ERNIE</td>\n    <td>Apache 2.0</td>\n    <td>✅</td>\n    <td> </td>\n    <td>Public</td>\n    <td>English Wikipedia</td>\n    </tr>\n    <tr>\n    
<td>SciBERT</td>\n    <td>Apache 2.0</td>\n    <td>✅</td>\n    <td> </td>\n    <td>Public</td>\n    <td>BERT corpus, <a href=\"https://aclanthology.org/N18-3011.pdf\">1.14M papers from Semantic Scholar</a></td>\n    </tr>\n    <tr>\n    <td>LegalBERT</td>\n    <td>CC BY-SA 4.0</td>\n    <td>❌</td>\n    <td> </td>\n    <td>Public (except data from the <a href=\"https://case.law/\">Case Law Access Project</a>)</td>\n    <td>EU legislation, US court cases, etc.</td>\n    </tr>\n    <tr>\n    <td>BioBERT</td>\n    <td>Apache 2.0</td>\n    <td>✅</td>\n    <td> </td>\n    <td><a href=\"https://www.nlm.nih.gov/databases/download/terms_and_conditions.html\">PubMed</a></td>\n    <td>PubMed, PMC</td>\n    </tr>\n    <tr>\n        <td colspan=\"6\" align=\"left\"><b>Encoder-Decoder</b></td>\n    </tr>\n    <tr>\n    <td>T5</td>\n    <td>Apache 2.0</td>\n    <td>✅</td>\n    <td> </td>\n    <td>Public</td>\n    <td>C4</td>\n    </tr>\n    <tr>\n    <td>Flan-T5</td>\n    <td>Apache 2.0</td>\n    <td>✅</td>\n    <td> </td>\n    <td>Public</td>\n    <td>C4, Mixture of tasks (Fig 2 in paper)</td>\n    </tr>\n    <tr>\n    <td>BART</td>\n    <td>Apache 2.0</td>\n    <td>✅</td>\n    <td> </td>\n    <td>Public</td>\n    <td>RoBERTa corpus </td>\n    </tr>\n    <tr>\n    <td>GLM</td>\n    <td>Apache 2.0</td>\n    <td>✅</td>\n    <td> </td>\n    <td>Public</td>\n    <td>BooksCorpus and English Wikipedia</td>\n    </tr>\n    <tr>\n    <td>ChatGLM</td>\n    <td><a href=\"https://github.com/THUDM/ChatGLM-6B/blob/main/MODEL_LICENSE\">ChatGLM License</a></td>\n    <td>❌</td>\n    <td>No use for illegal purposes or military research; no harm to the public interest of society</td>\n    <td>N/A</td>\n    <td>1T tokens of Chinese and English corpus</td>\n    </tr>\n    <tr>\n        <td colspan=\"6\" align=\"left\"><b>Decoder-only</b></td>\n    </tr>\n    <tr>\n    <td>GPT-2</td>\n    <td><a href=\"https://github.com/openai/gpt-2/blob/master/LICENSE\">Modified MIT License</a></td>\n    <td>✅</td>\n    <td>Use 
GPT-2 responsibly and clearly indicate your content was created using GPT-2.</td>\n    <td>Public</td>\n    <td>WebText</td>\n    </tr>\n    <tr>\n    <td>GPT-Neo</td>\n    <td>MIT license</td>\n    <td>✅</td>\n    <td> </td>\n    <td>Public</td>\n    <td><a href=\"https://pile.eleuther.ai/\">Pile</a></td>\n    </tr>\n    <tr>\n    <td>GPT-J</td>\n    <td>Apache 2.0</td>\n    <td>✅</td>\n    <td> </td>\n    <td>Public</td>\n    <td>Pile</td>\n    </tr>\n    <tr>\n    <td>---&gt; Dolly</td>\n    <td>CC BY NC 4.0</td>\n    <td>❌</td>\n    <td> </td>\n    <td>CC BY NC 4.0, Subject to terms of Use of the data generated by OpenAI</td>\n    <td>Pile, Self-Instruct</td>\n    </tr>\n    <tr>\n    <td>---&gt; GPT4ALL-J</td>\n    <td>Apache 2.0</td>\n    <td>✅</td>\n    <td> </td>\n    <td>Public</td>\n    <td><a href=\"https://huggingface.co/datasets/nomic-ai/gpt4all-j-prompt-generations\">GPT4All-J dataset</a></td>\n    </tr>\n    <tr>\n    <td>Pythia</td>\n    <td>Apache 2.0</td>\n    <td>✅</td>\n    <td> </td>\n    <td>Public</td>\n    <td>Pile</td>\n    </tr>\n    <tr>\n    <td>---&gt; Dolly v2</td>\n    <td>MIT license</td>\n    <td>✅</td>\n    <td> </td>\n    <td>Public</td>\n    <td>Pile, databricks-dolly-15k</td>\n    </tr>\n    <tr>\n    <td>OPT</td>\n    <td><a href=\"https://github.com/facebookresearch/metaseq/blob/main/projects/OPT/MODEL_LICENSE.md?fbclid=IwAR1BFK5X1XdUpx_QXoiqyfzYWdNAXJPcg8Cf0ddv5T7sa2UrLUvymj1J8G4\">OPT-175B LICENSE AGREEMENT</a></td>\n    <td>❌</td>\n    <td>No development relating to surveillance research and military, no harm the public interest of society</td>\n    <td>Public</td>\n    <td>RoBERTa corpus, the Pile, PushShift.io Reddit</td>\n    </tr>\n    <tr>\n    <td>---&gt; OPT-IML</td>\n    <td><a href=\"https://github.com/facebookresearch/metaseq/blob/main/projects/OPT/MODEL_LICENSE.md?fbclid=IwAR1BFK5X1XdUpx_QXoiqyfzYWdNAXJPcg8Cf0ddv5T7sa2UrLUvymj1J8G4\">OPT-175B LICENSE AGREEMENT</a></td>\n    <td>❌</td>\n    <td>same to OPT</td>\n  
  <td>Public</td>\n    <td>OPT corpus, extended version of Super-NaturalInstructions</td>\n    </tr>\n    <tr>\n    <td>YaLM</td>\n    <td>Apache 2.0</td>\n    <td>✅</td>\n    <td> </td>\n    <td>Unspecified</td>\n    <td>Pile, texts in Russian collected by the team</td>\n    </tr>\n    <tr>\n    <td>BLOOM</td>\n    <td><a href=\"https://bigscience.huggingface.co/blog/the-bigscience-rail-license\">The BigScience RAIL License</a></td>\n    <td>✅</td>\n    <td>No generating verifiably false information with the purpose of harming others; <br/>no generating content without expressly disclaiming that the text is machine generated</td>\n    <td>Public</td>\n    <td>ROOTS corpus (Laurençon et al., 2022)</td>\n    </tr>\n    <tr>\n    <td>---&gt; BLOOMZ</td>\n    <td><a href=\"https://bigscience.huggingface.co/blog/the-bigscience-rail-license\">The BigScience RAIL License</a></td>\n    <td>✅</td>\n    <td>Same as BLOOM</td>\n    <td>Public</td>\n    <td>ROOTS corpus, xP3</td>\n    </tr>\n    <tr>\n    <td>Galactica</td>\n    <td><a href=\"https://github.com/paperswithcode/galai/blob/main/LICENSE-MODEL.md\">CC BY-NC 4.0</a></td>\n    <td>❌</td>\n    <td> </td>\n    <td>N/A</td>\n    <td>The Galactica Corpus</td>\n    </tr>\n    <tr>\n    <td>LLaMA</td>\n    <td><a href=\"https://docs.google.com/forms/d/e/1FAIpQLSfqNECQnMkycAp2jP4Z9TFX0cGR4uf7b_fBxjY_OjhJILlKGA/viewform\">Non-commercial bespoke license</a></td>\n    <td>❌</td>\n    <td>No development related to surveillance research or military use; no harm to the public interest of society</td>\n    <td>Public</td>\n    <td>CommonCrawl, C4, GitHub, Wikipedia, etc.</td>\n    </tr>\n    <tr>\n    <td>---&gt; Alpaca</td>\n    <td>CC BY NC 4.0</td>\n    <td>❌</td>\n    <td> </td>\n    <td>CC BY NC 4.0, Subject to terms of Use of the data generated by OpenAI</td>\n    <td>LLaMA corpus, Self-Instruct</td>\n    </tr>\n    <tr>\n    <td>---&gt; Vicuna</td>\n    <td>CC BY NC 4.0</td>\n    <td>❌</td>\n    <td> </td>\n    <td>Subject to terms of Use 
of the data generated by OpenAI; <br/>Privacy Practices of ShareGPT</td>\n    <td>LLaMA corpus, 70K conversations from <a href=\"http://sharegpt.com/\">ShareGPT.com</a></td>\n    </tr>\n    <tr>\n    <td>---&gt; GPT4All</td>\n    <td>GPL-licensed LLaMA</td>\n    <td>❌</td>\n    <td> </td>\n    <td>Public</td>\n    <td><a href=\"https://huggingface.co/datasets/nomic-ai/gpt4all_prompt_generations\">GPT4All dataset</a></td>\n    </tr>\n    <tr>\n    <td>OpenLLaMA</td>\n    <td>Apache 2.0</td>\n    <td>✅</td>\n    <td> </td>\n    <td>Public</td>\n    <td><a href=\"https://www.together.xyz/blog/redpajama\">RedPajama</a></td>\n    </tr>\n    <tr>\n    <td>CodeGeeX</td>\n    <td><a href=\"https://github.com/THUDM/CodeGeeX/blob/main/MODEL_LICENSE\">The CodeGeeX License</a></td>\n    <td>❌</td>\n    <td>No use for illegal purposes or military research</td>\n    <td>Public</td>\n    <td>Pile, CodeParrot, etc.</td>\n    </tr>\n    <tr>\n    <td>StarCoder</td>\n    <td><a href=\"https://huggingface.co/spaces/bigcode/bigcode-model-license-agreement\">BigCode OpenRAIL-M v1 license</a></td>\n    <td>✅</td>\n    <td>No generating verifiably false information with the purpose of harming others; <br/>no generating content without expressly disclaiming that the text is machine generated</td>\n    <td>Public</td>\n    <td><a href=\"https://arxiv.org/pdf/2211.15533.pdf\">The Stack</a></td>\n    </tr>\n    <tr>\n    <td>MPT-7B</td>\n    <td>Apache 2.0</td>\n    <td>✅</td>\n    <td> </td>\n    <td>Public</td>\n    <td><a href=\"https://arxiv.org/abs/2010.11934\">mC4 (English)</a>, <a href=\"https://arxiv.org/pdf/2211.15533.pdf\">The Stack</a>, <a href=\"https://www.together.xyz/blog/redpajama\">RedPajama</a>, <a href=\"https://aclanthology.org/2020.acl-main.447/\">S2ORC</a></td>\n    </tr>\n    <tr>\n        <td><a href=\"https://huggingface.co/tiiuae/falcon-40b\">Falcon</a></td>\n        <td><a href=\"https://huggingface.co/tiiuae/falcon-40b/blob/main/LICENSE.txt\">TII Falcon LLM License</a></td>\n        
<td>✅/❌</td>\n        <td>Available under a license allowing commercial use</td>\n        <td>Public</td>\n        <td><a href=\"https://huggingface.co/datasets/tiiuae/falcon-refinedweb\">RefinedWeb</a></td>\n    </tr>\n    </tbody></table>\n\n## Star History\n\n[![Star History Chart](https://api.star-history.com/svg?repos=Mooler0410/LLMsPracticalGuide&type=Date)](https://star-history.com/#Mooler0410/LLMsPracticalGuide&Date)\n\n"
  },
  {
    "path": "awesome_examples/social_game_werewolf.md",
    "content": "# Lesson -- Social Game: One Night Ultimate Werewolf (Chinese/中文)\n\nIn this lesson, we ask the model to play a social game called \"One Night Ultimate Werewolf\". The game is played by a group of people, each of whom is assigned a role. The game requires players to play strategically and to lie to each other. The model is expected to play the game with other players.\n\nIt requires the following skills:\n- Understanding the rules and flow of the game.\n- Understanding the roles in the game.\n- Knowing how to play strategically.\n- Knowing how to lie when necessary (e.g., when you are a werewolf).\n- Knowing how to tell the truth when necessary (e.g., when you are a villager).\n- Knowing how to argue with other players.\n\nComparing the ChatGPT and GPT4 examples, we can see that:\n- ChatGPT cannot understand the game rules and flow, while GPT4 can. ChatGPT follows the rules of the standard Werewolf game, not those of One Night Ultimate Werewolf.\n- ChatGPT cannot understand the roles of the game, while GPT4 can. ChatGPT hallucinates a non-existent player called \"Tom\".\n- Both ChatGPT and GPT4 know how to play strategically, i.e., they know how to lie when necessary and how to tell the truth when necessary. However, ChatGPT's strategy is much more naive than GPT4's, and its arguments are less convincing.\n- ChatGPT is also constrained by its limited context length.\n- However, it's hard to differentiate the players just by the style in which they talk. All bots chat in similar tones. It's hard to tell who is who without the name as a unique identifier. 
This is a gap to fill in the future.\n\nThese findings are not conclusive, however, since the game is implemented in Chinese.\n\nCodebase: https://github.com/HMJiangGatech/GPT4-werewolf\n\n# Examples\n\n## ChatGPT Example 1\n\n```\n玩家 Eve 加入游戏.\n你是第 4 位玩家.\n你的身份是 酒鬼.\n场上有 5 位玩家.\n玩家 0: Alice\n玩家 1: Bob\n玩家 2: Charlie\n玩家 3: David\n玩家 4: Eve\n轮到你行动了.\n你是酒鬼，你可以选择交换自己的身份牌和中央区域的一张牌.\n请输入你要交换的牌的编号（0, 1, 2）: 0 \n你交换了自己的身份牌和中央区域的第 1 张牌.\n玩家 上帝 说: 天亮了，所有人睁眼！请大家开始讨论。讨论时间为3轮。\n玩家 Alice 说: 天亮了，我们有三分钟的时间来讨论一下。我昨晚没有收到什么有价值的信息。但是Charlie是爪牙，这对我们接下来的推理可能有重要影响。所以我希望领先的几位可以分享一下你们的信息和想法。\n玩家 Bob 说: 好的，我昨晚是强盗，但我没有进行交换身份牌的操作。我的想法是，我们现在应该把重心放在尽快找到狼人身上。我个人觉得，Tom昨天的发言比较可疑，他好像一直在试图把话题引开，并保持低调。我建议我们重点关注一下Tom。\n玩家 Charlie 说: 我昨晚是没有什么特殊的操作。Bob你觉得Tom可疑是因为他试图引开话题？我有同感，除此之外，我觉得David的表现也很可疑，他昨晚没有发言，而且看起来有点紧张。我建议我们同时关注Tom和David。\n玩家 David 说: 我昨晚变成了失眠者，我在讨论中跟着大家的思路走，我觉得目前的可疑人物是Tom和Alice。Tom昨晚没说什么，而且他的表现也让我感到可疑。而Alice则一直在发言，但是她的发言似乎是在转移话题。\n玩家 上帝 说: 请Eve发言。记住你晚上的信息是\"酒鬼请睁眼，你交换了自己的身份牌和中央区域的第 1 张牌 （左）. \"。根据别人的信息，如果你觉得你是狼人，不要轻易说自己是狼人，请想办法归票到别人身上。如果你觉得你不是狼人，请想办法归票到狼人身上。\n```\n\nChatGPT completely misunderstands the rules of the game, but it is fun to see how the AI plays. 
\n\n## ChatGPT Example 2\n\n```\n玩家 Charlie 加入游戏.\n你是第 2 位玩家.\n你的身份是 爪牙.\n场上有 5 位玩家.\n玩家 0: Alice\n玩家 1: Bob\n玩家 2: Charlie\n玩家 3: David\n玩家 4: Eve\n轮到你行动了.\n爪牙请睁眼，你的同伴是 Alice.\n玩家 上帝 说: 天亮了，所有人睁眼！请大家开始讨论。讨论时间为3轮。\n玩家 Alice 说: 大家好，我是Alice。我觉得我们需要先让每个人公开自己的身份，这样我们可以更好地讨论。我先声明，我不是狼人。\n玩家 Bob 说: 大家好，我是Bob。我想说的是，我晚上的信息很有用。我查看了中央区域的第1张牌和第3张牌，看到的是左边是酒鬼，右边是狼人。根据这个信息，我强烈怀疑一号玩家是狼人。\n玩家 上帝 说: 请Charlie发言。记住你晚上的信息是\"爪牙请睁眼，你的同伴是 Alice.\"。根据别人的信息，如果你觉得你是狼人，不要轻易说自己是狼人，请想办法归票到别人身上。如果你觉得你不是狼人，请想办法归票到狼人身上。\n请输入你的发言: Bob的发言非常奇怪没有理由就怀疑Alice是狼人，所以我觉得Bob心态不好，像坏人\n玩家 David 说: 大家好，我是David。看了一下之前的讨论，我也觉得Bob的怀疑Alice的言论有些草率。但是，如果Bob的信息是真实的，那么我也觉得一号玩家可能是狼人。但是，我们也不能忽略其他可能性。我认为我们需要更多的讨论和推理，而不是先做出投票决定。\n玩家 Eve 说: 大家好，我是Eve。我晚上的信息是David是失眠者。我觉得现在我们有一点混乱了，不能就凭Bob的信息就这么怀疑一号玩家。我们需要更多的证据和推理。 Charlie说得对，Bob的怀疑可能是为了转移视线，我觉得我们需要更加仔细地分析每个人的言行举止。\n玩家 Alice 说: 我理解大家对Bob的怀疑，但我觉得我们应该更加仔细地分析每个人的言行举止。比如Eve的信息很有可能是假的，她可能是狼人，试图转移我们的注意力。我认为我们需要更多的证据和推理来确认身份。根据我的信息，我知道有一位狼人，但我不确定是谁。我认为我们需要仔细分析每个人的发言和投票，才能做出更精准的推断。\n玩家 Bob 说: 我理解大家的疑虑，但是根据我的信息，我还是怀疑一号玩家是狼人。我可以告诉大家，我不是狼人，我是村民。但是我也认同Alice的看法，我们需要更多的证据来确认每个人的身份。我觉得我们需要分析每个人的言行举止，特别是在白天的投票环节。\n玩家 上帝 说: 请Charlie发言。记住你晚上的信息是\"爪牙请睁眼，你的同伴是 Alice.\"。根据别人的信息，如果你觉得你是狼人，不要轻易说自己是狼人，请想办法归票到别人身上。如果你觉得你不是狼人，请想办法归票到狼人身上。\n请输入你的发言: 这个游戏里面没有村民，更说明了Bob的可疑性。我建议投Bob\n玩家 David 说: 听了大家的讨论，我认为我们需要更多的证据来确认每个人的身份。虽然Bob的信息有些令人怀疑，但我们不能轻易地就怀疑他是狼人。我觉得我们需要更加仔细地分析每个人的言行举止，看看有没有什么矛盾和漏洞，才能做出更准确的推断。我不确定我的身份，但是我很确定我们需要更仔细地分析每个人的行为并搜寻线索。\n玩家 Eve 说: 根据Bob提供的信息，我也认为一号玩家可能是狼人。但是如果我们确定了他的身份，我们还需要再找出另外一位狼人。Charlie怀疑Bob是狼人，但除了他指责Alice外，他并未提供更多的证据。而且他推选Bob也可能是意图转移视线。我认为我们需要继续深入分析每个人的言行举止，直到我们找到足够的证据来确认每个人的身份。现在投Bob可能并不是最好的选择。\n玩家 Alice 说: 根据大家的讨论，我觉得我们不能轻易地就怀疑Bob是狼人，还需要更多的证据和推理。我晚上的信息非常有用，因为我知道至少有一位狼人。但是现在还不确定谁是狼人，所以我们需要继续仔细地观察每个人的行为，尤其是在白天的投票环节。可能有人在刻意隐瞒自己的身份，但是他们的言行举止可能会暴露出来。所以我们需要注意观察每个人的行为，尽可能地找出线索。\n玩家 Bob 说: 我明白大家对我的怀疑，但是我还是有理由怀疑一号玩家是狼人。我并不认为自己是狼人，我是村民。我同意大家的看法，我们需要更多的证据和推理来确认每个人的身份，不能凭空地就怀疑某个人。我认为我们需要更加细致地分析每个人的言行举止，找出可能的矛盾和漏洞。所以我支持继续讨论，直到我们找到足够的线索来确认每个人的身份。\n玩家 上帝 说: 请Charlie发言。记住你晚上的信息是\"爪牙请睁眼，你的同伴是 
Alice.\"。根据别人的信息，如果你觉得你是狼人，不要轻易说自己是狼人，请想办法归票到别人身上。如果你觉得你不是狼人，请想办法归票到狼人身上。\n请输入你的发言: 我觉得Eve也很可疑，他说David是失眠者，但是也没给出证据\n玩家 David 说: 根据大家的讨论，我也觉得我们需要更多的证据和线索来确认每个人的身份。Bob的信息让我怀疑一号玩家是狼人，但我们需要更多的证据来确认。Eve的信息可能是伪造的，但我也不能完全排除他是告诉了真相的可能性。Charlie指责Eve和Bob也让我感到疑虑，因为他并没有提供更多的证据。总之，我们需要更加仔细地分析每个人的言行举止，找出可能的矛盾和漏洞，才能做出更准确的推断。\n玩家 Eve 说: 我认为我们需要更加仔细地分析每个人的言行举止，找出线索来确认每个人的身份。我晚上的信息是David是失眠者，但这并不能完全确认他的身份。我同意大家的看法，需要更多的证据和推理\nTraceback (most recent call last):\n  File \"agent.py\", line 594, in <module>\n    gm.play_game()\n  File \"agent.py\", line 155, in play_game\n    self.game_end()\n  File \"agent.py\", line 191, in game_end\n    vote = player.day_phase_vote()\n  File \"agent.py\", line 279, in day_phase_vote\n    msg = self.day_phase_discussion()\n  File \"agent.py\", line 262, in day_phase_discussion\n    completion = openai.ChatCompletion.create(\n  File \"/home/hjiang98/anaconda3/lib/python3.8/site-packages/openai/api_resources/chat_completion.py\", line 25, in create\n    return super().create(*args, **kwargs)\n  File \"/home/hjiang98/anaconda3/lib/python3.8/site-packages/openai/api_resources/abstract/engine_api_resource.py\", line 153, in create\n    response, _, api_key = requestor.request(\n  File \"/home/hjiang98/anaconda3/lib/python3.8/site-packages/openai/api_requestor.py\", line 226, in request\n    resp, got_stream = self._interpret_response(result, stream)\n  File \"/home/hjiang98/anaconda3/lib/python3.8/site-packages/openai/api_requestor.py\", line 619, in _interpret_response\n    self._interpret_response_line(\n  File \"/home/hjiang98/anaconda3/lib/python3.8/site-packages/openai/api_requestor.py\", line 679, in _interpret_response_line\n    raise self.handle_error_response(\nopenai.error.InvalidRequestError: This model's maximum context length is 4097 tokens. However, your messages resulted in 4253 tokens. 
Please reduce the length of the messages.\n```\n\nChatGPT can not complete the game, due to the limited context length.\n\n## GPT-4 Example 1\n\n```\n玩家 Eve 加入游戏.\n你是第 4 位玩家.\n你的身份是 预言家.\n场上有 5 位玩家.\n玩家 0: Alice\n玩家 1: Bob\n玩家 2: Charlie\n玩家 3: David\n玩家 4: Eve\n轮到你行动了.\n你是预言家，你可以查看一位玩家或者两张中央区域的牌.\n请输入你的选择（0:玩家/1:中央区域）: 1\n请输入你要查看的第一张牌的编号: 0\n你查看了中央区域的第 1 张牌 （左）. \n第 1 张牌是 强盗.\n请输入你要查看的第二张牌的编号: 1\n你查看了中央区域的第 2 张牌 （中）. \n第 2 张牌是 酒鬼.\n玩家 上帝 说: 天亮了，所有人睁眼！请大家开始讨论。讨论时间为3轮。\n玩家 Alice 说: 大家好，我在昨晚觉得有些奇怪，我觉得我是预言家，我查看了Bob的身份，他似乎是一个捣蛋鬼。那么我们可以从其他信息中寻找狼人的线索，请大家分享一下各自的信息。\n玩家 Bob 说: Alice，既然你是预言家，我就告诉你一个情况。我一开始的确是捣蛋鬼，但后来我发现自己可能身份发生了改变。因为昨晚我交换了Carol和David的身份牌，而且我觉得David有可能是狼人。大家可以先考虑下这个线索。\n玩家 Charlie 说: 各位，我想告诉大家一个情况，我原本是失眠者，但昨晚我重新察看了自己的身份牌发现自己变成了酒鬼。根据Bob的说法，他交换了Carol和David的身份牌，而David可能是狼人。我觉得这个线索很有可能是真的。大家可以考虑一下这个信息，找出狼人。\n玩家 David 说: 各位，我觉得这里有些许矛盾。首先，我昨晚是狼人，我的同伴是Bob。但是根据Alice的说法，她是预言家，她查看了Bob的身份牌，说他是捣蛋鬼。如果Bob真的是捣蛋鬼，那么他应该不会是我的同伴狼人。而且，根据Charlie的说法，他原本是失眠者，但现在变成了酒鬼。这种情况下，我很有可能已经不再是狼人了。所以我建议我们应该更加仔细地分析各自的身份，找出潜在的狼人。\n玩家 上帝 说: 请Eve发言。记住你晚上的信息是\"预言家请睁眼，你查看了中央区域的第 1 张牌是 强盗, 第 2 张牌是 酒鬼.\"。根据别人的信息，如果你觉得你是狼人，不要轻易说自己是狼人，请想办法归票到别人身上。如果你觉得你不是狼人，请想办法归票到狼人身上。\n请输入你的发言: 我才是真的预言家，昨天晚上我看见了中央区域的左边的牌是强盗， 中间的牌是酒鬼。所以Alice在说谎话。Charlie也在说谎，应为酒鬼在下面不可能被换上来最后换给失眠者。所以Charlie不是失眠者。我倾向于相信David和Bob。所以Charlie很有可能就是狼人\n玩家 Alice 说: Eve, 我明白你的怀疑，但是我真的看了Bob的身份证，并发现他是捣蛋鬼。我没必要撒谎。而关于Charlie的话，我觉得有可能他变成了酒鬼，但也有可能他变成了其他角色。所以我们不能排除其他可能性。虽然David说他原本是狼人，但现在可能已经不是，所以我们这边应该继续关注其他玩家的身份。请大家继续分享信息，我们要确保找到狼人。\n玩家 Bob 说: Alice和Eve，我明白你们的观点。实际上，我昨晚确实是捣蛋鬼，但现在我也不能确定自己是不是狼人，因为我的身份可能已经改变。关于David的说法，如果他现在已经不是狼人，那么我们应该去寻找谁替换了他的身份。我觉得我们应该把注意力集中在Charlie身上，因为根据Eve的说法，他可能跟酒鬼有关，而我觉得酒鬼也是一个潜在的狼人身份。所以我们应该尽快找出狼人，确保我们的胜利。\n玩家 Charlie 说: 各位，实际上我现在也不能确定自己是否是狼人，因为我身份发生了变化。根据Eve的说法，我可能跟酒鬼有些关联，但我不能肯定是不是变成了酒鬼。从Bob的说法来看，他有可能是狼人，因为他当时是捣蛋鬼，但现在不能确定他的身份了。\n\n在这种情况下，我觉得我们应该更多地关注那些没有提到自己身份的玩家。我们不能排除可能有人在隐瞒自己的真实身份。所以我建议大家多分享一些信息，试图找出狼人。同时，我们也应该关注David的身份，因为他说自己曾经是狼人，所以有可能现在狼人就在他这边。大家继续讨论，我们要确保找到狼人。\n玩家 David 说: 
各位，我理解大家对我身份的担忧，但现在我也不能肯定自己是否还是狼人。从目前的情况来看，Bob和我都有可能是狼人，但我们也不能排除其他玩家的可能性。我觉得我们应该更关注那些没有透露自己身份的玩家，因为有可能他们是在隐瞒自己的真实身份。此外，我认为Eve可能真的是预言家，如果她说的是真的，那么中央区域的强盗和酒鬼牌就已经被确认了。这对我们分析身份也有帮助。\n\n请大家继续讨论，尽可能地分享更多信息。我们要找出狼人，确保我们的胜利。\n玩家 上帝 说: 请Eve发言。记住你晚上的信息是\"预言家请睁眼，你查看了中央区域的第 1 张牌是 强盗, 第 2 张牌是 酒鬼.\"。根据别人的信息，如果你觉得你是狼人，不要轻易说自己是狼人，请想办法归票到别人身上。如果你觉得你不是狼人，请想办法归票到狼人身上。\n请输入你的发言: 至少没有人质疑Bob的说法和David的说法，我还是认为Charlie是狼人。除非后面有人改变了说法。如果你没有讲出事实，请务必在下一轮也就是最后一轮说清事实\n玩家 Alice 说: Eve，我理解你的观点，但我们不能仅仅凭借目前的信息就认定Charlie一定是狼人。我们还需要更多确凿的证据。同时，根据David和Bob的说法，我们还不能确定他们现在是否是狼人，因此我们不能排除其他可能性。\n\n请大家最后一轮继续分享信息，我们要确保找出狼人。不论如何，我们要在这轮结束之前找到最可疑的对象进行投票。大家加油！\n玩家 Bob 说: 各位，我们已经进入最后一轮讨论了。从目前的信息来看，虽然David、Charlie和我都有可能是狼人，但我们仍然需要更多的证据来确认。我觉得我们不能轻易下结论，否则可能会错过真实的狼人。\n\n我们要充分地利用最后一轮讨论，继续分享信息，寻找潜在的狼人。请大家慎重思考，提出自己的看法。我们要确保找到真正的狼人，切勿忽略任何线索。大家加油！\n玩家 Charlie 说: 各位，这已经是我们最后一轮讨论了。我知道有些玩家怀疑我是狼人，但实际上我现在也不能确定自己的身份。从目前的信息来看，David、Bob和我都有可能成为狼人，我们需要更多的证据来找出真正的狼人。\n\n在最后一轮讨论中，请大家认真思考，分享自己的观点和信息。我们不能因为怀疑而轻易地定性某人是狼人。为了村庄的胜利，请大家继续努力，争取找出潜在的狼人。希望大家能够慎重投票，确保我们可以找到真正的狼人！大家加油！\n玩家 David 说: 大家好，这是我们的最后一轮讨论。从目前的信息来看，我、Bob和Charlie都有可能是狼人。尽管我开始是狼人，但现在我可能已经不是了。我觉得我们应该在最后一轮讨论中充分分享信息，从而确定谁才是狼人。如果你们相信Eve是真正的预言家，那么我们可以根据她所说的中央区域的身份牌来推断其他玩家的身份。\n\n请大家在最后一轮讨论中认真思考，提出自己的看法，我们要确保找出真正的狼人。我们不能因为怀疑而下定论，否则可能会误判。希望大家能够慎重投票，确保我们可以找到真正的狼人，为村庄的胜利努力！大家加油！\n玩家 上帝 说: 请Eve发言。记住你晚上的信息是\"预言家请睁眼，你查看了中央区域的第 1 张牌是 强盗, 第 2 张牌是 酒鬼.\"。根据别人的信息，如果你觉得你是狼人，不要轻易说自己是狼人，请想办法归票到别人身上。如果你觉得你不是狼人，请想办法归票到狼人身上。\n请输入你的发言: 我还是会投Charlie，既然没有人归票，我就来归这个票，希望好人都跟我投Charlie。如果票太分散狼人可能会因此逃脱\n玩家 上帝 说: Alice 投票结果：Charlie\n玩家 上帝 说: Bob 投票结果：Charlie\n玩家 上帝 说: Charlie 投票结果：David\n玩家 上帝 说: David 投票结果：Charlie\n玩家 上帝 说: 请投票。请输出你要投票的玩家名字（Alice,Bob,Charlie,David,Eve）或者输出\"弃票\"。\n请输入你的投票目标: Charlie\n玩家 上帝 说: Eve 投票结果：Charlie\n玩家 上帝 说: 投票结果为{'Alice': 0, 'Bob': 0, 'Charlie': 4, 'David': 1, 'Eve': 0}\n玩家 上帝 说: Charlie被投票出局。他的身份是失眠者。\n玩家 上帝 说: 狼人阵营胜利。\n```"
  },
  {
    "path": "awesome_examples/tableQA.md",
    "content": "# Lesson -- Table-based Question Answering\n\nIn this lesson, we ask the model to answer questions based on a table. The table is provided in the prompt, and the model is expected to answer questions about it.\n\nThe following two examples present the same table with its rows in different orders. ChatGPT is vulnerable to this row order perturbation, while GPT4 is robust to it. This robustness is probably due to two factors. First, GPT4 has a larger model size and more pretraining data. Second, better truthfulness stemming from better RLHF alignment could help GPT4 follow different formats of the same instructions more reliably.\n\nNote that smaller finetuned models suffer heavily from this lack of robustness, according to the paper [TableFormer: Robust Transformer Modeling for Table-Text Encoding](https://arxiv.org/pdf/2203.00274.pdf).\n\n# Example 1 (2023/04/29)\n\n## ChatGPT\n\n### Round 1\n\n#### Prompting Input\n\n```\nCan you understand the table: | Title        | Producers       | Length |\n| ------------ | --------------- | ------ |\n| Screwed Up   | Mr. Lee         | 5:02   |\n| Smile        | Sean T          | 4:32   |\n| Ghetto Queen | I.N.F.O. & NOVA | 5:00   |\n```\n\n#### Output\n\n```\nYes, I can understand the table. It represents a list of three songs with their titles, producers, and length. Each row of the table contains information about a single song. The first column titled \"Title\" represents the name of the song. The second column titled \"Producers\" indicates the names of the individuals or groups who produced the song. The third column titled \"Length\" shows the duration of each song in minutes and seconds.\n\nFor example, the first row shows that the song titled \"Screwed Up\" was produced by Mr. Lee and has a length of 5 minutes and 2 seconds. The second row shows that the song titled \"Smile\" was produced by Sean T and has a length of 4 minutes and 32 seconds. 
The third row shows that the song titled \"Ghetto Queen\" was produced by I.N.F.O. & NOVA and has a length of 5 minutes.\n```\n\n### Round 2:\n\n#### Prompting Input\n\n```\nOf all song lengths, which one is the longest?\n```\n\n#### Output\n\n```\nBased on the given table, the longest song length is 5 minutes and 2 seconds, which belongs to the song titled \"Screwed Up\" produced by Mr. Lee.\n```\n\n### Evidence\n\n![alt text](imgs/example_1_chatgpt_evidence.png)\n\n## GPT4\n\n### Round 1\n\n#### Prompting Input\n\n```\nCan you understand the table: | Title        | Producers       | Length |\n| ------------ | --------------- | ------ |\n| Screwed Up   | Mr. Lee         | 5:02   |\n| Smile        | Sean T          | 4:32   |\n| Ghetto Queen | I.N.F.O. & NOVA | 5:00   |\n```\n\n#### Output\n\n```\nYes, I understand the table you provided. It is a simple table with three columns, each with a specific type of information: Title, Producers, and Length. The table lists three items (presumably songs) with their respective titles, the names of the producers who worked on them, and the length of each song. Here is the information from the table:\n\nTitle: Screwed Up\nProducers: Mr. Lee\nLength: 5:02 (5 minutes and 2 seconds)\n\nTitle: Smile\nProducers: Sean T\nLength: 4:32 (4 minutes and 32 seconds)\n\nTitle: Ghetto Queen\nProducers: I.N.F.O. & NOVA\nLength: 5:00 (5 minutes)\n```\n\n### Round 2:\n\n#### Prompting Input\n\n```\nOf all song lengths, which one is the longest?\n```\n\n#### Output\n\n```\nThe longest song in the table is \"Screwed Up,\" with a length of 5 minutes and 2 seconds (5:02).\n```\n\n### Evidence\n\n![alt text](imgs/example_1_gpt4_evidence.png)\n\n\n# Example 2 (2023/04/29)\n\n## ChatGPT\n\n### Round 1\n\n#### Prompting Input\n\n```\nCan you understand the table: | Title        | Producers       | Length |\n| ------------ | --------------- | ------ |\n| Smile        | Sean T          | 4:32   |\n| Ghetto Queen | I.N.F.O. 
& NOVA | 5:00   |\n| Screwed Up   | Mr. Lee         | 5:02   |\n```\n\n#### Output\n\n```\nYes, I can understand the table. It consists of three columns: \"Title,\" \"Producers,\" and \"Length.\" Each row represents a different song and provides information about the title of the song, who produced it, and how long it is. The first song is called \"Smile,\" produced by Sean T, and is 4 minutes and 32 seconds long. The second song is called \"Ghetto Queen,\" produced by I.N.F.O. and NOVA, and is 5 minutes long. The third song is called \"Screwed Up,\" produced by Mr. Lee, and is 5 minutes and 2 seconds long.\n```\n\n### Round 2:\n\n#### Prompting Input\n\n```\nOf all song lengths, which one is the longest?\n```\n\n#### Output\n\n```\nThe longest song length in the table is \"Ghetto Queen\" with a length of 5:00 minutes.\n```\n\n### Evidence\n\n![alt text](imgs/example_2_chatgpt_evidence.png)\n\n## GPT4\n\n### Round 1\n\n#### Prompting Input\n\n```\nCan you understand the table: | Title        | Producers       | Length |\n| ------------ | --------------- | ------ |\n| Smile        | Sean T          | 4:32   |\n| Ghetto Queen | I.N.F.O. & NOVA | 5:00   |\n| Screwed Up   | Mr. Lee         | 5:02   |\n```\n\n#### Output\n\n```\nYes, I can understand the table. It contains information about three different items, likely songs, with their respective titles, producers, and lengths:\n\nTitle: Smile\nProducers: Sean T\nLength: 4 minutes and 32 seconds\n\nTitle: Ghetto Queen\nProducers: I.N.F.O. & NOVA\nLength: 5 minutes\n\nTitle: Screwed Up\nProducers: Mr. Lee\nLength: 5 minutes and 2 seconds\n```\n\n### Round 2:\n\n#### Prompting Input\n\n```\nOf all song lengths, which one is the longest?\n```\n\n#### Output\n\n```\nOf all the song lengths listed, \"Screwed Up\" is the longest, with a length of 5 minutes and 2 seconds.\n```\n\n### Evidence\n\n![alt text](imgs/example_2_chatgpt_evidence.png)"
  },
  {
    "path": "source/README.md",
    "content": "### Change log\n\n- V1 (04/07/2023): First version of the figure.\n- V2 (04/29/2023): Second version of the figure. (The GIF version has not been updated.)\n- V3 (08/06/2023): Added Claude 2 and Llama-2-Chat.\n"
  }
]