[
  {
    "path": "README.md",
    "content": "<div align=\"center\">\n  <h1>🛠️ Awesome LMs with Tools</h1>\n  <a href=\"https://awesome.re\">\n    <img src=\"https://awesome.re/badge.svg\" alt=\"Awesome\">\n  </a>\n  <a href=\"https://img.shields.io/badge/PRs-Welcome-red\">\n    <img src=\"https://img.shields.io/badge/PRs-Welcome-yellow\" alt=\"PRs Welcome\">\n  </a>\n  <a href=\"https://img.shields.io/badge/arXiv-2403.15452-b31b1b.svg\">\n    <img src=\"https://img.shields.io/badge/arXiv-2403.15452-b31b1b.svg\" alt=\"arXiv\">\n  </a>\n</div>\n\nLanguage models (LMs) are powerful, yet they are built mostly for text-generation tasks. Tools substantially enhance their performance on tasks that require complex skills.\n\nBased on our recent survey of LM-used tools, [\"What Are Tools Anyway? A Survey from the Language Model Perspective\"](https://arxiv.org/pdf/2403.15452), we provide a structured list of literature relevant to tool-augmented LMs.\n\n- Tool basics ($\\S2$)\n- Tool use paradigm ($\\S3$)\n- Scenarios ($\\S4$)\n- Advanced methods ($\\S5$)\n- Evaluation ($\\S6$)\n\nIf you find our paper or code useful, please cite the paper:\n\n```bibtex\n@article{wang2024what,\n  title={What Are Tools Anyway? A Survey from the Language Model Perspective},\n  author={Zhiruo Wang and Zhoujun Cheng and Hao Zhu and Daniel Fried and Graham Neubig},\n  journal={arXiv preprint arXiv:2403.15452},\n  year={2024}\n}\n```\n\n## $\\S2$ Tool Basics\n\n### $\\S2.1$ What are tools? 🛠️\n\n-  Definition and discussion of animal-used tools\n\n   **Animal tool behavior: the use and manufacture of tools by animals** *Shumaker, Robert W., Kristina R. Walkup, and Benjamin B. 
Beck.* 2011 [[Book]](https://books.google.com/books?hl=en&lr=&id=Dx7slq__udwC&oi=fnd&pg=PT1&dq=Animal+tool+behavior:+the+use+and+manufacture+of+tools+by+animals&ots=Wf6GmSG4uI&sig=48hv2QSipGyuCcucX-GnSJHscn8#v=onepage&q=Animal%20tool%20behavior%3A%20the%20use%20and%20manufacture%20of%20tools%20by%20animals&f=false)\n\n-  Early discussions on LM-used tools\n\n   **ToolLLM: Facilitating Large Language Models to Master 16000+ Real-world APIs** *Qin, Yujia, et al.* 2023.07 [[Paper]](https://openreview.net/forum?id=dHng2O0Jjr)\n\n- A survey on augmented LMs, including tool augmentation\n  \n  **Augmented Language Models: a Survey** *Mialon, Grégoire, et al.* 2023.02 [[Paper]](https://openreview.net/forum?id=jh7wH2AzKK)\n\n### $\\S2.3$ Tools and \"Agents\" 🤖\n\n- Definition of agents\n  \n  **Artificial Intelligence: A Modern Approach** *Russell, Stuart J., and Peter Norvig.* 2016 [[Book]](https://thuvienso.hoasen.edu.vn/handle/123456789/8967)\n\n- Survey about agents that perceive and act in the environment\n  \n  **The Rise and Potential of Large Language Model Based Agents: A Survey** *Xi, Zhiheng, et al.* 2023.09 [[Preprint]](https://arxiv.org/abs/2309.07864)\n\n- Survey about the cognitive architectures for language agents\n\n  **Cognitive Architectures for Language Agents** *Sumers, Theodore R., et al.* 2023.09 [[Paper]](https://openreview.net/forum?id=1i6ZCvflQJ)\n\n## $\\S3$ The basic tool use paradigm\n\n- Early works that set up the commonly used tooling paradigm\n  \n  **Toolformer: Language Models Can Teach Themselves to Use Tools** *Schick, Timo, et al.* 2024 [[Paper]](https://openreview.net/forum?id=Yacmpz84TH&referrer=%5Bthe%20profile%20of%20Roberto%20Dessi%5D(%2Fprofile%3Fid%3D~Roberto_Dessi1))\n\n### Inference-time prompting\n\n- Provide in-context examples for tool use on visual programming problems\n  \n  **Visual Programming: Compositional visual reasoning without training** *Gupta, Tanmay, and Aniruddha Kembhavi.* 2023 
[[Paper]](https://openaccess.thecvf.com/content/CVPR2023/papers/Gupta_Visual_Programming_Compositional_Visual_Reasoning_Without_Training_CVPR_2023_paper.pdf)\n\n- Tool learning via in-context examples on reasoning problems involving text or multi-modal inputs\n  \n  **Chameleon: Plug-and-Play Compositional Reasoning with Large Language Models** *Lu, Pan, et al.* 2024 [[Paper]](https://openreview.net/forum?id=HtqnVSCj3q&referrer=%5Bthe%20profile%20of%20Pan%20Lu%5D(%2Fprofile%3Fid%3D~Pan_Lu2))\n\n- In-context learning based tool using for reasoning problems in BigBench and MMLU\n  \n  **ART: Automatic multi-step reasoning and tool-use for large language models** *Paranjape, Bhargavi, et al.* 2023.03 [[Preprint]](https://arxiv.org/abs/2303.09014)\n\n- Providing tool documentation for in-context tool learning\n  \n  **Tool Documentation Enables Zero-Shot Tool-Usage with Large Language Models** *Hsieh, Cheng-Yu, et al.* 2023.08 [[Preprint]](https://arxiv.org/abs/2308.00675)\n\n### Learning by training\n\n- Training on human annotated examples of (NL input, tool-using solution output) pairs\n  \n  **API-Bank: A Comprehensive Benchmark for Tool-Augmented LLMs** *Li, Minghao, et al.* 2023.12 [[Paper]](https://aclanthology.org/2023.emnlp-main.187/)\n  \n  **Calc-X and Calcformers: Empowering Arithmetical Chain-of-Thought through Interaction with Symbolic Systems** *Kadlčík, Marek, et al.* 2023 [[Paper]](https://aclanthology.org/2023.emnlp-main.742.pdf)\n  \n- Training on model-synthesized examples\n  \n  **ToolAlpaca: Generalized Tool Learning for Language Models with 3000 Simulated Cases** *Tang, Qiaoyu, et al.* 2023.06 [[Preprint]](https://arxiv.org/abs/2306.05301)\n  \n  **ToolLLM: Facilitating Large Language Models to Master 16000+ Real-world APIs** *Qin, Yujia, et al.* 2023.07 [[Paper]](https://openreview.net/forum?id=dHng2O0Jjr)\n  \n  **MetaTool Benchmark for Large Language Models: Deciding Whether to Use Tools and Which to Use** *Huang, Yue, et al.* 2023.10 
[[Paper]](https://openreview.net/forum?id=R0c2qtalgG&referrer=%5Bthe%20profile%20of%20Neil%20Zhenqiang%20Gong%5D(%2Fprofile%3Fid%3D~Neil_Zhenqiang_Gong1))\n\n  **Making Language Models Better Tool Learners with Execution Feedback** *Qiao, Shuofei, et al.* 2023.05 [[Preprint]](https://arxiv.org/abs/2305.13068)\n  \n  **LLMs in the Imaginarium: Tool Learning through Simulated Trial and Error** *Wang, Boshi, et al.* 2024.03 [[Preprint]](https://arxiv.org/abs/2403.04746)\n\n- Self-training with bootstrapped examples\n  \n  **Toolformer: Language Models Can Teach Themselves to Use Tools** *Schick, Timo, et al.* 2024 [[Paper]](https://openreview.net/forum?id=Yacmpz84TH&referrer=%5Bthe%20profile%20of%20Roberto%20Dessi%5D(%2Fprofile%3Fid%3D~Roberto_Dessi1))\n\n## $\\S4$ Scenarios\n\n### Knowledge access 📚\n\n- Collect data from structured knowledge sources, e.g., databases and knowledge graphs\n  \n  **LaMDA: Language Models for Dialog Applications** *Thoppilan, Romal, et al.* 2022.01 [[Paper]](https://arxiv.org/abs/2201.08239)\n  \n  **TALM: Tool Augmented Language Models** *Parisi, Aaron, Yao Zhao, and Noah Fiedel.* 2022.05 [[Preprint]](https://arxiv.org/abs/2205.12255)\n  \n  **ToolkenGPT: Augmenting Frozen Language Models with Massive Tools via Tool Embeddings** *Hao, Shibo, et al.* 2024 [[Paper]](https://openreview.net/forum?id=BHXsb69bSx)\n  \n  **ToolQA: A Dataset for LLM Question Answering with External Tools** *Zhuang, Yuchen, et al.* 2024 [[Paper]](https://openreview.net/forum?id=pV1xV2RK6I)\n\n  **Middleware for LLMs: Tools are Instrumental for Language Agents in Complex Environments** *Gu, Yu, et al.* 2024 [[Paper]](https://arxiv.org/abs/2402.14672)\n\n  **GeneGPT: Augmenting Large Language Models with Domain Tools for Improved Access to Biomedical Information** *Jin, Qiao, et al.* 2024 [[Paper]](https://academic.oup.com/bioinformatics/article/40/2/btae075/7606338)\n\n- Search for information on the web\n  \n  **Internet-augmented language models through 
few-shot prompting for open-domain question answering** *Lazaridou, Angeliki, et al.* 2022.03 [[Paper]](https://arxiv.org/abs/2203.05115)\n  \n  **Internet-Augmented Dialogue Generation** *Komeili, Mojtaba, Kurt Shuster, and Jason Weston.* 2022 [[Paper]](https://aclanthology.org/2022.acl-long.579/)\n\n- Viewing retrieval models as tools under the retrieval-augmented generation context\n  \n  **Retrieval-based Language Models and Applications** *Asai, Akari, et al.* 2023 [[Tutorial]](https://aclanthology.org/2023.acl-tutorials.6/)\n  \n  **Augmented Language Models: a Survey** *Mialon, Grégoire, et al.* 2023.02 [[Paper]](https://openreview.net/forum?id=jh7wH2AzKK)\n\n### Computation activities 🔣\n\n- Using calculator for math calculations\n  \n  **Toolformer: Language Models Can Teach Themselves to Use Tools** *Schick, Timo, et al.* 2024 [[Paper]](https://openreview.net/forum?id=Yacmpz84TH&referrer=%5Bthe%20profile%20of%20Roberto%20Dessi%5D(%2Fprofile%3Fid%3D~Roberto_Dessi1))\n\n  **Calc-X and Calcformers: Empowering Arithmetical Chain-of-Thought through Interaction with Symbolic Systems** *Kadlčík, Marek, et al.* 2023 [[Paper]](https://aclanthology.org/2023.emnlp-main.742.pdf)\n\n- Using programs/Python interpreter to perform more complex operations\n  \n  **Pal: Program-aided language models** *Gao, Luyu, et al.* 2023 [[Paper]](https://dl.acm.org/doi/10.5555/3618408.3618843)\n  \n  **Program of Thoughts Prompting: Disentangling Computation from Reasoning for Numerical Reasoning Tasks** *Chen, Wenhu, et al.* 2022.11 [[Paper]](https://openreview.net/forum?id=YfZ4ZPt8zd)\n  \n  **Mint: Evaluating llms in multi-turn interaction with tools and language feedback** *Wang, Xingyao, et al.* 2023.09 [[Paper]](https://openreview.net/forum?id=jp3gWrMuIZ&referrer=%5Bthe%20profile%20of%20Hao%20Peng%5D(%2Fprofile%3Fid%3D~Hao_Peng4))\n\n  **MATHSENSEI: A Tool-Augmented Large Language Model for Mathematical Reasoning** *Das, Debrup, et al.* 2024 
[[Paper]](https://aclanthology.org/2024.naacl-long.54/)\n\n  **ToRA: A Tool-Integrated Reasoning Agent for Mathematical Problem Solving** *Gou, Zhibin, et al.* 2023.09 [[Paper]](https://openreview.net/forum?id=Ep0TtjVoap)\n\n- Tools for more advanced business activities, e.g., financial, medical, education, etc.\n  \n  **On the Tool Manipulation Capability of Open-source Large Language Models** *Xu, Qiantong, et al.* 2023.05 [[Paper]](https://openreview.net/forum?id=iShM3YolRY&referrer=%5Bthe%20profile%20of%20Changran%20Hu%5D(%2Fprofile%3Fid%3D~Changran_Hu1))\n  \n  **ToolAlpaca: Generalized Tool Learning for Language Models with 3000 Simulated Cases** *Tang, Qiaoyu, et al.* 2023.06 [[Preprint]](https://arxiv.org/abs/2306.05301)\n  \n  **Mint: Evaluating llms in multi-turn interaction with tools and language feedback** *Wang, Xingyao, et al.* 2023.09 [[Paper]](https://openreview.net/forum?id=jp3gWrMuIZ&referrer=%5Bthe%20profile%20of%20Hao%20Peng%5D(%2Fprofile%3Fid%3D~Hao_Peng4))\n\n  **AgentMD: Empowering Language Agents for Risk Prediction with Large-Scale Clinical Tool Learning** *Jin, Qiao, et al.* 2024.02 [[Paper]](https://arxiv.org/abs/2402.13225)\n\n### Interaction with the world 🌐\n\n- Access real-time or real-world information such as weather, location, etc.\n  \n  **On the Tool Manipulation Capability of Open-source Large Language Models** *Xu, Qiantong, et al.* 2023.05 [[Paper]](https://openreview.net/forum?id=iShM3YolRY&referrer=%5Bthe%20profile%20of%20Changran%20Hu%5D(%2Fprofile%3Fid%3D~Changran_Hu1))\n  \n  **ToolAlpaca: Generalized Tool Learning for Language Models with 3000 Simulated Cases** *Tang, Qiaoyu, et al.* 2023.06 [[Preprint]](https://arxiv.org/abs/2306.05301)\n\n- Managing personal events such as calendar or emails\n  \n  **Toolformer: Language Models Can Teach Themselves to Use Tools** *Schick, Timo, et al.* 2024 
[[Paper]](https://openreview.net/forum?id=Yacmpz84TH&referrer=%5Bthe%20profile%20of%20Roberto%20Dessi%5D(%2Fprofile%3Fid%3D~Roberto_Dessi1))\n\n- Tools in embodied environments, e.g., the Minecraft world\n  \n  **Voyager: An Open-Ended Embodied Agent with Large Language Models** *Wang, Guanzhi, et al.* 2023.05 [[Paper]](https://openreview.net/forum?id=ehfRiF0R3a)\n\n- Tools interacting with the physical world\n  \n  **ProgPrompt: Generating Situated Robot Task Plans using Large Language Models** *Singh, Ishika, et al.* 2023 [[Paper]](https://openreview.net/forum?id=3K4-U_5cRw)\n  \n  **Alfred: A benchmark for interpreting grounded instructions for everyday tasks** *Shridhar, Mohit, et al.* 2020 [[Paper]](https://openaccess.thecvf.com/content_CVPR_2020/papers/Shridhar_ALFRED_A_Benchmark_for_Interpreting_Grounded_Instructions_for_Everyday_Tasks_CVPR_2020_paper.pdf)\n  \n  **Autonomous chemical research with large language models** *Boiko, Daniil A., et al.* 2023 [[Paper]](https://www.nature.com/articles/s41586-023-06792-0)\n\n### Non-textual modalities 🎞️\n\n- Tools providing access to information in non-textual modalities\n  \n  **Vipergpt: Visual inference via python execution for reasoning** *Surís, Dídac, Sachit Menon, and Carl Vondrick.* 2023 [[Paper]](https://openaccess.thecvf.com/content/ICCV2023/papers/Suris_ViperGPT_Visual_Inference_via_Python_Execution_for_Reasoning_ICCV_2023_paper.pdf)\n  \n  **MM-REACT: Prompting ChatGPT for Multimodal Reasoning and Action** *Yang, Zhengyuan, et al.* 2023.03 [[Preprint]](https://arxiv.org/abs/2303.11381)\n  \n  **AssistGPT: A General Multi-modal Assistant that can Plan, Execute, Inspect, and Learn** *Gao, Difei, et al.* 2023.06 [[Preprint]](https://arxiv.org/abs/2306.08640)\n\n- Tools that can answer questions about data in other modalities\n  \n  **Visual Programming: Compositional visual reasoning without training** *Gupta, Tanmay, and Aniruddha Kembhavi.* 2023 
[[Paper]](https://openaccess.thecvf.com/content/CVPR2023/papers/Gupta_Visual_Programming_Compositional_Visual_Reasoning_Without_Training_CVPR_2023_paper.pdf)\n\n### Special-skilled models 🤗\n\n- Text-generation models that can perform specific tasks, e.g., question answering, machine translation\n  \n  **Toolformer: Language Models Can Teach Themselves to Use Tools** *Schick, Timo, et al.* 2024 [[Paper]](https://openreview.net/forum?id=Yacmpz84TH&referrer=%5Bthe%20profile%20of%20Roberto%20Dessi%5D(%2Fprofile%3Fid%3D~Roberto_Dessi1))\n  \n  **ART: Automatic multi-step reasoning and tool-use for large language models** *Paranjape, Bhargavi, et al.* 2023.03 [[Preprint]](https://arxiv.org/abs/2303.09014)\n\n- Integration of available models on Huggingface, TorchHub, TensorHub, etc.\n  \n  **HuggingGPT: Solving AI Tasks with ChatGPT and its Friends in Hugging Face** *Shen, Yongliang, et al.* 2024 [[Paper]](https://openreview.net/forum?id=yHdTscY6Ci)\n  \n  **Gorilla: Large language model connected with massive apis** *Patil, Shishir G., et al.* 2023.05 [[Paper]](https://arxiv.org/abs/2305.15334)\n  \n  **Taskbench: Benchmarking large language models for task automation** *Shen, Yongliang, et al.* 2023.11 [[Paper]](https://openreview.net/forum?id=70xhiS0AQS&referrer=%5Bthe%20profile%20of%20Xu%20Tan%5D(%2Fprofile%3Fid%3D~Xu_Tan1))\n\n## $\\S5$ Advanced methods\n\n### $\\S5.1$ Complex tool selection and usage 🧐\n\n- Train retrievers that map natural language instructions to tool documentation\n  \n  **DocPrompting: Generating Code by Retrieving the Docs** *Zhou, Shuyan, et al.* 2022.07 [[Paper]](https://openreview.net/forum?id=ZTCxT2t2Ru)\n  \n  **ToolLLM: Facilitating Large Language Models to Master 16000+ Real-world APIs** *Qin, Yujia, et al.* 2023.07 [[Paper]](https://openreview.net/forum?id=dHng2O0Jjr)\n\n- Ask LMs to write hypothetical tool descriptions and search relevant tools\n  \n  **CRAFT: Customizing LLMs by Creating and Retrieving from Specialized Toolsets** 
*Yuan, Lifan, et al.* 2023.09 [[Paper]](https://arxiv.org/abs/2309.17428)\n\n- Complex tool usage, e.g., parallel calls\n  \n  **Function Calling and Other API Updates** *Eleti, Atty, et al.* 2023.06 [[Blog]](https://openai.com/blog/function-calling-and-other-api-updates)\n  \n  **An LLM Compiler for Parallel Function Calling** *Kim, Sehoon, et al.* 2023.12 [[Paper]](https://arxiv.org/abs/2312.04511)\n\n### $\\S5.2$ Tools in programmatic contexts 👩‍💻\n\n- Domain-specific logical forms to query structured data\n  \n  **Semantic parsing on freebase from question-answer pairs** *Berant, Jonathan, et al.* 2013 [[Paper]](https://aclanthology.org/D13-1160/)\n  \n  **Spider: A large-scale human-labeled dataset for complex and cross-domain semantic parsing and text-to-sql task** *Yu, Tao, et al.* 2018.09 [[Paper]](https://aclanthology.org/D18-1425/)\n  \n  **Break It Down: A Question Understanding Benchmark** *Wolfson, Tomer, et al.* 2020 [[Paper]](https://aclanthology.org/2020.tacl-1.13/)\n\n- Domain-specific actions for agentic tasks such as web navigation\n  \n  **Reinforcement Learning on Web Interfaces using Workflow-Guided Exploration** *Liu, Evan Zheran, et al.* 2018.02 [[Paper]](https://openreview.net/forum?id=ryTp3f-0-)\n  \n  **WebShop: Towards Scalable Real-World Web Interaction with Grounded Language Agents** *Yao, Shunyu, et al.* 2022.07 [[Paper]](https://arxiv.org/abs/2207.01206)\n  \n  **Webarena: A realistic web environment for building autonomous agents** *Zhou, Shuyan, et al.* 2023.07 [[Paper]](https://arxiv.org/abs/2307.13854)\n\n- Using external Python libraries as tools\n  \n  **ToolCoder: Teach Code Generation Models to use API search tools** *Zhang, Kechi, et al.* 2023.05 [[Paper]](https://arxiv.org/abs/2305.04032)\n\n- Using expert designed functions as tools to answer questions about images\n  \n  **Visual Programming: Compositional visual reasoning without training** *Gupta, Tanmay, and Aniruddha Kembhavi.* 2023 
[[Paper]](https://openaccess.thecvf.com/content/CVPR2023/papers/Gupta_Visual_Programming_Compositional_Visual_Reasoning_Without_Training_CVPR_2023_paper.pdf)\n  \n  **Vipergpt: Visual inference via python execution for reasoning** *Surís, Dídac, Sachit Menon, and Carl Vondrick.* 2023 [[Paper]](https://openaccess.thecvf.com/content/ICCV2023/papers/Suris_ViperGPT_Visual_Inference_via_Python_Execution_for_Reasoning_ICCV_2023_paper.pdf)\n\n- Using GPT as a tool to query external Wikipedia knowledge for table-based question answering\n  \n  **Binding Language Models in Symbolic Languages** *Cheng, Zhoujun, et al.* 2022.10 [[Paper]](https://openreview.net/forum?id=lH1PV42cbF)\n\n- Incorporate a QA API and operation APIs to assist table-based question answering\n  \n  **API-Assisted Code Generation for Question Answering on Varied Table Structures** *Cao, Yihan, et al.* 2023.12 [[Paper]](https://aclanthology.org/2023.emnlp-main.897)\n\n### $\\S5.3$ Tool creation and reuse 👩‍🔬\n\n- Approaches to abstract libraries for domain-specific logical forms from a large corpus\n  \n  **DreamCoder: growing generalizable, interpretable knowledge with wake-sleep Bayesian program learning** *Ellis, Kevin, et al.* 2020.06 [[Paper]](https://arxiv.org/abs/2006.08381)\n  \n  **Leveraging Language to Learn Program Abstractions and Search Heuristics** *Wong, Catherine, et al.* 2021 [[Paper]](https://proceedings.mlr.press/v139/wong21a.html)\n  \n  **Top-Down Synthesis for Library Learning** *Bowers, Matthew, et al.* 2023 [[Paper]](https://doi.org/10.1145/3571234)\n  \n  **LILO: Learning Interpretable Libraries by Compressing and Documenting Code** *Grand, Gabriel, et al.* 2023.10 [[Paper]](https://openreview.net/forum?id=TqYbAWKMIe)\n\n- Make and learn skills (JavaScript programs) in the embodied Minecraft world\n  \n  **Voyager: An Open-Ended Embodied Agent with Large Language Models** *Wang, Guanzhi, et al.* 2023.05 [[Paper]](https://arxiv.org/abs/2305.16291)\n\n- Leverage LMs as tool makers on 
BigBench tasks\n  \n  **Large Language Models as Tool Makers** *Cai, Tianle, et al.* 2023.05 [[Preprint]](https://arxiv.org/pdf/2305.17126)\n\n- Create tools for math and table QA tasks by example-wise tool making\n  \n  **CREATOR: Disentangling Abstract and Concrete Reasonings of Large Language Models through Tool Creation** *Qian, Cheng, et al.* 2023.05 [[Paper]](https://arxiv.org/pdf/2305.14318)\n\n- Make tools via heuristic-based training and tool deduplication\n  \n  **CRAFT: Customizing LLMs by Creating and Retrieving from Specialized Toolsets** *Yuan, Lifan, et al.* 2023.09 [[Paper]](https://arxiv.org/abs/2309.17428)\n\n- Learning tools by refactoring a small number of programs\n  \n  **ReGAL: Refactoring Programs to Discover Generalizable Abstractions** *Stengel-Eskin, Elias, Archiki Prasad, and Mohit Bansal.* 2024.01 [[Preprint]](https://arxiv.org/abs/2401.16467)\n\n- A training-free approach to make tools via execution consistency\n  \n  🎁 **TroVE: Inducing Verifiable and Efficient Toolboxes for Solving Programmatic Tasks** *Wang, Zhiruo, Daniel Fried, and Graham Neubig.* 2024.01 [[Preprint]](https://arxiv.org/abs/2401.12869)\n\n## $\\S6$ Evaluation: Testbeds\n\n### $\\S6.1.1$ Repurposed existing datasets\n\n- Datasets that require reasoning over texts\n  \n  **Measuring Mathematical Problem Solving With the MATH Dataset** *Hendrycks, Dan, et al.* 2021.03 [[Paper]](https://arxiv.org/pdf/2103.03874)\n  \n  **Beyond the Imitation Game: Quantifying and extrapolating the capabilities of language models** *Srivastava, Aarohi, et al.* 2022.06 [[Paper]](https://openreview.net/forum?id=uyTL5Bvosj)\n\n- Datasets that require reasoning over structured data, e.g., tables\n  \n  **Dynamic Prompt Learning via Policy Gradient for Semi-structured Mathematical Reasoning** *Lu, Pan, et al.* 2022.09 [[Paper]](https://arxiv.org/pdf/2209.14610)\n  \n  **Compositional Semantic Parsing on Semi-Structured Tables** *Pasupat, Panupong, and Percy Liang.* 2015 
[[Paper]](https://aclanthology.org/P15-1142)\n  \n  **HiTab: A Hierarchical Table Dataset for Question Answering and Natural Language Generation** *Cheng, Zhoujun, et al.* 2022 [[Paper]](https://aclanthology.org/2022.acl-long.78/)\n\n- Datasets that require reasoning over other modalities, e.g., images and image pairs\n  \n  **Gqa: A new dataset for real-world visual reasoning and compositional question answering** *Hudson, Drew A., and Christopher D. Manning.* 2019.02 [[Paper]](https://arxiv.org/abs/1902.09506)\n  \n  **A Corpus for Reasoning about Natural Language Grounded in Photographs** *Suhr, Alane, et al.* 2019 [[Paper]](https://aclanthology.org/P19-1644)\n\n- Example datasets that require retriever model (tool) to solve\n  \n  **Natural Questions: A Benchmark for Question Answering Research** *Kwiatkowski, Tom, et al.* 2019 [[Paper]](https://aclanthology.org/Q19-1026)\n  \n  **TriviaQA: A Large Scale Distantly Supervised Challenge Dataset for Reading Comprehension** *Joshi, Mandar, et al.* 2017 [[Paper]](https://aclanthology.org/P17-1147)\n\n### $\\S6.1.2$ Aggregated API benchmarks\n\n- Collect RapidAPIs and use models to synthesize examples for evaluation\n  \n  **ToolLLM: Facilitating Large Language Models to Master 16000+ Real-world APIs** *Qin, Yujia, et al.* 2023.07 [[Paper]](https://openreview.net/forum?id=dHng2O0Jjr)\n\n- Collect APIs from PublicAPIs and use models to synthesize examples\n  \n  **ToolAlpaca: Generalized Tool Learning for Language Models with 3000 Simulated Cases** *Tang, Qiaoyu, et al.* 2023.06 [[Preprint]](https://arxiv.org/abs/2306.05301)\n\n- Collect APIs from PublicAPIs and manually annotate examples for evaluation\n  \n  **API-Bank: A Comprehensive Benchmark for Tool-Augmented LLMs** *Li, Minghao, et al.* 2023.12 [[Paper]](https://aclanthology.org/2023.emnlp-main.187/)\n\n- Collect APIs from OpenAI plugin list and use models to synthesize examples\n  \n  **MetaTool Benchmark for Large Language Models: Deciding Whether to Use 
Tools and Which to Use** *Huang, Yue, et al.* 2023.10 [[Paper]](https://openreview.net/forum?id=R0c2qtalgG&referrer=%5Bthe%20profile%20of%20Neil%20Zhenqiang%20Gong%5D(%2Fprofile%3Fid%3D~Neil_Zhenqiang_Gong1))\n\n- Collect neural model tools from Huggingface hub, TorchHub, and TensorHub\n  \n  **Gorilla: Large language model connected with massive apis** *Patil, Shishir G., et al.* 2023.05 [[Paper]](https://arxiv.org/abs/2305.15334)\n\n- Collect neural model tools from Huggingface\n  \n  **HuggingGPT: Solving AI Tasks with ChatGPT and its Friends in Hugging Face** *Shen, Yongliang, et al.* 2024 [[Paper]](https://openreview.net/forum?id=yHdTscY6Ci)\n\n- Collect tools from Huggingface and PublicAPIs\n  \n  **Taskbench: Benchmarking large language models for task automation** *Shen, Yongliang, et al.* 2023.11 [[Paper]](https://openreview.net/forum?id=70xhiS0AQS&referrer=%5Bthe%20profile%20of%20Xu%20Tan%5D(%2Fprofile%3Fid%3D~Xu_Tan1))\n\n- Collect Action Sequences in real-world macOS/iPadOS/iOS.\n  \n  **ShortcutsBench: A Large-Scale Real-World Benchmark for API-Based Agents** *Shen, Haiyang, et al.* 2024.07 [[Paper]](https://arxiv.org/abs/2407.00132)\n"
  }
]