Repository: YHPeter/Awesome-RAG-Evaluation
Branch: main
Commit: 6e3a7cc4301a
Files: 4
Total size: 65.1 KB
Directory structure:
gitextract_nd47phkb/
├── LICENSE
├── README.md
├── README_cn.md
└── benchmarks.bib
================================================
FILE CONTENTS
================================================
================================================
FILE: LICENSE
================================================
MIT License
Copyright (c) 2024 YHPeter
Permission is hereby granted, free of charge, to any person obtaining a copy
of this software and associated documentation files (the "Software"), to deal
in the Software without restriction, including without limitation the rights
to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
copies of the Software, and to permit persons to whom the Software is
furnished to do so, subject to the following conditions:
The above copyright notice and this permission notice shall be included in all
copies or substantial portions of the Software.
THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
SOFTWARE.
================================================
FILE: README.md
================================================
# Awesome RAG Evaluation
English | 简体中文
The official repository for the paper *Evaluation of Retrieval-Augmented Generation: A Survey* ([arXiv](https://arxiv.org/pdf/2405.07437)). The paper has been accepted at [2024 CCF Big Data](https://ccf.org.cn/BigData2024).
### Abstract
Retrieval-Augmented Generation (RAG) has recently gained traction in natural language processing. Numerous studies and real-world applications are leveraging its ability to enhance generative models through external information retrieval. Evaluating these RAG systems, however, poses unique challenges due to their hybrid structure and reliance on dynamic knowledge sources. To better understand these challenges, we conduct **A** **U**nified **E**valuation **P**rocess **o**f **RA**G (*Auepora*) and aim to provide a comprehensive overview of the evaluation and benchmarks of RAG systems. Specifically, we examine and compare several quantifiable metrics of the Retrieval and Generation components, such as relevance, accuracy, and faithfulness, within the current RAG benchmarks, encompassing the possible output and ground truth pairs. We then analyze the various datasets and metrics, discuss the limitations of current benchmarks, and suggest potential directions to advance the field of RAG benchmarks.
### Analysis Framework for Evaluating RAG Systems
* The **Target** module of *Auepora*. The retrieval and generation components are highlighted in red and green, respectively.
### Reference Framework
| Category | Framework | Webpage | Paper |
|---|---|---|---|
| Tool | TruEra RAG Triad | https://www.trulens.org/trulens_eval/getting_started/core_concepts/rag_triad | - |
| Tool | LangChain Bench. | https://langchain-ai.github.io/langchain-benchmarks/notebooks/retrieval/langchain_docs_qa.html | - |
| Tool | Databricks Eval | https://www.databricks.com/blog/LLM-auto-eval-best-practices-RAG | - |
| Tool | RAG Playground | - | https://arxiv.org/abs/2412.12322 |
| Benchmark | RAGAs | https://github.com/explodinggradients/ragas | https://aclanthology.org/2024.eacl-demo.16 |
| Benchmark | RECALL | - | https://arxiv.org/abs/2311.08147 |
| Benchmark | ARES | https://github.com/stanford-futuredata/ARES | https://aclanthology.org/2024.naacl-long.20 |
| Benchmark | RGB | https://github.com/chen700564/RGB | https://dl.acm.org/doi/10.1609/aaai.v38i16.29728 |
| Benchmark | MultiHop-RAG | https://github.com/yixuantt/MultiHop-RAG | https://openreview.net/forum?id=t4eB3zYWBK#discussion |
| Benchmark | CRUD-RAG | https://github.com/IAAR-Shanghai/CRUD_RAG | https://dl.acm.org/doi/10.1145/3701228 |
| Benchmark | MedRAGBench | https://github.com/Teddy-XiongGZ/MedRAG | https://aclanthology.org/2024.findings-acl.372 |
| Benchmark | FeB4RAG | https://github.com/ielab/FeB4RAG | https://dl.acm.org/doi/10.1145/3626772.3657853 |
| Benchmark | CDQA | https://github.com/Alibaba-NLP/CDQA | https://aclanthology.org/2025.coling-main.695 |
| Benchmark | DomainRAG | https://github.com/ShootingWong/DomainRAG | https://arxiv.org/abs/2406.05654v2 |
| Benchmark | ReEval | https://autodebug-llm.github.io | https://aclanthology.org/2024.findings-naacl.85 |
| Benchmark | RAGBench | https://huggingface.co/datasets/rungalileo/ragbench | https://arxiv.org/abs/2407.11005 |
| Benchmark | OmniEval | https://github.com/RUC-NLPIR/OmniEval | https://arxiv.org/abs/2412.13018 |
| Benchmark | MTRAG | https://github.com/ibm/mt-rag-benchmark | https://arxiv.org/abs/2501.03468 |
| Benchmark | LegalBench-RAG | https://github.com/zeroentropy-ai/legalbenchrag | https://arxiv.org/abs/2408.10343 |
| Benchmark | eRAG | https://github.com/alirezasalemi7/eRAG | https://dl.acm.org/doi/10.1145/3626772.3657957 |
| Benchmark | CoFE-RAG | - | https://arxiv.org/abs/2410.12248 |
| Benchmark | U-NIAH | https://github.com/Tongji-KGLLM/U-NIAH | https://arxiv.org/abs/2503.00353 |
| Benchmark | CoURAGE | - | https://link.springer.com/chapter/10.1007/978-3-031-70242-6_37 |
| Benchmark | RAGEval | https://github.com/OpenBMB/RAGEval | https://arxiv.org/abs/2408.01262 |
| Benchmark | OCRRAG | https://github.com/opendatalab/OHR-Bench | https://arxiv.org/abs/2412.02592 |
| Benchmark | ArabicRAGEval | - | https://arxiv.org/abs/2403.18350 |
| Benchmark | FairnessRAG | - | https://aclanthology.org/2025.coling-main.669 |
| Benchmark | TelecomRAGEval | - | https://arxiv.org/abs/2407.12873 |
| Benchmark | CRAG | https://github.com/facebookresearch/CRAG | https://proceedings.neurips.cc/paper_files/paper/2024/hash/1435d2d0fca85a84d83ddcb754f58c29-Abstract-Datasets_and_Benchmarks_Track.html |
| Benchmark | FreshLLMs | https://github.com/freshllms/freshqa | https://aclanthology.org/2024.findings-acl.813 |
| Benchmark | InstructRAG | https://followrag.github.io | https://arxiv.org/abs/2410.09584 |
| Benchmark | SCARF | https://github.com/Eustema-S-p-A/SCARF | https://arxiv.org/pdf/2504.07803 |
### Citation
If you find this paper or repository helpful, please consider citing our work:
```
@InProceedings{Yu2025,
author = {Yu, Hao and Gan, Aoran and Zhang, Kai and Tong, Shiwei and Liu, Qi and Liu, Zhaofeng},
booktitle = {Big Data},
title = {Evaluation of Retrieval-Augmented Generation: A Survey},
year = {2025},
address = {Singapore},
editor = {Zhu, Wenwu and Xiong, Hui and Cheng, Xiuzhen and Cui, Lizhen and Dou, Zhicheng and Dong, Junyu and Pang, Shanchen and Wang, Li and Kong, Lanju and Chen, Zhenxiang},
pages = {102--120},
publisher = {Springer Nature Singapore},
isbn = {978-981-96-1024-2},
}
@misc{gan2025retrievalaugmentedgenerationevaluation,
title={Retrieval Augmented Generation Evaluation in the Era of Large Language Models: A Comprehensive Survey},
author={Aoran Gan and Hao Yu and Kai Zhang and Qi Liu and Wenyu Yan and Zhenya Huang and Shiwei Tong and Guoping Hu},
year={2025},
eprint={2504.14891},
archivePrefix={arXiv},
primaryClass={cs.CL},
url={https://arxiv.org/abs/2504.14891},
}
```
Citation for benchmarks: [benchmarks.bib](benchmarks.bib)
### Call for Contributions
We welcome contributions to this repository, including new benchmarks, datasets, and evaluation metrics. If you have any suggestions or would like to collaborate, please open an issue or pull request.
### Changelog
- 2024-05-11: Initial release of the paper and repository.
- 2024-06-25: Paper accepted at 2024 CCF Big Data.
- 2024-06-30: Add two benchmarks: DomainRAG and ReEval.
- 2024-07-03: Update arXiv version to v2.
- 2024-07-16: Add multiple new benchmarks and research papers to the reference table. Update existing paper links.
- 2025-04-21: Add new benchmarks for RAG systems.
================================================
FILE: README_cn.md
================================================
# Evaluation of Retrieval-Augmented Generation: A Survey
English | 简体中文
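Paper: [arXiv](https://arxiv.org/pdf/2405.07437)
This paper has been accepted at [2024 CCF Big Data](https://ccf.org.cn/BigData2024).
Related papers and code are collected in: [Awesome RAG Evaluation](https://github.com/YHPeter/Awesome-RAG-Evaluation)
## Abstract
Evaluating Retrieval-Augmented Generation (RAG) systems remains challenging because of their complex structure and their dependence on a retrieval knowledge base. To better understand these challenges and to standardize RAG evaluation, we propose A Unified Evaluation Process of RAG (Auepora). We organize the possible inputs and outputs and summarize the evaluation targets used in existing RAG benchmarks, such as relevance, correctness, and faithfulness. We also analyze the various datasets and quantitative metrics. Finally, based on the limitations of current benchmarks, we point out potential directions for the field of RAG benchmarking, as a reference for future RAG evaluation standards.
## Introduction
RAG enhances generative large language models (LLMs) by integrating retrieved information, and it mitigates hallucinations and factually incorrect answers, which improves the reliability and richness of the generated content. A RAG system consists of two main stages: retrieval and generation. Retrieval involves two parts, building the document index and executing document retrieval, and it is often combined with fine-ranking or re-ranking to further adjust the order of the retrieved documents. Precision, recall, and F1 score are the main evaluation metrics, but these traditional metrics do not capture the effectiveness and diversity of the retrieval results, and they do not fit well with the accuracy and comprehensiveness that the downstream generation stage requires. Complex information sources and diverse retrieval strategies pose a further challenge. In the generation stage, prompting methods such as Chain of Thought (CoT), Tree of Thoughts, and Rephrase and Respond (RaR) can be introduced when writing the prompt to help the model produce higher-quality answers. Finally, the LLM combines the user question, the retrieved documents, and the improved prompt to generate the final response. However, ensuring the truthfulness, comprehensiveness, and accuracy of the generated content, as well as the generator's robustness to interference, remains difficult. For creative tasks and logical reasoning tasks in particular, common generation metrics such as BLEU, ROUGE, and F1 score still cannot fully replace human judges for a comprehensive evaluation. In addition, the interplay between the retrieval and generation components must be considered, for example how well the generation stage uses the retrieved information, response latency, resistance to misleading content, robustness across scenarios, and the ability to handle complex multi-document information.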
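## *Auepora* (**A** **U**nified **E**valuation **P**rocess **o**f **RA**G)
To address these challenges and systematically compare the retrieval and generation quality of RAG systems, this paper proposes **A** **U**nified **E**valuation **P**rocess **o**f **RA**G (*Auepora*) and uses it to organize recent RAG evaluation frameworks. *Auepora* considers three dimensions: the evaluation target (Target), the evaluation dataset (Dataset), and the quantitative metric (Metric), which answer three key questions of RAG evaluation: what to evaluate, how to evaluate, and how to measure. Based on *Auepora*, this paper (1) focuses on evaluation targets, datasets, and metrics, offering a higher-level view of RAG system evaluation, and (2) comprehensively analyzes existing RAG benchmarks, summarizes their strengths and limitations, and offers suggestions for the future development of RAG system evaluation.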
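### Evaluation Target (What to Evaluate?)
In the Target stage, we cover the evaluation targets as completely as possible, based on the "evaluable outputs" (EOs) produced while a RAG system runs and the corresponding "ground truths" (GTs) (see the figure above).
**Retrieval**: mainly covers the relationship between the Relevant Docs and the Query, and the relationship between the Relevant Docs and the Docs Candidates. The former ensures that the retrieved documents are relevant to the topic of the query; the latter ensures that the retrieved documents are ranked reasonably within the candidate set.
**Generation**: includes the consistency between the Response and the Query, the faithfulness of the Response to the Relevant Docs, and the correctness of the Response with respect to the expected output (Sample Response). These measures help assess whether the generated content is relevant to the query, whether it is faithful to the retrieved documents, and whether it answers the question correctly.
In addition, some **additional requirements** should be considered, such as latency, diversity, noise robustness, negative rejection, and adversarial robustness, to ensure that RAG systems are practically applicable in real-world scenarios and aligned with human preferences. Metrics for these additional requirements help assess how a RAG system performs in real applications.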
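### Evaluation Dataset (How to Evaluate?)
The evaluation frameworks in Table 2 mainly use two strategies to build evaluation datasets: some reuse existing datasets, while the rest generate new datasets for specific evaluation targets. Several frameworks borrow parts of the KILT (Knowledge Intensive Language Tasks) benchmark, such as NQ, HotpotQA, and FEVER, as well as other established datasets such as SuperGLUE (MultiRC and ReCoRD). However, these datasets cannot address the challenges of more realistic scenarios in which information changes in real time. RAGAs notes a similar issue and tries to address it by constructing the newer WikiEval dataset, built from Wikipedia pages dated after 2022, to evaluate the robustness of RAG systems on more recent data.
At the same time, the arrival of LLMs has fundamentally changed dataset construction. Researchers can now design data for a specific evaluation target: a strong LLM generates question-answer pairs based on its understanding of the source material, and these pairs are then used for evaluation, which makes it easy to create datasets at the required scale. RGB, MultiHop-RAG, CRUD-RAG, and CDQA all take this approach, building their test datasets from LLMs and real-time news. Beyond news, DomainRAG collects data from university admissions websites, whose information changes every year, and defines several types of QA datasets, including single-document, multi-document, single-turn, and multi-turn dialogue, to test the robustness of RAG systems.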
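#### Evaluation Metric (How to Measure?)
Quantifying the evaluation targets precisely, using metrics and the corresponding test data, is the final step of RAG evaluation. However, metrics that match human preferences are hard to design; the only option is to approach the problem from multiple angles to cover as many scenarios as possible, which also leads to a large and varied set of metrics.
##### Retrieval Metrics
In retrieval evaluation, the key is to choose metrics that accurately reflect relevance, accuracy, and diversity. These metrics should not only capture how precisely the system retrieves relevant information but also demonstrate its robustness over dynamic knowledge sources.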
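As a concrete illustration of rank-based retrieval metrics, the sketch below computes precision@k, recall@k, and reciprocal rank for a single query. It is a minimal example under stated assumptions, not the implementation used by any benchmark listed in this repository: the function names and the document IDs are hypothetical, and relevance is treated as binary.

```python
from typing import Sequence, Set


def precision_at_k(ranked: Sequence[str], relevant: Set[str], k: int) -> float:
    """Fraction of the top-k retrieved documents that are relevant."""
    if k <= 0:
        raise ValueError("k must be positive")
    hits = sum(1 for doc_id in ranked[:k] if doc_id in relevant)
    return hits / k


def recall_at_k(ranked: Sequence[str], relevant: Set[str], k: int) -> float:
    """Fraction of all relevant documents that appear in the top-k."""
    if not relevant:
        return 0.0
    hits = sum(1 for doc_id in ranked[:k] if doc_id in relevant)
    return hits / len(relevant)


def reciprocal_rank(ranked: Sequence[str], relevant: Set[str]) -> float:
    """1 / rank of the first relevant document, or 0.0 if none is retrieved."""
    for rank, doc_id in enumerate(ranked, start=1):
        if doc_id in relevant:
            return 1.0 / rank
    return 0.0


# Hypothetical retrieval result for one query.
ranked_docs = ["d3", "d1", "d7", "d2"]
relevant_docs = {"d1", "d2"}
print(precision_at_k(ranked_docs, relevant_docs, k=2))  # 0.5
print(recall_at_k(ranked_docs, relevant_docs, k=2))     # 0.5
print(reciprocal_rank(ranked_docs, relevant_docs))      # 0.5
```

Averaging the reciprocal rank over many queries gives MRR. Set-based metrics such as precision and recall ignore ordering inside the top-k, which is why ranking-aware metrics are usually reported alongside them.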
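##### Generation Metrics
Traditional metrics such as BLEU, ROUGE, and F1 still play a key role, emphasizing the importance of precision and recall for output quality. For text quality, attention should go to coherence, relevance, fluency, and agreement with human judgment. This calls for metrics that can assess the finer points of language generation, such as factual correctness, readability, and user satisfaction with the generated content. In addition, with the emergence of metrics such as misleading rate, mistake reappearance rate, and error detection rate, evaluation has become more comprehensive and better reflects the quality of the generated content.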
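To make the role of precision and recall in an overlap-based generation metric concrete, here is a minimal sketch of a token-level F1 score between a generated response and a reference answer. It assumes lowercasing and whitespace tokenization, which is a simplification; published benchmarks typically apply their own normalization, and the function name `token_f1` is hypothetical.

```python
from collections import Counter


def token_f1(prediction: str, reference: str) -> float:
    """Harmonic mean of token precision and recall (bag-of-words overlap)."""
    pred_tokens = prediction.lower().split()
    ref_tokens = reference.lower().split()
    if not pred_tokens or not ref_tokens:
        # Both empty counts as a match; exactly one empty counts as a miss.
        return float(pred_tokens == ref_tokens)
    # Multiset intersection: each token is matched at most as often as it
    # occurs in both strings.
    overlap = sum((Counter(pred_tokens) & Counter(ref_tokens)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_tokens)
    recall = overlap / len(ref_tokens)
    return 2 * precision * recall / (precision + recall)


print(token_f1("Paris is the capital", "the capital is Paris"))  # 1.0
```

The example also shows a limitation of such metrics: word order is ignored, so a scrambled response can still score 1.0. This is one reason overlap metrics are complemented by faithfulness and correctness judgments.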
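##### Additional Requirements
Metrics such as latency, diversity, noise robustness, negative rejection, and adversarial robustness are used to ensure that RAG systems are practically applicable in real-world scenarios and aligned with human preferences. The paper examines the metrics used to evaluate these additional requirements and how to combine them with traditional metrics for a more complete assessment of RAG system performance.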
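Two of these requirements, negative rejection and latency, can be measured with a very small harness. The sketch below is illustrative only: `answer_fn` stands in for any RAG pipeline, the refusal marker string is a hypothetical convention rather than one prescribed by the survey, and refusals are detected by simple substring matching, which a real evaluation might replace with a stricter check.

```python
import time
from typing import Callable, Iterable, Tuple

# Hypothetical phrase the system is instructed to emit when it cannot answer.
REFUSAL_MARKER = "insufficient information"


def rejection_rate_and_latency(
    answer_fn: Callable[[str], str],
    unanswerable_queries: Iterable[str],
) -> Tuple[float, float]:
    """Return (share of refused queries, mean seconds per query).

    Every query passed in is assumed to have no supporting evidence,
    so the ideal rejection rate is 1.0.
    """
    refusals = 0
    total = 0
    elapsed = 0.0
    for query in unanswerable_queries:
        start = time.perf_counter()
        response = answer_fn(query)
        elapsed += time.perf_counter() - start
        total += 1
        if REFUSAL_MARKER in response.lower():
            refusals += 1
    if total == 0:
        return 0.0, 0.0
    return refusals / total, elapsed / total


def toy_rag(query: str) -> str:
    """Stand-in pipeline: refuses only when the query mentions 'unknown'."""
    if "unknown" in query:
        return "Insufficient information in the retrieved documents."
    return "A confident but unsupported answer."


rate, mean_latency = rejection_rate_and_latency(
    toy_rag, ["unknown event?", "who won the 2090 cup?"]
)
print(rate)  # 0.5
```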
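### Discussion and Outlook
For current systems that combine LLMs with RAG, traditional QA datasets and metrics remain the most commonly used evaluation resources and methods. However, because LLMs already perform strongly on traditional QA datasets, these datasets no longer meet the needs of RAG evaluation. Evaluating the performance of a full RAG system comprehensively requires diverse, RAG-specific benchmarks. A general evaluation framework such as *Auepora* helps clarify the targets of RAG evaluation first and then work step by step through the data resources and metrics that the evaluation requires.
Creating traditional datasets is complex, time-consuming, and labor-intensive, and such datasets have limitations: they cannot cover diverse, rapidly evolving evaluation targets. For finer-grained evaluation, custom datasets have become a necessity. Dataset diversity, from yearly admissions information to real-time news articles, is also an important part of evaluation. But building these datasets requires a continuing investment of people and resources, which makes automated evaluation hard to achieve. The solution again lies in the capabilities of LLMs: they can process messy data and generate question-answer pairs automatically, so datasets can be rebuilt daily or at an even finer time granularity to evaluate the robustness and performance of RAG systems on data that changes in real time.
On the metric side, using LLMs to score answers has become a trend. Compared with traditional metrics, LLMs better reflect how humans judge generated content; compared with human scoring, they are easier to customize and automate. However, scoring outputs with LLMs also has challenges. Different people and different models understand correctness, clarity, and richness differently, and different prompts may yield different scores. An open problem is how to establish a unified scoring scheme and prompt-writing practice that balances alignment with human judgment against fair scoring criteria.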
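One simple way to expose the prompt sensitivity described above is to score the same answer under several prompt wordings and report both the mean and the spread. The sketch below assumes a `judge` callable that maps a prompt string to a numeric score; it is a placeholder for whatever LLM client is in use and does not correspond to any real library API. The two prompt templates are hypothetical and are not taken from the survey.

```python
from statistics import mean, pstdev
from typing import Callable, Dict, Sequence

# Hypothetical prompt wordings for the same 1-5 correctness judgment.
PROMPT_TEMPLATES = (
    "Rate the correctness of the answer from 1 to 5.\n"
    "Question: {question}\nAnswer: {answer}",
    "On a 1-5 scale, how correct is this answer?\n"
    "Q: {question}\nA: {answer}",
)


def judge_with_prompt_variants(
    judge: Callable[[str], float],
    question: str,
    answer: str,
    templates: Sequence[str] = PROMPT_TEMPLATES,
) -> Dict[str, float]:
    """Score one answer under every template; return mean and spread."""
    scores = [
        judge(template.format(question=question, answer=answer))
        for template in templates
    ]
    return {"mean": mean(scores), "spread": pstdev(scores)}


def toy_judge(prompt: str) -> float:
    """Stand-in judge whose score depends on the prompt wording."""
    return 4.0 if prompt.startswith("Rate") else 2.0


print(judge_with_prompt_variants(toy_judge, "Capital of France?", "Paris"))
# {'mean': 3.0, 'spread': 1.0}
```

A large spread relative to the scoring scale suggests that the judgment depends on the prompt wording more than on the answer, which is a signal to revisit the rubric before trusting the scores.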
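Beyond these challenges, the substantial resources consumed by using LLMs for data generation and validation must also be considered. RAG benchmarks must balance comprehensive evaluation against limited resources. How to achieve the best evaluation with as little data and computation as possible is therefore an important direction for future research.
## Conclusion
This paper explores the complexity and challenges of evaluating RAG systems and proposes *Auepora*, a method for analyzing the evaluation of the full RAG pipeline, focusing on evaluation targets, evaluation datasets, and quantitative metrics. We hope it gives researchers a perspective for better understanding RAG evaluation systems and helps drive the construction of more complete RAG benchmarks.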
```
@InProceedings{Yu2025,
author = {Yu, Hao and Gan, Aoran and Zhang, Kai and Tong, Shiwei and Liu, Qi and Liu, Zhaofeng},
booktitle = {Big Data},
title = {Evaluation of Retrieval-Augmented Generation: A Survey},
year = {2025},
address = {Singapore},
editor = {Zhu, Wenwu and Xiong, Hui and Cheng, Xiuzhen and Cui, Lizhen and Dou, Zhicheng and Dong, Junyu and Pang, Shanchen and Wang, Li and Kong, Lanju and Chen, Zhenxiang},
pages = {102--120},
publisher = {Springer Nature Singapore},
isbn = {978-981-96-1024-2},
}
@misc{gan2025retrievalaugmentedgenerationevaluation,
title={Retrieval Augmented Generation Evaluation in the Era of Large Language Models: A Comprehensive Survey},
author={Aoran Gan and Hao Yu and Kai Zhang and Qi Liu and Wenyu Yan and Zhenya Huang and Shiwei Tong and Guoping Hu},
year={2025},
eprint={2504.14891},
archivePrefix={arXiv},
primaryClass={cs.CL},
url={https://arxiv.org/abs/2504.14891},
}
```
================================================
FILE: benchmarks.bib
================================================
@Article{RAGAS,
author = {Es, Shahul and James, Jithin and Espinosa-Anke, Luis and Schockaert, Steven},
title = {RAGAS: Automated Evaluation of Retrieval Augmented Generation},
year = {2023},
month = sep,
abstract = {We introduce RAGAs (Retrieval Augmented Generation Assessment), a framework for reference-free evaluation of Retrieval Augmented Generation (RAG) pipelines. RAG systems are composed of a retrieval and an LLM based generation module, and provide LLMs with knowledge from a reference textual database, which enables them to act as a natural language layer between a user and textual databases, reducing the risk of hallucinations. Evaluating RAG architectures is, however, challenging because there are several dimensions to consider: the ability of the retrieval system to identify relevant and focused context passages, the ability of the LLM to exploit such passages in a faithful way, or the quality of the generation itself. With RAGAs, we put forward a suite of metrics which can be used to evaluate these different dimensions \textit{without having to rely on ground truth human annotations}. We posit that such a framework can crucially contribute to faster evaluation cycles of RAG architectures, which is especially important given the fast adoption of LLMs.},
archiveprefix = {arXiv},
copyright = {Creative Commons Attribution 4.0 International},
doi = {10.48550/ARXIV.2309.15217},
eprint = {2309.15217},
file = {:http\://arxiv.org/pdf/2309.15217v1:PDF},
keywords = {Computation and Language (cs.CL), FOS: Computer and information sciences},
primaryclass = {cs.CL},
priority = {prio1},
publisher = {arXiv},
ranking = {rank5},
}
@Article{RECALL,
author = {Liu, Yi and Huang, Lianzhe and Li, Shicheng and Chen, Sishuo and Zhou, Hao and Meng, Fandong and Zhou, Jie and Sun, Xu},
title = {RECALL: A Benchmark for LLMs Robustness against External Counterfactual Knowledge},
year = {2023},
month = nov,
abstract = {LLMs and AI chatbots have improved people's efficiency in various fields. However, the necessary knowledge for answering the question may be beyond the models' knowledge boundaries. To mitigate this issue, many researchers try to introduce external knowledge, such as knowledge graphs and Internet contents, into LLMs for up-to-date information. However, the external information from the Internet may include counterfactual information that will confuse the model and lead to an incorrect response. Thus there is a pressing need for LLMs to possess the ability to distinguish reliable information from external knowledge. Therefore, to evaluate the ability of LLMs to discern the reliability of external knowledge, we create a benchmark from existing knowledge bases. Our benchmark consists of two tasks, Question Answering and Text Generation, and for each task, we provide models with a context containing counterfactual information. Evaluation results show that existing LLMs are susceptible to interference from unreliable external knowledge with counterfactual information, and simple intervention methods make limited contributions to the alleviation of this issue.},
archiveprefix = {arXiv},
copyright = {Creative Commons Attribution Non Commercial Share Alike 4.0 International},
doi = {10.48550/ARXIV.2311.08147},
eprint = {2311.08147},
file = {:http\://arxiv.org/pdf/2311.08147v1:PDF},
keywords = {Computation and Language (cs.CL), Artificial Intelligence (cs.AI), FOS: Computer and information sciences},
primaryclass = {cs.CL},
publisher = {arXiv},
}
@InProceedings{MultiHop-RAG,
author = {Yixuan Tang and Yi Yang},
booktitle = {First Conference on Language Modeling},
title = {MultiHop-{RAG}: Benchmarking Retrieval-Augmented Generation for Multi-Hop Queries},
year = {2024},
url = {https://openreview.net/forum?id=t4eB3zYWBK},
}
@InProceedings{Wang2024d,
author = {Wang, Shuai and Khramtsova, Ekaterina and Zhuang, Shengyao and Zuccon, Guido},
booktitle = {Proceedings of the 47th International ACM SIGIR Conference on Research and Development in Information Retrieval},
title = {FeB4RAG: Evaluating Federated Search in the Context of Retrieval Augmented Generation},
year = {2024},
month = jul,
pages = {763--773},
publisher = {ACM},
series = {SIGIR 2024},
collection = {SIGIR 2024},
doi = {10.1145/3626772.3657853},
}
@InProceedings{MedRAG,
author = {Xiong, Guangzhi and Jin, Qiao and Lu, Zhiyong and Zhang, Aidong},
booktitle = {Findings of the Association for Computational Linguistics: ACL 2024},
title = {Benchmarking Retrieval-Augmented Generation for Medicine},
year = {2024},
address = {Bangkok, Thailand},
editor = {Ku, Lun-Wei and Martins, Andre and Srikumar, Vivek},
month = aug,
pages = {6233--6251},
publisher = {Association for Computational Linguistics},
abstract = {While large language models (LLMs) have achieved state-of-the-art performance on a wide range of medical question answering (QA) tasks, they still face challenges with hallucinations and outdated knowledge. Retrieval-augmented generation (RAG) is a promising solution and has been widely adopted. However, a RAG system can involve multiple flexible components, and there is a lack of best practices regarding the optimal RAG setting for various medical purposes. To systematically evaluate such systems, we propose the Medical Information Retrieval-Augmented Generation Evaluation (MIRAGE), a first-of-its-kind benchmark including 7,663 questions from five medical QA datasets. Using MIRAGE, we conducted large-scale experiments with over 1.8 trillion prompt tokens on 41 combinations of different corpora, retrievers, and backbone LLMs through the MedRAG toolkit introduced in this work. Overall, MedRAG improves the accuracy of six different LLMs by up to 18{\%} over chain-of-thought prompting, elevating the performance of GPT-3.5 and Mixtral to GPT-4-level. Our results show that the combination of various medical corpora and retrievers achieves the best performance. In addition, we discovered a log-linear scaling property and the {\textquotedblleft}lost-in-the-middle{\textquotedblright} effects in medical RAG. We believe our comprehensive evaluations can serve as practical guidelines for implementing RAG systems for medicine.},
doi = {10.18653/v1/2024.findings-acl.372},
url = {https://aclanthology.org/2024.findings-acl.372/},
}
@InProceedings{ARES,
author = {Saad-Falcon, Jon and Khattab, Omar and Potts, Christopher and Zaharia, Matei},
booktitle = {Proceedings of the 2024 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies (Volume 1: Long Papers)},
title = {{ARES}: An Automated Evaluation Framework for Retrieval-Augmented Generation Systems},
year = {2024},
address = {Mexico City, Mexico},
editor = {Duh, Kevin and Gomez, Helena and Bethard, Steven},
month = jun,
pages = {338--354},
publisher = {Association for Computational Linguistics},
abstract = {Evaluating retrieval-augmented generation (RAG) systems traditionally relies on hand annotations for input queries, passages to retrieve, and responses to generate. We introduce ARES, an Automated RAG Evaluation System, for evaluating RAG systems along the dimensions of context relevance, answer faithfulness, and answer relevance. By creating its own synthetic training data, ARES finetunes lightweight LM judges to assess the quality of individual RAG components. To mitigate potential prediction errors, ARES utilizes a small set of human-annotated datapoints for prediction-powered inference (PPI). Across eight different knowledge-intensive tasks in KILT, SuperGLUE, and AIS, ARES accurately evaluates RAG systems while using only a few hundred human annotations during evaluation. Furthermore, ARES judges remain effective across domain shifts, proving accurate even after changing the type of queries and/or documents used in the evaluated RAG systems. We make our code and datasets publicly available on Github.},
doi = {10.18653/v1/2024.naacl-long.20},
url = {https://aclanthology.org/2024.naacl-long.20/},
}
@InProceedings{RGB,
author = {Chen, Jiawei and Lin, Hongyu and Han, Xianpei and Sun, Le},
booktitle = {Proceedings of the Thirty-Eighth AAAI Conference on Artificial Intelligence and Thirty-Sixth Conference on Innovative Applications of Artificial Intelligence and Fourteenth Symposium on Educational Advances in Artificial Intelligence},
title = {Benchmarking large language models in retrieval-augmented generation},
year = {2024},
publisher = {AAAI Press},
series = {AAAI'24/IAAI'24/EAAI'24},
abstract = {Retrieval-Augmented Generation (RAG) is a promising approach for mitigating the hallucination of large language models (LLMs). However, existing research lacks rigorous evaluation of the impact of retrieval-augmented generation on different large language models, which make it challenging to identify the potential bottlenecks in the capabilities of RAG for different LLMs. In this paper, we systematically investigate the impact of Retrieval-Augmented Generation on large language models. We analyze the performance of different large language models in 4 fundamental abilities required for RAG, including noise robustness, negative rejection, information integration, and counterfactual robustness. To this end, we establish Retrieval-Augmented Generation Benchmark (RGB), a new corpus for RAG evaluation in both English and Chinese. RGB divides the instances within the benchmark into 4 separate testbeds based on the aforementioned fundamental abilities required to resolve the case. Then we evaluate 6 representative LLMs on RGB to diagnose the challenges of current LLMs when applying RAG. Evaluation reveals that while LLMs exhibit a certain degree of noise robustness, they still struggle significantly in terms of negative rejection, information integration, and dealing with false information. The aforementioned assessment outcomes indicate that there is still a considerable journey ahead to effectively apply RAG to LLMs.},
articleno = {1980},
doi = {10.1609/aaai.v38i16.29728},
isbn = {978-1-57735-887-9},
numpages = {9},
url = {https://doi.org/10.1609/aaai.v38i16.29728},
}
@Article{RAGBench,
author = {Friel, Robert and Belyi, Masha and Sanyal, Atindriyo},
title = {RAGBench: Explainable Benchmark for Retrieval-Augmented Generation Systems},
year = {2024},
month = jun,
abstract = {Retrieval-Augmented Generation (RAG) has become a standard architectural pattern for incorporating domain-specific knowledge into user-facing chat applications powered by Large Language Models (LLMs). RAG systems are characterized by (1) a document retriever that queries a domain-specific corpus for context information relevant to an input query, and (2) an LLM that generates a response based on the provided query and context. However, comprehensive evaluation of RAG systems remains a challenge due to the lack of unified evaluation criteria and annotated datasets. In response, we introduce RAGBench: the first comprehensive, large-scale RAG benchmark dataset of 100k examples. It covers five unique industry-specific domains and various RAG task types. RAGBench examples are sourced from industry corpora such as user manuals, making it particularly relevant for industry applications. Further, we formalize the TRACe evaluation framework: a set of explainable and actionable RAG evaluation metrics applicable across all RAG domains. We release the labeled dataset at https://huggingface.co/datasets/rungalileo/ragbench. RAGBench explainable labels facilitate holistic evaluation of RAG systems, enabling actionable feedback for continuous improvement of production applications. Through extensive benchmarking, we find that LLM-based RAG evaluation methods struggle to compete with a finetuned RoBERTa model on the RAG evaluation task. We identify areas where existing approaches fall short and propose the adoption of RAGBench with TRACe towards advancing the state of RAG evaluation systems.},
archiveprefix = {arXiv},
copyright = {Creative Commons Attribution 4.0 International},
doi = {10.48550/ARXIV.2407.11005},
eprint = {2407.11005},
file = {:http\://arxiv.org/pdf/2407.11005v2:PDF},
keywords = {Computation and Language (cs.CL), Artificial Intelligence (cs.AI), FOS: Computer and information sciences},
primaryclass = {cs.CL},
publisher = {arXiv},
}
@Article{OmniEval,
author = {Wang, Shuting and Tan, Jiejun and Dou, Zhicheng and Wen, Ji-Rong},
title = {OmniEval: An Omnidirectional and Automatic RAG Evaluation Benchmark in Financial Domain},
year = {2024},
month = dec,
abstract = {As a typical and practical application of Large Language Models (LLMs), Retrieval-Augmented Generation (RAG) techniques have gained extensive attention, particularly in vertical domains where LLMs may lack domain-specific knowledge. In this paper, we introduce an omnidirectional and automatic RAG benchmark, OmniEval, in the financial domain. Our benchmark is characterized by its multi-dimensional evaluation framework, including (1) a matrix-based RAG scenario evaluation system that categorizes queries into five task classes and 16 financial topics, leading to a structured assessment of diverse query scenarios; (2) a multi-dimensional evaluation data generation approach, which combines GPT-4-based automatic generation and human annotation, achieving an 87.47\% acceptance ratio in human evaluations on generated instances; (3) a multi-stage evaluation system that evaluates both retrieval and generation performance, result in a comprehensive evaluation on the RAG pipeline; and (4) robust evaluation metrics derived from rule-based and LLM-based ones, enhancing the reliability of assessments through manual annotations and supervised fine-tuning of an LLM evaluator. Our experiments demonstrate the comprehensiveness of OmniEval, which includes extensive test datasets and highlights the performance variations of RAG systems across diverse topics and tasks, revealing significant opportunities for RAG models to improve their capabilities in vertical domains. We open source the code of our benchmark in \href{https://github.com/RUC-NLPIR/OmniEval}{https://github.com/RUC-NLPIR/OmniEval}.},
archiveprefix = {arXiv},
copyright = {Creative Commons Attribution Non Commercial Share Alike 4.0 International},
doi = {10.48550/ARXIV.2412.13018},
eprint = {2412.13018},
file = {:http\://arxiv.org/pdf/2412.13018v2:PDF},
keywords = {Computation and Language (cs.CL), FOS: Computer and information sciences},
primaryclass = {cs.CL},
publisher = {arXiv},
}
@Article{LegalBench-RAG,
author = {Pipitone, Nicholas and Alami, Ghita Houir},
title = {LegalBench-RAG: A Benchmark for Retrieval-Augmented Generation in the Legal Domain},
year = {2024},
month = aug,
abstract = {Retrieval-Augmented Generation (RAG) systems are showing promising potential, and are becoming increasingly relevant in AI-powered legal applications. Existing benchmarks, such as LegalBench, assess the generative capabilities of Large Language Models (LLMs) in the legal domain, but there is a critical gap in evaluating the retrieval component of RAG systems. To address this, we introduce LegalBench-RAG, the first benchmark specifically designed to evaluate the retrieval step of RAG pipelines within the legal space. LegalBench-RAG emphasizes precise retrieval by focusing on extracting minimal, highly relevant text segments from legal documents. These highly relevant snippets are preferred over retrieving document IDs, or large sequences of imprecise chunks, both of which can exceed context window limitations. Long context windows cost more to process, induce higher latency, and lead LLMs to forget or hallucinate information. Additionally, precise results allow LLMs to generate citations for the end user. The LegalBench-RAG benchmark is constructed by retracing the context used in LegalBench queries back to their original locations within the legal corpus, resulting in a dataset of 6,858 query-answer pairs over a corpus of over 79M characters, entirely human-annotated by legal experts. We also introduce LegalBench-RAG-mini, a lightweight version for rapid iteration and experimentation. By providing a dedicated benchmark for legal retrieval, LegalBench-RAG serves as a critical tool for companies and researchers focused on enhancing the accuracy and performance of RAG systems in the legal domain. The LegalBench-RAG dataset is publicly available at https://github.com/zeroentropy-cc/legalbenchrag.},
archiveprefix = {arXiv},
copyright = {Creative Commons Attribution 4.0 International},
doi = {10.48550/ARXIV.2408.10343},
eprint = {2408.10343},
file = {:http\://arxiv.org/pdf/2408.10343v1:PDF},
keywords = {Artificial Intelligence (cs.AI), FOS: Computer and information sciences},
primaryclass = {cs.AI},
publisher = {arXiv},
}
@InProceedings{eRAG,
author = {Salemi, Alireza and Zamani, Hamed},
booktitle = {Proceedings of the 47th International ACM SIGIR Conference on Research and Development in Information Retrieval},
title = {Evaluating Retrieval Quality in Retrieval-Augmented Generation},
year = {2024},
address = {New York, NY, USA},
pages = {2395--2400},
publisher = {Association for Computing Machinery},
series = {SIGIR '24},
abstract = {Evaluating retrieval-augmented generation (RAG) presents challenges, particularly for retrieval models within these systems. Traditional end-to-end evaluation methods are computationally expensive. Furthermore, evaluation of the retrieval model's performance based on query-document relevance labels shows a small correlation with the RAG system's downstream performance. We propose a novel evaluation approach, eRAG, where each document in the retrieval list is individually utilized by the large language model within the RAG system. The output generated for each document is then evaluated based on the downstream task ground truth labels. In this manner, the downstream performance for each document serves as its relevance label. We employ various downstream task metrics to obtain document-level annotations and aggregate them using set-based or ranking metrics. Extensive experiments on a wide range of datasets demonstrate that eRAG achieves a higher correlation with downstream RAG performance compared to baseline methods, with improvements in Kendall's tau correlation ranging from 0.168 to 0.494. Additionally, eRAG offers significant computational advantages, improving runtime and consuming up to 50 times less GPU memory than end-to-end evaluation.},
doi = {10.1145/3626772.3657957},
isbn = {9798400704314},
keywords = {evaluation, retrieval quality, retrieval-augmented generation},
location = {Washington DC, USA},
numpages = {6},
url = {https://doi.org/10.1145/3626772.3657957},
}
@Article{CoFE-RAG,
author = {Liu, Jintao and Ding, Ruixue and Zhang, Linhao and Xie, Pengjun and Huang, Fei},
title = {CoFE-RAG: A Comprehensive Full-chain Evaluation Framework for Retrieval-Augmented Generation with Enhanced Data Diversity},
year = {2024},
month = oct,
abstract = {Retrieval-Augmented Generation (RAG) aims to enhance large language models (LLMs) to generate more accurate and reliable answers with the help of the retrieved context from external knowledge sources, thereby reducing the incidence of hallucinations. Despite the advancements, evaluating these systems remains a crucial research area due to the following issues: (1) Limited data diversity: The insufficient diversity of knowledge sources and query types constrains the applicability of RAG systems; (2) Obscure problems location: Existing evaluation methods have difficulty in locating the stage of the RAG pipeline where problems occur; (3) Unstable retrieval evaluation: These methods often fail to effectively assess retrieval performance, particularly when the chunking strategy changes. To tackle these challenges, we propose a Comprehensive Full-chain Evaluation (CoFE-RAG) framework to facilitate thorough evaluation across the entire RAG pipeline, including chunking, retrieval, reranking, and generation. To effectively evaluate the first three phases, we introduce multi-granularity keywords, including coarse-grained and fine-grained keywords, to assess the retrieved context instead of relying on the annotation of golden chunks. Moreover, we release a holistic benchmark dataset tailored for diverse data scenarios covering a wide range of document formats and query types. We demonstrate the utility of the CoFE-RAG framework by conducting experiments to evaluate each stage of RAG systems. Our evaluation method provides unique insights into the effectiveness of RAG systems in handling diverse data scenarios, offering a more nuanced understanding of their capabilities and limitations.},
archiveprefix = {arXiv},
copyright = {arXiv.org perpetual, non-exclusive license},
doi = {10.48550/ARXIV.2410.12248},
eprint = {2410.12248},
file = {:http\://arxiv.org/pdf/2410.12248v1:PDF},
keywords = {Computation and Language (cs.CL), FOS: Computer and information sciences},
primaryclass = {cs.CL},
publisher = {arXiv},
}
@InProceedings{CoURAGE,
author = {Galla, Divyanshi and Hoda, Shaz and Zhang, Meiwei and Quan, Wenzhe and Yang, Tommy Dong and Voyles, Joseph},
booktitle = {Natural Language Processing and Information Systems},
title = {CoURAGE: A Framework to Evaluate RAG Systems},
year = {2024},
address = {Cham},
editor = {Rapp, Amon and Di Caro, Luigi and Meziane, Farid and Sugumaran, Vijayan},
pages = {392--407},
publisher = {Springer Nature Switzerland},
abstract = {In the rapidly evolving domain of Generative AI (GenAI), evaluating models' effectiveness for a business use case remains a significant challenge, particularly due to the diverse array of available metrics, the absence of a standardized framework for their application and varied challenges in use cases. This paper proposes a structured framework designed to assist practitioners, including new adopters, in selecting appropriate metrics for the evaluation of GenAI models, specifically within question answering (QA) systems. The framework focuses on considerations such as data availability, the nature of the dataset, and the necessity for Large Language Models (LLMs) calls for evaluation. By categorizing metrics into quantitative and qualitative types, and distinguishing between scenarios that require golden labels, this framework seeks to streamline the evaluation process.},
isbn = {978-3-031-70242-6},
}
@Article{RAGEval,
author = {Zhu, Kunlun and Luo, Yifan and Xu, Dingling and Yan, Yukun and Liu, Zhenghao and Yu, Shi and Wang, Ruobing and Wang, Shuo and Li, Yishan and Zhang, Nan and Han, Xu and Liu, Zhiyuan and Sun, Maosong},
title = {RAGEval: Scenario Specific RAG Evaluation Dataset Generation Framework},
year = {2024},
month = aug,
abstract = {Retrieval-Augmented Generation (RAG) is a powerful approach that enables large language models (LLMs) to incorporate external knowledge. However, evaluating the effectiveness of RAG systems in specialized scenarios remains challenging due to the high costs of data construction and the lack of suitable evaluation metrics. This paper introduces RAGEval, a framework designed to assess RAG systems across diverse scenarios by generating high-quality documents, questions, answers, and references through a schema-based pipeline. With a focus on factual accuracy, we propose three novel metrics: Completeness, Hallucination, and Irrelevance to evaluate LLM generated responses rigorously. Experimental results show that RAGEval outperforms zero-shot and one-shot methods in terms of clarity, safety, conformity, and richness of generated samples. Furthermore, the use of LLMs for scoring the proposed metrics demonstrates a high level of consistency with human evaluations. RAGEval establishes a new paradigm for evaluating RAG systems in real-world applications. The code and dataset are released at https://github.com/OpenBMB/RAGEval.},
archiveprefix = {arXiv},
copyright = {Creative Commons Attribution Non Commercial Share Alike 4.0 International},
doi = {10.48550/ARXIV.2408.01262},
eprint = {2408.01262},
file = {:http\://arxiv.org/pdf/2408.01262v5:PDF},
keywords = {Computation and Language (cs.CL), Information Retrieval (cs.IR), FOS: Computer and information sciences},
primaryclass = {cs.CL},
publisher = {arXiv},
}
@Article{OCRRAG,
author = {Zhang, Junyuan and Zhang, Qintong and Wang, Bin and Ouyang, Linke and Wen, Zichen and Li, Ying and Chow, Ka-Ho and He, Conghui and Zhang, Wentao},
title = {OCR Hinders RAG: Evaluating the Cascading Impact of OCR on Retrieval-Augmented Generation},
year = {2024},
month = dec,
abstract = {Retrieval-augmented Generation (RAG) enhances Large Language Models (LLMs) by integrating external knowledge to reduce hallucinations and incorporate up-to-date information without retraining. As an essential part of RAG, external knowledge bases are commonly built by extracting structured data from unstructured PDF documents using Optical Character Recognition (OCR). However, given the imperfect prediction of OCR and the inherent non-uniform representation of structured data, knowledge bases inevitably contain various OCR noises. In this paper, we introduce OHRBench, the first benchmark for understanding the cascading impact of OCR on RAG systems. OHRBench includes 8,561 carefully selected unstructured document images from seven real-world RAG application domains, along with 8,498 Q&A pairs derived from multimodal elements in documents, challenging existing OCR solutions used for RAG. To better understand OCR's impact on RAG systems, we identify two primary types of OCR noise: Semantic Noise and Formatting Noise and apply perturbation to generate a set of structured data with varying degrees of each OCR noise. Using OHRBench, we first conduct a comprehensive evaluation of current OCR solutions and reveal that none is competent for constructing high-quality knowledge bases for RAG systems. We then systematically evaluate the impact of these two noise types and demonstrate the trend relationship between the degree of OCR noise and RAG performance. Our OHRBench, including PDF documents, Q&As, and the ground truth structured data are released at: https://github.com/opendatalab/OHR-Bench},
archiveprefix = {arXiv},
copyright = {arXiv.org perpetual, non-exclusive license},
doi = {10.48550/ARXIV.2412.02592},
eprint = {2412.02592},
file = {:http\://arxiv.org/pdf/2412.02592v2:PDF},
keywords = {Computer Vision and Pattern Recognition (cs.CV), FOS: Computer and information sciences},
primaryclass = {cs.CV},
publisher = {arXiv},
}
@Article{RAG-Playground,
author = {Papadimitriou, Ioannis and Gialampoukidis, Ilias and Vrochidis, Stefanos and Kompatsiaris, Ioannis},
title = {RAG Playground: A Framework for Systematic Evaluation of Retrieval Strategies and Prompt Engineering in RAG Systems},
year = {2024},
month = dec,
abstract = {We present RAG Playground, an open-source framework for systematic evaluation of Retrieval-Augmented Generation (RAG) systems. The framework implements and compares three retrieval approaches: naive vector search, reranking, and hybrid vector-keyword search, combined with ReAct agents using different prompting strategies. We introduce a comprehensive evaluation framework with novel metrics and provide empirical results comparing different language models (Llama 3.1 and Qwen 2.5) across various retrieval configurations. Our experiments demonstrate significant performance improvements through hybrid search methods and structured self-evaluation prompting, achieving up to 72.7% pass rate on our multi-metric evaluation framework. The results also highlight the importance of prompt engineering in RAG systems, with our custom-prompted agents showing consistent improvements in retrieval accuracy and response quality.},
archiveprefix = {arXiv},
copyright = {arXiv.org perpetual, non-exclusive license},
doi = {10.48550/ARXIV.2412.12322},
eprint = {2412.12322},
file = {:http\://arxiv.org/pdf/2412.12322v1:PDF},
keywords = {Machine Learning (cs.LG), Artificial Intelligence (cs.AI), Computation and Language (cs.CL), Information Retrieval (cs.IR), FOS: Computer and information sciences},
primaryclass = {cs.LG},
publisher = {arXiv},
}
@Article{ArabicRAGEval,
author = {Mahboub, Ali and Za'ter, Muhy Eddin and Al-Rfooh, Bashar and Estaitia, Yazan and Jaljuli, Adnan and Hakouz, Asma},
title = {Evaluation of Semantic Search and its Role in Retrieved-Augmented-Generation (RAG) for Arabic Language},
year = {2024},
month = mar,
abstract = {The latest advancements in machine learning and deep learning have brought forth the concept of semantic similarity, which has proven immensely beneficial in multiple applications and has largely replaced keyword search. However, evaluating semantic similarity and conducting searches for a specific query across various documents continue to be a complicated task. This complexity is due to the multifaceted nature of the task, the lack of standard benchmarks, whereas these challenges are further amplified for Arabic language. This paper endeavors to establish a straightforward yet potent benchmark for semantic search in Arabic. Moreover, to precisely evaluate the effectiveness of these metrics and the dataset, we conduct our assessment of semantic search within the framework of retrieval augmented generation (RAG).},
archiveprefix = {arXiv},
copyright = {Creative Commons Attribution 4.0 International},
doi = {10.48550/ARXIV.2403.18350},
eprint = {2403.18350},
file = {:http\://arxiv.org/pdf/2403.18350v2:PDF},
keywords = {Computation and Language (cs.CL), FOS: Computer and information sciences},
primaryclass = {cs.CL},
publisher = {arXiv},
}
@Article{TelecomRAGEval,
author = {Roychowdhury, Sujoy and Soman, Sumit and Ranjani, H G and Gunda, Neeraj and Chhabra, Vansh and Bala, Sai Krishna},
title = {Evaluation of RAG Metrics for Question Answering in the Telecom Domain},
year = {2024},
month = jul,
abstract = {Retrieval Augmented Generation (RAG) is widely used to enable Large Language Models (LLMs) perform Question Answering (QA) tasks in various domains. However, RAG based on open-source LLM for specialized domains has challenges of evaluating generated responses. A popular framework in the literature is the RAG Assessment (RAGAS), a publicly available library which uses LLMs for evaluation. One disadvantage of RAGAS is the lack of details of derivation of numerical value of the evaluation metrics. One of the outcomes of this work is a modified version of this package for few metrics (faithfulness, context relevance, answer relevance, answer correctness, answer similarity and factual correctness) through which we provide the intermediate outputs of the prompts by using any LLMs. Next, we analyse the expert evaluations of the output of the modified RAGAS package and observe the challenges of using it in the telecom domain. We also study the effect of the metrics under correct vs. wrong retrieval and observe that few of the metrics have higher values for correct retrieval. We also study for differences in metrics between base embeddings and those domain adapted via pre-training and fine-tuning. Finally, we comment on the suitability and challenges of using these metrics for in-the-wild telecom QA task.},
archiveprefix = {arXiv},
copyright = {arXiv.org perpetual, non-exclusive license},
doi = {10.48550/ARXIV.2407.12873},
eprint = {2407.12873},
file = {:http\://arxiv.org/pdf/2407.12873v1:PDF},
keywords = {Computation and Language (cs.CL), Artificial Intelligence (cs.AI), Information Retrieval (cs.IR), Machine Learning (cs.LG), FOS: Computer and information sciences, I.2.7, 68T50},
primaryclass = {cs.CL},
publisher = {arXiv},
}
@InProceedings{CRAG,
author = {Yang, Xiao and Sun, Kai and Xin, Hao and Sun, Yushi and Bhalla, Nikita and Chen, Xiangsen and Choudhary, Sajal and Gui, Rongze Daniel and Jiang, Ziran Will and Jiang, Ziyu and Kong, Lingkun and Moran, Brian and Wang, Jiaqi and Xu, Yifan Ethan and Yan, An and Yang, Chenyu and Yuan, Eting and Zha, Hanwen and Tang, Nan and Chen, Lei and Scheffer, Nicolas and Liu, Yue and Shah, Nirav and Wanga, Rakesh and Kumar, Anuj and Yih, Wen-tau and Dong, Xin Luna},
booktitle = {Advances in Neural Information Processing Systems},
title = {CRAG - Comprehensive RAG Benchmark},
year = {2024},
editor = {A. Globerson and L. Mackey and D. Belgrave and A. Fan and U. Paquet and J. Tomczak and C. Zhang},
pages = {10470--10490},
publisher = {Curran Associates, Inc.},
volume = {37},
url = {https://proceedings.neurips.cc/paper_files/paper/2024/file/1435d2d0fca85a84d83ddcb754f58c29-Paper-Datasets_and_Benchmarks_Track.pdf},
}
@InProceedings{FreshLLMs,
author = {Vu, Tu and Iyyer, Mohit and Wang, Xuezhi and Constant, Noah and Wei, Jerry and Wei, Jason and Tar, Chris and Sung, Yun-Hsuan and Zhou, Denny and Le, Quoc and Luong, Thang},
booktitle = {Findings of the Association for Computational Linguistics: ACL 2024},
title = {{F}resh{LLM}s: Refreshing Large Language Models with Search Engine Augmentation},
year = {2024},
address = {Bangkok, Thailand},
editor = {Ku, Lun-Wei and Martins, Andre and Srikumar, Vivek},
month = aug,
pages = {13697--13720},
publisher = {Association for Computational Linguistics},
abstract = {Since most large language models (LLMs) are trained once and never updated, they struggle to dynamically adapt to our ever-changing world. In this work, we present FreshQA, a dynamic QA benchmark that tests a model's ability to answer questions that may require reasoning over up-to-date world knowledge. We develop a two-mode human evaluation procedure to measure both correctness and hallucination, which we use to benchmark both closed and open-source LLMs by collecting {\ensuremath{>}}50K human judgments. We observe that all LLMs struggle to answer questions that require fast-changing world knowledge as well as questions with false premises that need to be debunked. In response, we develop FreshPrompt, a few-shot prompting method that curates and organizes relevant information from a search engine into an LLM's prompt. Our experiments show that FreshPrompt outperforms both competing search engine-augmented prompting methods such as Self-Ask (Press et al., 2022) as well as commercial systems such as Perplexity.AI. To facilitate future work, we additionally develop FreshEval, a reliable autorater for quick evaluation and comparison on FreshQA. Our latest results with FreshEval suggest that open-source LLMs such as Mixtral (Jiang et al., 2024), when combined with FreshPrompt, are competitive with closed-source and commercial systems on search-augmented QA.},
doi = {10.18653/v1/2024.findings-acl.813},
url = {https://aclanthology.org/2024.findings-acl.813/},
}
@Article{InstructRAG,
author = {Dong, Guanting and Song, Xiaoshuai and Zhu, Yutao and Qiao, Runqi and Dou, Zhicheng and Wen, Ji-Rong},
title = {Toward General Instruction-Following Alignment for Retrieval-Augmented Generation},
year = {2024},
month = oct,
abstract = {Following natural instructions is crucial for the effective application of Retrieval-Augmented Generation (RAG) systems. Despite recent advancements in Large Language Models (LLMs), research on assessing and improving instruction-following (IF) alignment within the RAG domain remains limited. To address this issue, we propose VIF-RAG, the first automated, scalable, and verifiable synthetic pipeline for instruction-following alignment in RAG systems. We start by manually crafting a minimal set of atomic instructions (<100) and developing combination rules to synthesize and verify complex instructions for a seed set. We then use supervised models for instruction rewriting while simultaneously generating code to automate the verification of instruction quality via a Python executor. Finally, we integrate these instructions with extensive RAG and general data samples, scaling up to a high-quality VIF-RAG-QA dataset (>100k) through automated processes. To further bridge the gap in instruction-following auto-evaluation for RAG systems, we introduce FollowRAG Benchmark, which includes approximately 3K test samples, covering 22 categories of general instruction constraints and four knowledge-intensive QA datasets. Due to its robust pipeline design, FollowRAG can seamlessly integrate with different RAG benchmarks. Using FollowRAG and eight widely-used IF and foundational abilities benchmarks for LLMs, we demonstrate that VIF-RAG markedly enhances LLM performance across a broad range of general instruction constraints while effectively leveraging its capabilities in RAG scenarios. Further analysis offers practical insights for achieving IF alignment in RAG systems. Our code and datasets are released at https://FollowRAG.github.io.},
archiveprefix = {arXiv},
copyright = {arXiv.org perpetual, non-exclusive license},
doi = {10.48550/ARXIV.2410.09584},
eprint = {2410.09584},
file = {:http\://arxiv.org/pdf/2410.09584v1:PDF},
keywords = {Computation and Language (cs.CL), Artificial Intelligence (cs.AI), Information Retrieval (cs.IR), Machine Learning (cs.LG), FOS: Computer and information sciences},
primaryclass = {cs.CL},
publisher = {arXiv},
}
@Article{CRUD-RAG,
author = {Lyu, Yuanjie and Li, Zhiyu and Niu, Simin and Xiong, Feiyu and Tang, Bo and Wang, Wenjin and Wu, Hao and Liu, Huanyong and Xu, Tong and Chen, Enhong},
journal = {ACM Trans. Inf. Syst.},
title = {CRUD-RAG: A Comprehensive Chinese Benchmark for Retrieval-Augmented Generation of Large Language Models},
year = {2025},
issn = {1046-8188},
month = jan,
number = {2},
volume = {43},
abstract = {Retrieval-augmented generation (RAG) is a technique that enhances the capabilities of large language models (LLMs) by incorporating external knowledge sources. This method addresses common LLM limitations, including outdated information and the tendency to produce inaccurate “hallucinated” content. However, evaluating RAG systems is a challenge. Most benchmarks focus primarily on question-answering applications, neglecting other potential scenarios where RAG could be beneficial. Accordingly, in the experiments, these benchmarks often assess only the LLM components of the RAG pipeline or the retriever in knowledge-intensive scenarios, overlooking the impact of external knowledge base construction and the retrieval component on the entire RAG pipeline in non-knowledge-intensive scenarios. To address these issues, this article constructs a large-scale and more comprehensive benchmark and evaluates all the components of RAG systems in various RAG application scenarios. Specifically, we refer to the CRUD actions that describe interactions between users and knowledge bases and also categorize the range of RAG applications into four distinct types—create, read, update, and delete (CRUD). “Create” refers to scenarios requiring the generation of original, varied content. “Read” involves responding to intricate questions in knowledge-intensive situations. “Update” focuses on revising and rectifying inaccuracies or inconsistencies in pre-existing texts. “Delete” pertains to the task of summarizing extensive texts into more concise forms. For each of these CRUD categories, we have developed different datasets to evaluate the performance of RAG systems. We also analyze the effects of various components of the RAG system, such as the retriever, context length, knowledge base construction, and LLM. Finally, we provide useful insights for optimizing the RAG technology for different scenarios. The source code is available at GitHub: .},
address = {New York, NY, USA},
articleno = {41},
doi = {10.1145/3701228},
issue_date = {March 2025},
keywords = {Retrieval-Augmented Generation, Large Language Models, Evaluation},
numpages = {32},
publisher = {Association for Computing Machinery},
url = {https://doi.org/10.1145/3701228},
}
@InProceedings{CDQA,
author = {Xu, Zhikun and Li, Yinghui and Ding, Ruixue and Wang, Xinyu and Chen, Boli and Jiang, Yong and Zheng, Haitao and Lu, Wenlian and Xie, Pengjun and Huang, Fei},
booktitle = {Proceedings of the 31st International Conference on Computational Linguistics},
title = {Let {LLM}s Take on the Latest Challenges! A {C}hinese Dynamic Question Answering Benchmark},
year = {2025},
address = {Abu Dhabi, UAE},
editor = {Rambow, Owen and Wanner, Leo and Apidianaki, Marianna and Al-Khalifa, Hend and Eugenio, Barbara Di and Schockaert, Steven},
month = jan,
pages = {10435--10448},
publisher = {Association for Computational Linguistics},
abstract = {How to better evaluate the capabilities of Large Language Models (LLMs) is the focal point and hot topic in current LLMs research. Previous work has noted that due to the extremely high cost of iterative updates of LLMs, they are often unable to answer the latest dynamic questions well. To promote the improvement of Chinese LLMs' ability to answer dynamic questions, in this paper, we introduce CDQA, a Chinese Dynamic QA benchmark containing question-answer pairs related to the latest news on the Chinese Internet. We obtain high-quality data through a pipeline that combines humans and models, and carefully classify the samples according to the frequency of answer changes to facilitate a more fine-grained observation of LLMs' capabilities. We have also evaluated and analyzed mainstream and advanced Chinese LLMs on CDQA. Extensive experiments and valuable insights suggest that our proposed CDQA is challenging and worthy of more further study. We believe that the benchmark we provide will become one of the key data resources for improving LLMs' Chinese question-answering ability in the future.},
url = {https://aclanthology.org/2025.coling-main.695/},
}
@Article{MTRAG,
author = {Katsis, Yannis and Rosenthal, Sara and Fadnis, Kshitij and Gunasekara, Chulaka and Lee, Young-Suk and Popa, Lucian and Shah, Vraj and Zhu, Huaiyu and Contractor, Danish and Danilevsky, Marina},
title = {MTRAG: A Multi-Turn Conversational Benchmark for Evaluating Retrieval-Augmented Generation Systems},
year = {2025},
month = jan,
abstract = {Retrieval-augmented generation (RAG) has recently become a very popular task for Large Language Models (LLMs). Evaluating them on multi-turn RAG conversations, where the system is asked to generate a response to a question in the context of a preceding conversation is an important and often overlooked task with several additional challenges. We present MTRAG: an end-to-end human-generated multi-turn RAG benchmark that reflects several real-world properties across diverse dimensions for evaluating the full RAG pipeline. MTRAG contains 110 conversations averaging 7.7 turns each across four domains for a total of 842 tasks. We also explore automation paths via synthetic data and LLM-as-a-Judge evaluation. Our human and automatic evaluations show that even state-of-the-art LLM RAG systems struggle on MTRAG. We demonstrate the need for strong retrieval and generation systems that can handle later turns, unanswerable questions, non-standalone questions, and multiple domains. MTRAG is available at https://github.com/ibm/mt-rag-benchmark.},
archiveprefix = {arXiv},
copyright = {Creative Commons Attribution Share Alike 4.0 International},
doi = {10.48550/ARXIV.2501.03468},
eprint = {2501.03468},
file = {:http\://arxiv.org/pdf/2501.03468v1:PDF},
keywords = {Computation and Language (cs.CL), Artificial Intelligence (cs.AI), FOS: Computer and information sciences},
primaryclass = {cs.CL},
publisher = {arXiv},
}
@Article{U-NIAH,
author = {Gao, Yunfan and Xiong, Yun and Wu, Wenlong and Huang, Zijing and Li, Bohan and Wang, Haofen},
title = {U-NIAH: Unified RAG and LLM Evaluation for Long Context Needle-In-A-Haystack},
year = {2025},
month = mar,
abstract = {Recent advancements in Large Language Models (LLMs) have expanded their context windows to unprecedented lengths, sparking debates about the necessity of Retrieval-Augmented Generation (RAG). To address the fragmented evaluation paradigms and limited cases in existing Needle-in-a-Haystack (NIAH), this paper introduces U-NIAH, a unified framework that systematically compares LLMs and RAG methods in controlled long context settings. Our framework extends beyond traditional NIAH by incorporating multi-needle, long-needle, and needle-in-needle configurations, along with different retrieval settings, while leveraging the synthetic Starlight Academy dataset (a fictional magical universe) to eliminate biases from pre-trained knowledge. Through extensive experiments, we investigate three research questions: (1) performance trade-offs between LLMs and RAG, (2) error patterns in RAG, and (3) RAG's limitations in complex settings. Our findings show that RAG significantly enhances smaller LLMs by mitigating the "lost-in-the-middle" effect and improving robustness, achieving an 82.58% win-rate over LLMs. However, we observe that retrieval noise and reverse chunk ordering degrade performance, while surprisingly, advanced reasoning LLMs exhibit reduced RAG compatibility due to sensitivity to semantic distractors. We identify typical error patterns including omission due to noise, hallucination under high noise critical condition, and self-doubt behaviors. Our work not only highlights the complementary roles of RAG and LLMs, but also provides actionable insights for optimizing deployments. Code: https://github.com/Tongji-KGLLM/U-NIAH.},
archiveprefix = {arXiv},
copyright = {arXiv.org perpetual, non-exclusive license},
doi = {10.48550/ARXIV.2503.00353},
eprint = {2503.00353},
file = {:http\://arxiv.org/pdf/2503.00353v1:PDF},
keywords = {Computation and Language (cs.CL), Information Retrieval (cs.IR), FOS: Computer and information sciences},
primaryclass = {cs.CL},
publisher = {arXiv},
}
@InProceedings{FairnessRAG,
author = {Wu, Xuyang and Li, Shuowei and Wu, Hsin-Tai and Tao, Zhiqiang and Fang, Yi},
booktitle = {Proceedings of the 31st International Conference on Computational Linguistics},
title = {Does {RAG} Introduce Unfairness in {LLM}s? Evaluating Fairness in Retrieval-Augmented Generation Systems},
year = {2025},
address = {Abu Dhabi, UAE},
editor = {Rambow, Owen and Wanner, Leo and Apidianaki, Marianna and Al-Khalifa, Hend and Eugenio, Barbara Di and Schockaert, Steven},
month = jan,
pages = {10021--10036},
publisher = {Association for Computational Linguistics},
abstract = {Retrieval-Augmented Generation (RAG) has recently gained significant attention for its enhanced ability to integrate external knowledge sources into open-domain question answering (QA) tasks. However, it remains unclear how these models address fairness concerns, particularly with respect to sensitive attributes such as gender, geographic location, and other demographic factors. First, as language models evolve to prioritize utility, like improving exact match accuracy, fairness considerations may have been largely overlooked. Second, the complex, multi-component architecture of RAG methods poses challenges in identifying and mitigating biases, as each component is optimized for distinct objectives. In this paper, we aim to empirically evaluate fairness in several RAG methods. We propose a fairness evaluation framework tailored to RAG, using scenario-based questions and analyzing disparities across demographic attributes. Our experimental results indicate that, despite recent advances in utility-driven optimization, fairness issues persist in both the retrieval and generation stages. These findings underscore the need for targeted interventions to address fairness concerns throughout the RAG pipeline. The dataset and code used in this study are publicly available at this GitHub Repository.},
url = {https://aclanthology.org/2025.coling-main.669/},
}
@Article{SCARF,
author = {Rengo, Mattia and Beadini, Senad and Alfano, Domenico and Abbruzzese, Roberto},
title = {A System for Comprehensive Assessment of RAG Frameworks},
year = {2025},
month = apr,
abstract = {Retrieval Augmented Generation (RAG) has emerged as a standard paradigm for enhancing the factual accuracy and contextual relevance of Large Language Models (LLMs) by integrating retrieval mechanisms. However, existing evaluation frameworks fail to provide a holistic black-box approach to assessing RAG systems, especially in real-world deployment scenarios. To address this gap, we introduce SCARF (System for Comprehensive Assessment of RAG Frameworks), a modular and flexible evaluation framework designed to benchmark deployed RAG applications systematically. SCARF provides an end-to-end, black-box evaluation methodology, enabling a limited-effort comparison across diverse RAG frameworks. Our framework supports multiple deployment configurations and facilitates automated testing across vector databases and LLM serving strategies, producing a detailed performance report. Moreover, SCARF integrates practical considerations such as response coherence, providing a scalable and adaptable solution for researchers and industry professionals evaluating RAG applications. Using the REST APIs interface, we demonstrate how SCARF can be applied to real-world scenarios, showcasing its flexibility in assessing different RAG frameworks and configurations. SCARF is available at GitHub repository.},
archiveprefix = {arXiv},
copyright = {Creative Commons Attribution 4.0 International},
doi = {10.48550/ARXIV.2504.07803},
eprint = {2504.07803},
file = {:http\://arxiv.org/pdf/2504.07803v1:PDF},
keywords = {Computation and Language (cs.CL), Artificial Intelligence (cs.AI), Machine Learning (cs.LG), FOS: Computer and information sciences},
primaryclass = {cs.CL},
publisher = {arXiv},
}
@Misc{TTFT,
author = {Anthropic},
month = jan,
title = {Reducing latency},
year = {2025},
url = {https://docs.anthropic.com/en/docs/test-and-evaluate/strengthen-guardrails/reduce-latency},
}