[1] Antematter team. Optimizing retrieval-augmented generation with advanced chunking techniques: A comparative study, 2024. Accessed: 2025-03-31.

[2] Emily M. Bender, Timnit Gebru, Angelina McMillan-Major, and Shmargaret Shmitchell. On the dangers of stochastic parrots: Can language models be too big? In Proceedings of FAccT, 2021.

[3] Ilias Chalkidis, Manos Fergadiotis, Prodromos Malakasiotis, Nikolaos Aletras, and Ion Androutsopoulos. LEGAL-BERT: The muppets straight out of law school. In Trevor Cohn, Yulan He, and Yang Liu, editors, Findings of the Association for Computational Linguistics: EMNLP 2020, pages 2898–2904, Online, November 2020. Association for Computational Linguistics. doi: 10.18653/v1/2020.findings-emnlp.261. URL https://aclanthology.org/2020.findings-emnlp.261/.

[4] Shahul Es, Jithin James, Luis Espinosa-Anke, and Steven Schockaert. RAGAS: Automated evaluation of retrieval augmented generation, 2023. URL https://doi.org/10.48550/arXiv.2309.15217.

[5] Naman Gupta. BGE-M3 vs OpenAI embeddings: A comparative study. https://naman1011.medium.com/bge-m3-model-vs-openai-embeddings-e6d6cda27d0c, 2024.

[6] Gautier Izacard and Edouard Grave. Leveraging passage retrieval with generative models for open domain question answering. In Proceedings of EACL, 2021.

[7] Jungwoo Kang, Jinhyuk Lee, and Jaewoo Kang. Knowledge graph-augmented language models for knowledge-grounded dialogue generation. arXiv preprint arXiv:2305.18846, 2023.

[8] Vladimir Karpukhin, Barlas Oguz, Sewon Min, Patrick Lewis, Ledell Wu, Sergey Edunov, Danqi Chen, and Wen-tau Yih. Dense passage retrieval for open-domain question answering. In Proceedings of EMNLP, 2020.

[9] Urvashi Khandelwal, Omer Levy, Dan Jurafsky, Luke Zettlemoyer, and Mike Lewis. Generalization through memorization: Nearest neighbor language models. arXiv preprint arXiv:1911.00172, 2020.

[10] Joon Lee, Hyoungho Yoon, and Hyeoun-Ae Park. Explainable AI in healthcare: From black box to interpretable models. Healthcare Informatics Research, 27(1):1–9, 2021.

[11] Patrick Lewis, Ethan Perez, Aleksandra Piktus, Fabio Petroni, Vladimir Karpukhin, Naman Goyal, Angela Fan, Vishrav Chaudhary, Tim Rocktäschel, and Sebastian Riedel. Retrieval-augmented generation for knowledge-intensive NLP tasks. In Advances in Neural Information Processing Systems, 2020a.

[12] Patrick Lewis, Ethan Perez, Aleksandra Piktus, Fabio Petroni, Vladimir Karpukhin, Naman Goyal, Heinrich Küttler, Mike Lewis, Wen-tau Yih, Tim Rocktäschel, Sebastian Riedel, and Douwe Kiela. Retrieval-augmented generation for knowledge-intensive NLP tasks, 2020b. URL https://doi.org/10.48550/arXiv.2005.11401.

[13] Xiao Liu, Zihan Zhou, Tianyu Zhou, Maosong Sun, and Tianyu Wang. BGE-M3: A multi-function embedding model for dense, sparse and multi-vector retrieval. arXiv preprint arXiv:2402.03216, 2024. URL https://arxiv.org/abs/2402.03216.

[14] Zuhong Liu, Charles-Elie Simon, and Fabien Caspani. Passage segmentation of documents for extractive question answering, 2025. URL https://doi.org/10.48550/arXiv.2501.09940.

[15] Yi Luan, Kaitao Tang, Mandar Joshi Gupta, and Luke Zettlemoyer. Sparse retrieval for question answering. In Proceedings of ACL, 2021.

[16] Margaret Mitchell, Simone Wu, Andrew Zaldivar, Parker Barnes, Lucy Vasserman, Ben Hutchinson, Elena Spitzer, Inioluwa Deborah Raji, and Timnit Gebru. Model cards for model reporting. In Proceedings of the Conference on Fairness, Accountability, and Transparency, 2019.

[17] Niklas Muennighoff, Nouamane Tazi, et al. MTEB: Massive text embedding benchmark. https://huggingface.co/spaces/mteb/leaderboard, 2023.

[18] Taichi Nishikawa, Soichiro Hidaka, Sho Yokoi, and Hideki Nakayama. Towards entity-enhanced RAG: Augmenting retrieval augmented generation with entity annotation. In Proceedings of the 2022 Conference on Empirical Methods in Natural Language Processing, 2022.

[19] Ori Ram, Yoav Levine, Itay Dalmedigos, Dor Muhlgay, Amnon Shashua, Kevin Leyton-Brown, and Yoav Shoham. In-context retrieval-augmented language models. Transactions on Machine Learning Research, 2023.

[20] Nils Reimers and Iryna Gurevych. Sentence-BERT: Sentence embeddings using siamese BERT-networks. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing, 2019.

[21] Krystian Safjan. From fixed-size to NLP chunking - a deep dive into text chunking techniques, 2023. Accessed: 2025-03-31.

[22] Emily Sheng, Kai-Wei Chang, Premkumar Natarajan, and Nanyun Peng. The woman worked as a babysitter: On biases in language generation. In Proceedings of EMNLP, 2019.

[23] Nandan Thakur, Nils Reimers, Andreas Rücklé, Abhishek Srivastava, and Iryna Gurevych. BEIR: A heterogeneous benchmark for zero-shot evaluation of information retrieval models. In Proceedings of the Neural Information Processing Systems Datasets and Benchmarks Track, 2021.

[24] Rui Wang and Lili Zhao. AI in education and policy-making: A review of recent advances. Educational Technology Research and Development, 71(2):135–152, 2023.

[25] Xiang Wang, Xiangyu Dong, Fuzheng Zhang, Liwei Wang, and Xing Xie. KEPLER: A unified model for knowledge embedding and pre-trained language representation, 2021. URL https://arxiv.org/abs/1911.06136.

[26] Andrew Yates, Sebastian Hofstätter, and Guido Zuccon. Pretrained transformers for text ranking: BERT and beyond. arXiv preprint arXiv:2104.08663, 2021.

[27] Zijie Zhong, Hanwen Liu, Xiaoya Cui, Xiaofan Zhang, and Zengchang Qin. Mix-of-granularity: Optimize the chunking granularity for retrieval-augmented generation, 2025. URL https://doi.org/10.48550/arXiv.2406.00456.