Artificial Intelligence Based PDF and Document Extractor Using Retrieval Augmented Generation

Swanand Kulkarni; Kalpana Thakre; Varad Kulkarni; Aneesh Pandit; Atharva Naik

doi:10.2991/978-94-6463-831-8_52

Authors

Swanand Kulkarni¹,*, Kalpana Thakre¹, Varad Kulkarni¹, Aneesh Pandit¹, Atharva Naik¹

¹Marathwada Mitramandal’s College of Engineering, Pune, 411052, India

*Corresponding author. Email: swanandkulkarni2021.comp@mmcoe.edu.in

Corresponding Author

Swanand Kulkarni

Available Online 31 August 2025.

DOI: 10.2991/978-94-6463-831-8_52
Keywords: Artificial Intelligence (AI); Retrieval Augmented Generation (RAG); Natural Language Processing (NLP); Large Language Models (LLM)
Abstract: This paper introduces a Retrieval Augmented Generation (RAG) system for extracting semantically meaningful information from PDF and DOCX documents. The retriever uses Gemini embeddings with FAISS indexing, and the generator is built on Gemini 2.0 Flash to produce fast, context-aware responses. Evaluated on 30 documents, the system outperforms conventional approaches in semantic accuracy and retrieval precision. Its modular, scalable design applies to a range of real-world document understanding tasks and leaves scope for future real-time and multimodal extensions.
Copyright: © 2025 The Author(s)
Open Access: This chapter is licensed under the terms of the Creative Commons Attribution-NonCommercial 4.0 International License (http://creativecommons.org/licenses/by-nc/4.0/), which permits any noncommercial use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license and indicate if changes were made.
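
The retrieve-then-generate flow summarized in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the hashed bag-of-words `embed` function is a toy stand-in for Gemini embeddings, the hypothetical `FlatIndex` class does a brute-force inner-product search of the kind an exact FAISS index performs, and `build_prompt` only assembles the text that a generator such as Gemini 2.0 Flash would receive. All names, the vector size, and the sample chunks are assumptions made for illustration.

```python
import hashlib
import math
import re

DIM = 1024  # hypothetical vector size for the toy embedder


def embed(text):
    """Toy stand-in for an embedding model: hashed bag-of-words, L2-normalised."""
    vec = [0.0] * DIM
    for token in re.findall(r"[a-z0-9]+", text.lower()):
        bucket = int(hashlib.md5(token.encode("utf-8")).hexdigest(), 16) % DIM
        vec[bucket] += 1.0
    norm = math.sqrt(sum(v * v for v in vec))
    return [v / norm for v in vec] if norm else vec


class FlatIndex:
    """Brute-force inner-product search over stored chunk vectors (illustrative only)."""

    def __init__(self):
        self.vectors = []
        self.chunks = []

    def add(self, chunks):
        for chunk in chunks:
            self.vectors.append(embed(chunk))
            self.chunks.append(chunk)

    def search(self, query, k=2):
        """Return the k highest-scoring (score, chunk) pairs for the query."""
        q = embed(query)
        scored = [
            (sum(a * b for a, b in zip(q, v)), chunk)
            for v, chunk in zip(self.vectors, self.chunks)
        ]
        scored.sort(key=lambda pair: pair[0], reverse=True)
        return scored[:k]


def build_prompt(question, hits):
    """Assemble the retrieved chunks and the question into a generator prompt."""
    context = "\n".join(f"- {chunk}" for _, chunk in hits)
    return (
        "Answer using only the context below.\n\n"
        f"Context:\n{context}\n\nQuestion: {question}"
    )


# Made-up document chunks, standing in for text extracted from a PDF or DOCX file.
index = FlatIndex()
index.add([
    "The invoice total is 450 euros due in March.",
    "The contract term lasts three years.",
    "Shipping takes five business days.",
])

question = "What is the invoice total?"
hits = index.search(question, k=2)
prompt = build_prompt(question, hits)
print(prompt)
```

In a real deployment the toy embedder and index would presumably be replaced by calls to an embedding service and a vector library, and the prompt would be sent to the language model; the control flow (embed, index, search, prompt) would stay the same.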

Volume Title: Proceeding of the 1st International Conference on Lifespan Innovation (ICLI 2025)
Series: Advances in Health Sciences Research
Publication Date: 31 August 2025
ISBN: 978-94-6463-831-8
ISSN: 2468-5739
DOI: 10.2991/978-94-6463-831-8_52

Cite this article

TY  - CONF
AU  - Swanand Kulkarni
AU  - Kalpana Thakre
AU  - Varad Kulkarni
AU  - Aneesh Pandit
AU  - Atharva Naik
PY  - 2025
DA  - 2025/08/31
TI  - Artificial Intelligence Based PDF and Document Extractor Using Retrieval Augmented Generation
BT  - Proceeding of the 1st International Conference on Lifespan Innovation (ICLI 2025)
PB  - Atlantis Press
SP  - 428
EP  - 435
SN  - 2468-5739
UR  - https://doi.org/10.2991/978-94-6463-831-8_52
DO  - 10.2991/978-94-6463-831-8_52
ID  - Kulkarni2025
ER  -
