Proceeding of the 1st International Conference on Lifespan Innovation (ICLI 2025)

Artificial Intelligence Based PDF and Document Extractor Using Retrieval Augmented Generation

Authors
Swanand Kulkarni¹,*, Kalpana Thakre¹, Varad Kulkarni¹, Aneesh Pandit¹, Atharva Naik¹
¹ Marathwada Mitramandal’s College of Engineering, Pune, 411052, India
*Corresponding author. Email: swanandkulkarni2021.comp@mmcoe.edu.in
Available Online 31 August 2025.
DOI
10.2991/978-94-6463-831-8_52
Keywords
Artificial Intelligence (AI); Retrieval Augmented Generation (RAG); Natural Language Processing (NLP); Large Language Models (LLM)
Abstract

This paper introduces a Retrieval Augmented Generation (RAG) system for extracting semantically meaningful information from PDF and DOCX documents. Its Retriever uses Gemini Embeddings with FAISS indexing, and its Generator is built on Gemini 2.0 Flash to produce fast, context-aware responses. Evaluated on 30 documents, the system outperforms conventional approaches in semantic accuracy and retrieval precision. It is modular and scalable, applies to diverse real-world document understanding tasks, and leaves scope for future real-time and multimodal extensions.
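
The retrieve-then-generate flow summarized in the abstract can be illustrated with a minimal, self-contained sketch. It is not the authors' implementation, which this page does not show. It rests on loud assumptions: a toy hashed bag-of-words function stands in for Gemini Embeddings, a brute-force inner-product search stands in for a FAISS index, and the generator step only assembles a prompt rather than calling Gemini 2.0 Flash. All names (`embed`, `FlatIndex`, `build_prompt`, `DIM`) and the sample chunks are hypothetical.

```python
import math
import re
import zlib

DIM = 64  # toy embedding size (assumption); real embedding models use far more


def embed(text):
    """Toy stand-in for an embedding model: hashed bag-of-words, L2-normalised."""
    vec = [0.0] * DIM
    for token in re.findall(r"[a-z0-9]+", text.lower()):
        vec[zlib.crc32(token.encode("utf-8")) % DIM] += 1.0
    norm = math.sqrt(sum(v * v for v in vec)) or 1.0  # avoid dividing by zero
    return [v / norm for v in vec]


class FlatIndex:
    """Brute-force inner-product search, standing in for a vector index."""

    def __init__(self):
        self.vectors = []
        self.chunks = []

    def add(self, chunk):
        self.vectors.append(embed(chunk))
        self.chunks.append(chunk)

    def search(self, query, k=2):
        q = embed(query)
        scored = [
            (sum(a * b for a, b in zip(q, v)), chunk)
            for v, chunk in zip(self.vectors, self.chunks)
        ]
        # Highest similarity first; on unit vectors the inner product is cosine similarity.
        scored.sort(key=lambda pair: pair[0], reverse=True)
        return scored[:k]


def build_prompt(question, retrieved):
    """Assemble the text a generator model would receive (no model call is made here)."""
    context = "\n".join(f"- {chunk}" for _, chunk in retrieved)
    return f"Answer using only this context:\n{context}\n\nQuestion: {question}"


# Hypothetical chunks, as if already extracted from PDF or DOCX files.
chunks = [
    "The retriever embeds each chunk and stores the vectors in an index.",
    "The generator answers using the retrieved chunks as context.",
    "Documents are parsed from PDF and DOCX files before chunking.",
]

index = FlatIndex()
for chunk in chunks:
    index.add(chunk)

question = "Where does the retriever store the chunk vectors?"
prompt = build_prompt(question, index.search(question, k=2))
print(prompt)
```

In a fuller system the toy `embed` would be replaced by calls to an embedding model and `FlatIndex` by a dedicated vector index; the separation between retrieval and prompt assembly would stay the same, which is what makes such a design modular.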

Copyright
© 2025 The Author(s)
Open Access
Open Access This chapter is licensed under the terms of the Creative Commons Attribution-NonCommercial 4.0 International License (http://creativecommons.org/licenses/by-nc/4.0/), which permits any noncommercial use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license and indicate if changes were made.

Volume Title
Proceeding of the 1st International Conference on Lifespan Innovation (ICLI 2025)
Series
Advances in Health Sciences Research
Publication Date
31 August 2025
ISBN
978-94-6463-831-8
ISSN
2468-5739

Cite this article

TY  - CONF
AU  - Swanand Kulkarni
AU  - Kalpana Thakre
AU  - Varad Kulkarni
AU  - Aneesh Pandit
AU  - Atharva Naik
PY  - 2025
DA  - 2025/08/31
TI  - Artificial Intelligence Based PDF and Document Extractor Using Retrieval Augmented Generation
BT  - Proceeding of the 1st International Conference on Lifespan Innovation (ICLI 2025)
PB  - Atlantis Press
SP  - 428
EP  - 435
SN  - 2468-5739
UR  - https://doi.org/10.2991/978-94-6463-831-8_52
DO  - 10.2991/978-94-6463-831-8_52
ID  - Kulkarni2025
ER  -