Optimizing Table Extraction from PDFs: A Dual-Approach Solution for Native and Scanned Documents

Wilsen Vesakha Lymantama; Zuherman Rustam; Fevi Novkaniza

doi:10.2991/978-94-6463-854-7_25

Authors

Wilsen Vesakha Lymantama¹,*, Zuherman Rustam¹, Fevi Novkaniza¹

¹Department of Mathematics, Universitas Indonesia, Depok, Indonesia

*Corresponding author. Email: wilsen.vesakha@ui.ac.id

Corresponding Author

Wilsen Vesakha Lymantama

Available Online 11 November 2025.

DOI: 10.2991/978-94-6463-854-7_25
Keywords: Table extraction; PDF; YOLO; DBSCAN; PaddleOCR; Ghostscript
Abstract: Digitizing paper documents is crucial as industries shift from manual to automated processes, utilizing technologies such as optical character recognition to convert data into computer-readable formats. However, extracting tables from digital documents, especially scanned PDFs, remains challenging due to their unstructured nature, unlike native PDFs which preserve structured data and metadata. This study presents a method to effectively extract tables from both native and scanned PDFs, optimizing accuracy and processing speed. The approach begins by identifying the document type, followed by the application of the projection profile method to correct images with slanted text and advanced object detection techniques to detect table columns. Information is extracted using two different software tools and then organized in the correct order to accurately reconstruct the table content. The results demonstrate that the proposed method achieves 100% accuracy, precision, and recall for native PDFs and high accuracy for scanned PDFs, with an average character-level accuracy of 96.22% and word-level accuracy of 84.14% on a dataset of bank statements from Indonesian banks. This approach significantly enhances the accuracy and efficiency of table extraction from both native and scanned PDFs, offering a robust solution that can benefit industries such as finance, healthcare, legal, government, supply chain, and retail by automating table extraction and improving operational efficiency.
Copyright: © 2025 The Author(s)
Open Access: This chapter is licensed under the terms of the Creative Commons Attribution-NonCommercial 4.0 International License (http://creativecommons.org/licenses/by-nc/4.0/), which permits any noncommercial use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license and indicate if changes were made.
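
The abstract reports separate character-level and word-level accuracy for scanned PDFs. As a rough illustration of why the word-level figure can sit well below the character-level one, the sketch below scores an OCR output against a reference string using one common definition: 1 minus the edit distance divided by the reference length. This definition is an assumption, since the landing page does not state the formula the authors used, and the function names and sample strings are hypothetical.

```python
def edit_distance(a, b):
    """Levenshtein distance between two sequences (strings or lists of words)."""
    prev = list(range(len(b) + 1))
    for i, x in enumerate(a, start=1):
        curr = [i]
        for j, y in enumerate(b, start=1):
            cost = 0 if x == y else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def accuracy(reference, hypothesis):
    """Assumed metric: 1 - edit_distance / len(reference), floored at 0."""
    if len(reference) == 0:
        return 1.0 if len(hypothesis) == 0 else 0.0
    return max(0.0, 1.0 - edit_distance(reference, hypothesis) / len(reference))


def char_accuracy(reference, hypothesis):
    """Accuracy computed over individual characters."""
    return accuracy(reference, hypothesis)


def word_accuracy(reference, hypothesis):
    """Accuracy computed over whitespace-separated words."""
    return accuracy(reference.split(), hypothesis.split())


# Hypothetical bank-statement row: the OCR output reads one digit 0 as letter O.
reference = "TRANSFER 1.250.000,00 CR"
ocr_output = "TRANSFER 1.25O.000,00 CR"
char_score = char_accuracy(reference, ocr_output)  # 1 error in 24 characters
word_score = word_accuracy(reference, ocr_output)  # 1 error in 3 words
```

In this made-up row a single misread character costs about 4% at the character level but invalidates a whole word, so the word-level score drops to about 67%. The same effect is consistent with the gap between the 96.22% and 84.14% figures quoted in the abstract.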

Volume Title: Proceedings of the 2024 Brawijaya International Conference (BIC 2024)
Series: Atlantis Advances in Applied Sciences
Publication Date: 11 November 2025
ISBN: 978-94-6463-854-7
ISSN: 3091-4442

Cite this article

TY  - CONF
AU  - Wilsen Vesakha Lymantama
AU  - Zuherman Rustam
AU  - Fevi Novkaniza
PY  - 2025
DA  - 2025/11/11
TI  - Optimizing Table Extraction from PDFs: A Dual-Approach Solution for Native and Scanned Documents
BT  - Proceedings of the 2024 Brawijaya International Conference (BIC 2024)
PB  - Atlantis Press
SP  - 337
EP  - 362
SN  - 3091-4442
UR  - https://doi.org/10.2991/978-94-6463-854-7_25
DO  - 10.2991/978-94-6463-854-7_25
ID  - Lymantama2025
ER  -