Proceedings of the 2024 Brawijaya International Conference (BIC 2024)

Optimizing Table Extraction from PDFs: A Dual-Approach Solution for Native and Scanned Documents

Authors
Wilsen Vesakha Lymantama¹,*, Zuherman Rustam¹, Fevi Novkaniza¹
¹Department of Mathematics, Universitas Indonesia, Depok, Indonesia
*Corresponding author. Email: wilsen.vesakha@ui.ac.id
Corresponding Author
Wilsen Vesakha Lymantama
Available Online 11 November 2025.
DOI
10.2991/978-94-6463-854-7_25
Keywords
Table extraction; PDF; YOLO; DBSCAN; PaddleOCR; Ghostscript
Abstract

Digitizing paper documents is crucial as industries shift from manual to automated processes, using technologies such as optical character recognition (OCR) to convert data into computer-readable formats. However, extracting tables from digital documents remains challenging, especially for scanned PDFs, which are unstructured, unlike native PDFs, which preserve structured data and metadata. This study presents a method for extracting tables from both native and scanned PDFs that is optimized for accuracy and processing speed. The approach first identifies the document type, then applies the projection profile method to correct images with slanted text and object detection techniques to detect table columns. Information is extracted using two different software tools and then arranged in the correct order to reconstruct the table content. The results show that the proposed method achieves 100% accuracy, precision, and recall for native PDFs and high accuracy for scanned PDFs, with an average character-level accuracy of 96.22% and word-level accuracy of 84.14% on a dataset of bank statements from Indonesian banks. The approach improves the accuracy and efficiency of table extraction from both native and scanned PDFs and can benefit industries such as finance, healthcare, legal, government, supply chain, and retail by automating table extraction and improving operational efficiency.
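
The skew-correction step named in the abstract can be illustrated with a small sketch. The abstract does not give the authors' implementation, so the following is only one common formulation of the projection profile method, written under stated assumptions: the page is already binarized (ink pixels are nonzero), the skew lies within a few degrees, and the sharpness of the profile is scored by the sum of squared row counts. The function names `estimate_skew` and `make_skewed_page`, the angle range, and the synthetic test page are all hypothetical and chosen for illustration.

```python
import numpy as np


def estimate_skew(binary, max_angle=5.0, steps=21):
    """Return the candidate angle (degrees) whose row profile is sharpest.

    Assumption: `binary` is a 2-D array where nonzero entries are ink pixels.
    """
    ys, xs = np.nonzero(binary)  # coordinates of foreground (ink) pixels
    if ys.size == 0:
        return 0.0  # blank page: nothing to align
    best_angle, best_score = 0.0, -1.0
    for angle in np.linspace(-max_angle, max_angle, steps):
        theta = np.deg2rad(angle)
        # Row coordinate of each ink pixel after undoing a skew of `angle`.
        rows = np.round(ys * np.cos(theta) - xs * np.sin(theta)).astype(int)
        # Horizontal projection profile: number of ink pixels per row.
        profile = np.bincount(rows - rows.min()).astype(np.float64)
        # Well-aligned text lines concentrate ink in few rows, which
        # maximizes the sum of squared counts.
        score = float(np.sum(profile ** 2))
        if score > best_score:
            best_angle, best_score = float(angle), score
    return best_angle


def make_skewed_page(angle_deg, height=140, width=200):
    """Build a synthetic page of four thin 'text lines' skewed by angle_deg."""
    page = np.zeros((height, width), dtype=np.uint8)
    xs = np.arange(width)
    slope = np.tan(np.deg2rad(angle_deg))
    for y0 in (20, 40, 60, 80):
        ys = np.round(y0 + xs * slope).astype(int)
        page[ys, xs] = 1
    return page
```

Calling `estimate_skew(make_skewed_page(3.0))` should return a value close to 3 degrees; a real pipeline would then rotate the scanned image by the negative of that angle before detecting table columns. Scoring by squared row counts is one design choice among several; the variance of the profile or the sum of squared differences between adjacent rows are common alternatives.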

Copyright
© 2025 The Author(s)
Open Access
Open Access This chapter is licensed under the terms of the Creative Commons Attribution-NonCommercial 4.0 International License (http://creativecommons.org/licenses/by-nc/4.0/), which permits any noncommercial use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license and indicate if changes were made.

Volume Title
Proceedings of the 2024 Brawijaya International Conference (BIC 2024)
Series
Atlantis Advances in Applied Sciences
Publication Date
11 November 2025
ISBN
978-94-6463-854-7
ISSN
3091-4442

Cite this article

TY  - CONF
AU  - Wilsen Vesakha Lymantama
AU  - Zuherman Rustam
AU  - Fevi Novkaniza
PY  - 2025
DA  - 2025/11/11
TI  - Optimizing Table Extraction from PDFs: A Dual-Approach Solution for Native and Scanned Documents
BT  - Proceedings of the 2024 Brawijaya International Conference (BIC 2024)
PB  - Atlantis Press
SP  - 337
EP  - 362
SN  - 3091-4442
UR  - https://doi.org/10.2991/978-94-6463-854-7_25
DO  - 10.2991/978-94-6463-854-7_25
ID  - Lymantama2025
ER  -