Optimizing Table Extraction from PDFs: A Dual-Approach Solution for Native and Scanned Documents
- DOI
- 10.2991/978-94-6463-854-7_25
- Keywords
- Table extraction; PDF; YOLO; DBSCAN; PaddleOCR; Ghostscript
- Abstract
Digitizing paper documents is crucial as industries shift from manual to automated processes, with technologies such as optical character recognition converting data into computer-readable formats. However, extracting tables from digital documents remains challenging, especially for scanned PDFs, which are unstructured, unlike native PDFs, which preserve structured data and metadata. This study presents a method to extract tables from both native and scanned PDFs while optimizing accuracy and processing speed. The approach begins by identifying the document type, followed by the application of the projection profile method to correct images with slanted text and of object detection techniques to detect table columns. Information is extracted using two different software tools and then arranged in the correct order to reconstruct the table content. The results demonstrate that the proposed method achieves 100% accuracy, precision, and recall for native PDFs and high accuracy for scanned PDFs, with an average character-level accuracy of 96.22% and word-level accuracy of 84.14% on a dataset of bank statements from Indonesian banks. The approach improves the accuracy and efficiency of table extraction from both native and scanned PDFs, offering a robust solution that can benefit industries such as finance, healthcare, legal, government, supply chain, and retail by automating table extraction and improving operational efficiency.
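The abstract names the projection profile method for correcting slanted text but does not show it. The following is a minimal sketch of the general idea only, not the authors' implementation: for each candidate angle, ink pixels are projected onto rows, and the angle whose row histogram is most concentrated is taken as the skew. The function names, the sum-of-squares score, the angle range, and the synthetic test data are all assumptions made for illustration.

```python
import math

def profile_score(points, angle_deg):
    """Score how sharply ink pixels line up into rows at a candidate angle."""
    theta = math.radians(angle_deg)
    sin_t, cos_t = math.sin(theta), math.cos(theta)
    row_counts = {}
    for x, y in points:
        # Row index of the pixel after undoing a skew of angle_deg.
        row = round(y * cos_t - x * sin_t)
        row_counts[row] = row_counts.get(row, 0) + 1
    # Sum of squares is largest when pixels concentrate in few rows
    # (an assumed scoring choice; variance of the profile is another).
    return sum(c * c for c in row_counts.values())

def estimate_skew(points, max_angle=10.0, step=0.25):
    """Return the candidate angle (degrees) with the sharpest profile."""
    best_angle, best_score = 0.0, -1
    steps = int(round(2 * max_angle / step))
    for i in range(steps + 1):
        angle = -max_angle + i * step
        score = profile_score(points, angle)
        if score > best_score:
            best_angle, best_score = angle, score
    return best_angle

def make_skewed_lines(skew_deg, width=400, baselines=(20, 50, 80, 110)):
    """Hypothetical test data: thin 'text lines' drawn at a known skew."""
    slope = math.tan(math.radians(skew_deg))
    return [(x, round(y0 + x * slope)) for y0 in baselines for x in range(width)]

points = make_skewed_lines(3.0)
estimated = estimate_skew(points)
print(f"estimated skew: {estimated:.2f} degrees")
```

Here `points` is a list of `(x, y)` pixel coordinates of dark pixels, and the returned angle is the slope of the text lines in pixel coordinates, so rotating the image by the opposite angle would straighten it. A real pipeline would binarize a page image first and would likely use a coarse-to-fine angle search to keep the cost down.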
- Copyright
- © 2025 The Author(s)
- Open Access
- Open Access This chapter is licensed under the terms of the Creative Commons Attribution-NonCommercial 4.0 International License (http://creativecommons.org/licenses/by-nc/4.0/), which permits any noncommercial use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license and indicate if changes were made.
Cite this article
TY - CONF
AU - Wilsen Vesakha Lymantama
AU - Zuherman Rustam
AU - Fevi Novkaniza
PY - 2025
DA - 2025/11/11
TI - Optimizing Table Extraction from PDFs: A Dual-Approach Solution for Native and Scanned Documents
BT - Proceedings of the 2024 Brawijaya International Conference (BIC 2024)
PB - Atlantis Press
SP - 337
EP - 362
SN - 3091-4442
UR - https://doi.org/10.2991/978-94-6463-854-7_25
DO - 10.2991/978-94-6463-854-7_25
ID - Lymantama2025
ER -