An Ensemble Machine Learning Framework for Malicious PDF Detection Using Static and Structural Features

Ashaf Uddaula; Mahmudul Hasan; Dip Sarker; Nafisa Tasneem Esha; Md Sabbir Hosen Hamim; Sadrul Amin

doi:10.2991/978-94-6239-664-7_62

Authors

Ashaf Uddaula¹, Mahmudul Hasan¹,*, Dip Sarker², Nafisa Tasneem Esha¹, Md Sabbir Hosen Hamim¹, Sadrul Amin¹

¹Department of Computer Science and Engineering, Daffodil International University, Dhaka, 1216, Bangladesh

²Department of Computer Science, American International University-Bangladesh, Dhaka, Bangladesh

*Corresponding author. Email: hasan23105101093@diu.edu.bd

Available Online 8 June 2026.

DOI: 10.2991/978-94-6239-664-7_62
Keywords: PDF malware detection; Static features; Random Forest; Ensemble learning; Explainable AI
Abstract: Identifying malicious PDF files is crucial for cybersecurity because attackers increasingly exploit the flexible structure and embedded content of PDFs to circumvent signature-based defenses. This work formulates the distinction between malicious and benign PDFs as a binary classification task, applying interpretable machine learning to static structural and metadata indicators. A curated pipeline resolves mixed-type entries, cleans and harmonizes fields, and retains 19 features (such as stream markers, encryption flags, cross-reference table length, and the presence of action and script indicators like JS, Javascript, and OpenAction). The corpus contains 10,026 labeled PDFs (CIC Evasive PDFMal2022), of which 9,708 remain usable after preprocessing. On this data the study assesses Gaussian Naive Bayes, Decision Tree, Random Forest, and Logistic Regression. Experiments using stratified splits and randomized hyperparameter tuning show that the optimized Random Forest achieves 99.48% test accuracy, surpassing both the classical baselines and a deep learning (MLP) benchmark (98.76%). To address transparency, SHAP (SHapley Additive exPlanations) analysis is integrated, confirming that structural features and JavaScript presence drive the detection logic. The proposed framework offers a repeatable, highly accurate, and explainable solution suitable for operational PDF triage.
Copyright: © 2026 The Author(s)
Open Access: Open Access This chapter is licensed under the terms of the Creative Commons Attribution-NonCommercial 4.0 International License (http://creativecommons.org/licenses/by-nc/4.0/), which permits any noncommercial use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license and indicate if changes were made.
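
The static indicators named in the abstract (stream markers, encryption flags, JS, Javascript, OpenAction) can be illustrated with a minimal sketch that counts structural keywords in the raw bytes of a file, without parsing or executing it. This is not the authors' pipeline: the keyword subset, the function name extract_static_features, the extra pdf_size field, and the sample bytes are assumptions chosen for illustration, and the abstract does not enumerate all 19 features.

```python
import re

# Hypothetical subset of static indicators; names loosely follow the
# abstract (JS, Javascript, OpenAction) but the full feature list is assumed.
KEYWORDS = {
    "JS": rb"/JS\b",
    "Javascript": rb"/JavaScript\b",
    "OpenAction": rb"/OpenAction\b",
    "encrypt": rb"/Encrypt\b",
    "stream": rb"(?<!end)stream\b",   # skip the tail of "endstream"
    "endstream": rb"endstream\b",
    "xref": rb"(?<!start)xref\b",     # skip the tail of "startxref"
}


def extract_static_features(data: bytes) -> dict:
    """Count structural keywords in raw PDF bytes (no parsing, no execution)."""
    features = {
        name: len(re.findall(pattern, data))
        for name, pattern in KEYWORDS.items()
    }
    features["pdf_size"] = len(data)  # assumed extra metadata field
    return features


# Tiny hand-written sample; not a valid, renderable PDF.
sample = (
    b"%PDF-1.4\n"
    b"1 0 obj\n<< /Type /Catalog /OpenAction 2 0 R >>\nendobj\n"
    b"2 0 obj\n<< /S /JavaScript /JS (app.alert(1)) >>\nendobj\n"
    b"3 0 obj\n<< /Length 5 >>\nstream\nhello\nendstream\nendobj\n"
    b"xref\n0 4\ntrailer\n<< /Root 1 0 R >>\nstartxref\n0\n%%EOF\n"
)

features = extract_static_features(sample)
for name, value in features.items():
    print(f"{name}: {value}")
```

Raw keyword counting of this kind is cheap but can miss names obfuscated with hex escapes (for example, #4A for J), so a production extractor would likely need to normalize names before counting.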

Volume Title: Proceedings of the International Conference on Intelligent Data Analysis and Applications (IDAA 2025)
Series: Advances in Intelligent Systems Research
Publication Date: 8 June 2026
ISBN: 978-94-6239-664-7
ISSN: 1951-6851

Cite this article

TY  - CONF
AU  - Ashaf Uddaula
AU  - Mahmudul Hasan
AU  - Dip Sarker
AU  - Nafisa Tasneem Esha
AU  - Md Sabbir Hosen Hamim
AU  - Sadrul Amin
PY  - 2026
DA  - 2026/06/08
TI  - An Ensemble Machine Learning Framework for Malicious PDF Detection Using Static and Structural Features
BT  - Proceedings of the International Conference on Intelligent Data Analysis and Applications (IDAA 2025)
PB  - Atlantis Press
SP  - 906
EP  - 918
SN  - 1951-6851
UR  - https://doi.org/10.2991/978-94-6239-664-7_62
DO  - 10.2991/978-94-6239-664-7_62
ID  - Uddaula2026
ER  -