An Ensemble Machine Learning Framework for Malicious PDF Detection Using Static and Structural Features
- DOI
- 10.2991/978-94-6239-664-7_62
- Keywords
- PDF malware detection; Static features; Random Forest; Ensemble learning; Explainable AI
- Abstract
Identifying malicious PDF files is crucial for cybersecurity because attackers increasingly exploit the flexible structure and embedded content of PDFs to circumvent signature-based defenses. This work frames detection as a binary classification task, applying interpretable machine learning to static structural and metadata indicators to distinguish malicious from benign PDFs. A curated pipeline resolves mixed-type entries, cleans and harmonizes fields, and retains 19 features (such as stream markers, encryption flags, cross-reference table length, and the presence of actions and scripts like JS, Javascript, and OpenAction). On a corpus of 10,026 labeled PDFs (CIC Evasive PDFMal2022), of which 9,708 samples remain usable after preprocessing, the study assesses Gaussian Naive Bayes, Decision Tree, Random Forest, and Logistic Regression. Experiments using stratified splits and randomized hyperparameter tuning show that the optimized Random Forest achieves 99.48% test accuracy, surpassing both the classical baselines and a deep learning (MLP) benchmark (98.76%). To address transparency, SHAP (SHapley Additive exPlanations) analysis is integrated, confirming that structural features and JavaScript presence drive the detection logic. The proposed framework offers a repeatable, highly accurate, and explainable solution suitable for operational PDF triage.
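The abstract does not say how the static indicators are extracted, so the following is only a minimal sketch of one common approach: counting structural keywords in the raw bytes of a file. The function name `count_markers`, the `MARKERS` table, and the sample document are hypothetical and chosen for illustration; they echo the indicators named in the abstract and are not the paper's actual 19 features.

```python
import re

# Hypothetical marker table (an assumption, not the paper's feature set).
# Each pattern matches a PDF keyword as a whole token in the raw bytes.
MARKERS = {
    "js": rb"/JS\b",
    "javascript": rb"/JavaScript\b",
    "open_action": rb"/OpenAction\b",
    "encrypt": rb"/Encrypt\b",
    # Negative lookbehinds keep "endstream" and "startxref" from being
    # counted as "stream" and "xref".
    "stream": rb"(?<!end)stream\b",
    "endstream": rb"endstream\b",
    "xref": rb"(?<!start)xref\b",
}


def count_markers(pdf_bytes: bytes) -> dict[str, int]:
    """Return the number of occurrences of each marker in the raw bytes."""
    return {
        name: len(re.findall(pattern, pdf_bytes))
        for name, pattern in MARKERS.items()
    }


# A tiny hand-written document used only to exercise the function.
sample = (
    b"%PDF-1.4\n"
    b"1 0 obj\n<< /Type /Catalog /OpenAction 2 0 R >>\nendobj\n"
    b"2 0 obj\n<< /S /JavaScript /JS (app.alert(1)) >>\nendobj\n"
    b"3 0 obj\n<< /Length 5 >>\nstream\nhello\nendstream\nendobj\n"
    b"xref\n0 4\ntrailer\n<< /Root 1 0 R >>\nstartxref\n0\n%%EOF\n"
)

features = count_markers(sample)
print(features)
```

Counts like these could form one row of a feature table for a classifier. A plain byte scan of this kind would miss keywords hidden inside compressed streams or written with escaped names, so it should be read as a starting point rather than a complete extractor.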
- Copyright
- © 2026 The Author(s)
- Open Access
- Open Access This chapter is licensed under the terms of the Creative Commons Attribution-NonCommercial 4.0 International License (http://creativecommons.org/licenses/by-nc/4.0/), which permits any noncommercial use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license and indicate if changes were made.
Cite this article
TY  - CONF
AU  - Ashaf Uddaula
AU  - Mahmudul Hasan
AU  - Dip Sarker
AU  - Nafisa Tasneem Esha
AU  - Md Sabbir Hosen Hamim
AU  - Sadrul Amin
PY  - 2026
DA  - 2026/06/08
TI  - An Ensemble Machine Learning Framework for Malicious PDF Detection Using Static and Structural Features
BT  - Proceedings of the International Conference on Intelligent Data Analysis and Applications (IDAA 2025)
PB  - Atlantis Press
SP  - 906
EP  - 918
SN  - 1951-6851
UR  - https://doi.org/10.2991/978-94-6239-664-7_62
DO  - 10.2991/978-94-6239-664-7_62
ID  - Uddaula2026
ER  -