Proceedings of the International Conference on Intelligent Data Analysis and Applications (IDAA 2025)

Dhaka, Bangladesh, 12-13 December 2025

An Ensemble Machine Learning Framework for Malicious PDF Detection Using Static and Structural Features

Authors
Ashaf Uddaula¹, Mahmudul Hasan¹,*, Dip Sarker², Nafisa Tasneem Esha¹, Md Sabbir Hosen Hamim¹, Sadrul Amin¹
¹ Department of Computer Science and Engineering, Daffodil International University, Dhaka, 1216, Bangladesh
² Department of Computer Science, American International University-Bangladesh, Dhaka, Bangladesh
*Corresponding author. Email: hasan23105101093@diu.edu.bd
Corresponding Author
Mahmudul Hasan
Available Online 8 June 2026.
DOI
10.2991/978-94-6239-664-7_62
Keywords
PDF malware detection; Static features; Random Forest; Ensemble learning; Explainable AI
Abstract

Identifying malicious PDF files is crucial for cybersecurity because attackers increasingly exploit the flexible structure and embedded content of PDFs to circumvent signature-based defenses. This work formulates the distinction between malicious and benign PDFs as a binary classification task, using interpretable machine learning over static structural and metadata indicators. A curated pipeline resolves mixed-type entries, cleans and harmonizes fields, and retains 19 features (such as stream markers, encryption flags, cross-reference table length, and the presence of actions and scripts such as JS, Javascript, and OpenAction). The study evaluates Gaussian Naive Bayes, Decision Tree, Random Forest, and Logistic Regression on a corpus of 10,026 labeled PDFs (CIC Evasive PDFMal2022), of which 9,708 samples remain usable after preprocessing. Experiments using stratified splits and randomized hyperparameter tuning show that the optimized Random Forest achieves 99.48% test accuracy, surpassing both the classical baselines and a deep learning (MLP) benchmark (98.76%). To address transparency, SHAP (SHapley Additive exPlanations) analysis is integrated, confirming that structural features and JavaScript presence drive the detection logic. The proposed framework offers a repeatable, highly accurate, and explainable solution suitable for operational PDF triage.
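To make the idea of static indicators concrete, the following is a minimal sketch of how a few of the features named in the abstract (stream markers, an encryption flag, and the JS, Javascript, and OpenAction keywords) could be counted from raw PDF bytes. It is an illustration only and not the authors' extraction pipeline: the function name `extract_static_features`, the pattern table, and the toy sample are all hypothetical, and the sketch ignores obfuscated names (for example hex-escaped name characters) and compressed object streams, which a practical extractor would need to handle.

```python
import re

# Hypothetical indicator table covering only a handful of the 19 features.
# Each pattern matches a PDF keyword that is not followed by another
# alphanumeric character, so "/JS" does not match a longer name.
PATTERNS = {
    "JS": re.compile(rb"/JS(?![A-Za-z0-9])"),
    "Javascript": re.compile(rb"/JavaScript(?![A-Za-z0-9])"),
    "OpenAction": re.compile(rb"/OpenAction(?![A-Za-z0-9])"),
    # Count "stream" markers but skip the "stream" inside "endstream".
    "stream": re.compile(rb"(?<!end)stream(?![A-Za-z0-9])"),
    "encrypt": re.compile(rb"/Encrypt(?![A-Za-z0-9])"),
}


def extract_static_features(data: bytes) -> dict:
    """Return keyword counts for the raw bytes of one PDF file."""
    return {name: len(pattern.findall(data)) for name, pattern in PATTERNS.items()}


# Toy, hand-written PDF fragment used only to exercise the function.
sample = (
    b"%PDF-1.4\n"
    b"1 0 obj\n<< /Type /Catalog /OpenAction 2 0 R >>\nendobj\n"
    b"2 0 obj\n<< /S /JavaScript /JS (app.alert(1);) >>\nendobj\n"
    b"3 0 obj\n<< /Length 5 >>\nstream\nhello\nendstream\nendobj\n"
    b"%%EOF\n"
)

features = extract_static_features(sample)
print(features)
```

Counts of this kind form one row of a tabular dataset per file, which is the shape of input that classifiers such as Random Forest or Logistic Regression consume.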

Copyright
© 2026 The Author(s)
Open Access
This chapter is licensed under the terms of the Creative Commons Attribution-NonCommercial 4.0 International License (http://creativecommons.org/licenses/by-nc/4.0/), which permits any noncommercial use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license and indicate if changes were made.

Volume Title
Proceedings of the International Conference on Intelligent Data Analysis and Applications (IDAA 2025)
Series
Advances in Intelligent Systems Research
Publication Date
8 June 2026
ISBN
978-94-6239-664-7
ISSN
1951-6851

Cite this article

TY  - CONF
AU  - Ashaf Uddaula
AU  - Mahmudul Hasan
AU  - Dip Sarker
AU  - Nafisa Tasneem Esha
AU  - Md Sabbir Hosen Hamim
AU  - Sadrul Amin
PY  - 2026
DA  - 2026/06/08
TI  - An Ensemble Machine Learning Framework for Malicious PDF Detection Using Static and Structural Features
BT  - Proceedings of the International Conference on Intelligent Data Analysis and Applications (IDAA 2025)
PB  - Atlantis Press
SP  - 906
EP  - 918
SN  - 1951-6851
UR  - https://doi.org/10.2991/978-94-6239-664-7_62
DO  - 10.2991/978-94-6239-664-7_62
ID  - Uddaula2026
ER  -