Multimodal Deepfake Detection using Multi-Scale Transformers – A Detailed Review
- DOI
- 10.2991/978-94-6239-616-6_39
- Keywords
- Multimodal Deepfake Detection; Multi-Scale Transformers; Cross-Modal Analysis; Synthetic Media; Audio-Visual Forensics
- Abstract
This paper reviews recent progress in multimodal deepfake detection with an emphasis on multi-scale transformer architectures. It examines the challenges of detecting manipulations across both visual and audio modalities, focusing on cross-modal inconsistencies and synchronization issues. Approaches such as multi-scale attention, hybrid CNN-transformer models, and multimodal fusion are analyzed. Benchmark datasets, including FaceForensics++, Celeb-DF, DFDC, and FakeAVCeleb, are discussed for training and evaluation. The study highlights the limitations of CNN-based methods while demonstrating the advantages of transformers in capturing spatial, temporal, and auditory cues. Finally, it outlines future directions for robust and scalable deepfake detection.
- Copyright
- © 2026 The Author(s)
- Open Access
- Open Access This chapter is licensed under the terms of the Creative Commons Attribution-NonCommercial 4.0 International License (http://creativecommons.org/licenses/by-nc/4.0/), which permits any noncommercial use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license and indicate if changes were made.
Cite this article
TY  - CONF
AU  - S. K. Vishal
AU  - S. Kanmani
PY  - 2026
DA  - 2026/03/31
TI  - Multimodal Deepfake Detection using Multi-Scale Transformers – A Detailed Review
BT  - Proceedings of the International Conference on Artificial Intelligence and Secure Data Analytics (ICAISDA 2025)
PB  - Atlantis Press
SP  - 518
EP  - 530
SN  - 1951-6851
UR  - https://doi.org/10.2991/978-94-6239-616-6_39
DO  - 10.2991/978-94-6239-616-6_39
ID  - Vishal2026
ER  -