Intelligent Deepfake Detector Using Audio-Visual Clues
Abstract
Deepfake media are proliferating rapidly and causing significant harm. Malicious actors now use AI to create fake videos that appear increasingly realistic. Traditional detection tools often fail because they analyze audio or visual signals in isolation. This paper introduces an intelligent deepfake detection system that addresses this limitation through a novel Multi-Modal Dispersion Framework. The system identifies subtle inconsistencies by tracking how lip movements align with speech patterns. By projecting audio and visual features into a shared latent space, the model quantifies the semantic divergence between modalities. A transformer module then captures cross-modal context to detect fine-grained manipulation artifacts. Evaluated on the DFDC and FakeAVCeleb datasets, the system achieves 94.3% accuracy, demonstrating strong potential for real-time deployment. This framework offers a reliable approach to media authentication and contributes to advancing AI safety.
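The dispersion idea summarized in the abstract can be sketched roughly as follows. This is a minimal, hypothetical illustration, not the authors' actual Multi-Modal Dispersion Framework: the function names, the linear projections, and the choice of cosine distance as the divergence measure are all assumptions made for the sake of the example.

```python
import numpy as np

def project(features: np.ndarray, weights: np.ndarray) -> np.ndarray:
    """Project per-frame modality features into a shared latent space
    and L2-normalise, so divergence reduces to cosine distance."""
    z = features @ weights
    return z / np.linalg.norm(z, axis=-1, keepdims=True)

def dispersion_score(audio_feats, visual_feats, w_audio, w_visual):
    """Mean per-frame cosine distance between the projected streams.
    Aligned (genuine) audio-visual pairs should score near 0; manipulated
    videos, whose modalities disagree, drift toward higher scores."""
    z_a = project(audio_feats, w_audio)    # shape (T, d)
    z_v = project(visual_feats, w_visual)  # shape (T, d)
    cos_sim = np.sum(z_a * z_v, axis=-1)   # per-frame similarity
    return float(np.mean(1.0 - cos_sim))

# Toy demo: an identical pair of streams is perfectly "in sync", while an
# unrelated visual stream produces a clearly higher dispersion score.
rng = np.random.default_rng(0)
T, dim, d = 50, 32, 16                     # frames, feature dim, latent dim
w = rng.normal(size=(dim, d))
audio = rng.normal(size=(T, dim))
aligned = dispersion_score(audio, audio, w, w)
mismatched = dispersion_score(audio, rng.normal(size=(T, dim)), w, w)
```

In the described system, learned projections and the transformer fusion module would replace these random weights, and the resulting divergence signal would feed the final real/fake classifier.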
Cite This Article
@article{Banik2026Intelligen,
author = {Barnali Gupta Banik and Shaik Nidha Naziya},
title = {Intelligent Deepfake Detector Using Audio-Visual Clues},
journal = {ICCK Transactions on Machine Intelligence},
year = {2026},
volume = {2},
number = {2},
pages = {100-105},
doi = {10.62762/TMI.2025.601369},
url = {https://www.icck.org/article/abs/TMI.2025.601369},
abstract = {Deepfake media is growing rapidly and causing significant harm. Bad actors now use AI to create fake videos that appear increasingly realistic. Traditional detection tools often fail because they analyze audio or visual signals in isolation. This paper introduces an intelligent Deepfake Detection system that addresses this limitation through a novel Multi-Modal Dispersion Framework. The system identifies subtle inconsistencies by tracking how lip movements align with speech patterns. By projecting these features into a shared latent space, the model quantifies the semantic divergence between modalities. A transformer module then captures cross-modal context to detect fine-grained manipulation artifacts. Evaluated on the DFDC and FakeAVCeleb datasets, the system achieves 94.3\% accuracy, demonstrating strong potential for real-time deployment. This framework provides a reliable approach to media authentication and contributes to advancing AI safety.},
keywords = {deepfake detection, multi-modal dispersion, audio-visual clues, cross-modal inconsistency, lip-sync analysis, AI forensics, transformer fusion},
issn = {3068-7403},
publisher = {Institute of Central Computation and Knowledge}
}