Intelligent Deepfake Detector Using Audio-Visual Clues
Abstract
Deepfake media are proliferating rapidly and causing significant harm. Malicious actors now use AI to create fake videos that appear increasingly realistic. Traditional detection tools often fail because they analyze audio or visual signals in isolation. This paper introduces an intelligent deepfake detection system that addresses this limitation through a novel Multi-Modal Dispersion Framework. The system identifies subtle inconsistencies by tracking how lip movements align with speech patterns. By projecting audio and visual features into a shared latent space, the model quantifies the semantic divergence between modalities. A transformer module then captures cross-modal context to detect fine-grained manipulation artifacts. Evaluated on the DFDC and FakeAVCeleb datasets, the system achieves 94.3% accuracy, demonstrating strong potential for real-time deployment. This framework offers a reliable approach to media authentication and contributes to advancing AI safety.
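The dispersion idea summarized in the abstract can be sketched roughly as follows. This is a minimal, hypothetical illustration, not the authors' actual Multi-Modal Dispersion Framework: the function names, the linear projections, and the choice of cosine distance as the divergence measure are all assumptions made for the sake of the example.

```python
import numpy as np

def project(features: np.ndarray, weights: np.ndarray) -> np.ndarray:
    """Project per-frame modality features into a shared latent space
    and L2-normalise, so divergence reduces to cosine distance."""
    z = features @ weights
    return z / np.linalg.norm(z, axis=-1, keepdims=True)

def dispersion_score(audio_feats, visual_feats, w_audio, w_visual):
    """Mean per-frame cosine distance between the projected streams.
    Aligned (genuine) audio-visual pairs should score near 0; manipulated
    videos, whose modalities disagree, drift toward higher scores."""
    z_a = project(audio_feats, w_audio)    # shape (T, d)
    z_v = project(visual_feats, w_visual)  # shape (T, d)
    cos_sim = np.sum(z_a * z_v, axis=-1)   # per-frame similarity
    return float(np.mean(1.0 - cos_sim))

# Toy demo: an identical pair of streams is perfectly "in sync", while an
# unrelated visual stream produces a clearly higher dispersion score.
rng = np.random.default_rng(0)
T, dim, d = 50, 32, 16                     # frames, feature dim, latent dim
w = rng.normal(size=(dim, d))
audio = rng.normal(size=(T, dim))
aligned = dispersion_score(audio, audio, w, w)
mismatched = dispersion_score(audio, rng.normal(size=(T, dim)), w, w)
```

In the described system, learned projections and the transformer fusion module would replace these random weights, and the resulting divergence signal would feed the final real/fake classifier.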
Cite This Article
@article{Banik2026Intelligen,
author = {Barnali Gupta Banik and Shaik Nidha Naziya},
title = {Intelligent Deepfake Detector Using Audio-Visual Clues},
journal = {ICCK Transactions on Machine Intelligence},
year = {2026},
volume = {2},
number = {2},
pages = {100-105},
doi = {10.62762/TMI.2025.601369},
url = {https://www.icck.org/article/abs/TMI.2025.601369},
abstract = {Deepfake media is growing rapidly and causing significant harm. Bad actors now use AI to create fake videos that appear increasingly realistic. Traditional detection tools often fail because they analyze audio or visual signals in isolation. This paper introduces an intelligent Deepfake Detection system that addresses this limitation through a novel Multi-Modal Dispersion Framework. The system identifies subtle inconsistencies by tracking how lip movements align with speech patterns. By projecting these features into a shared latent space, the model quantifies the semantic divergence between modalities. A transformer module then captures cross-modal context to detect fine-grained manipulation artifacts. Evaluated on the DFDC and FakeAVCeleb datasets, the system achieves 94.3\% accuracy, demonstrating strong potential for real-time deployment. This framework provides a reliable approach to media authentication and contributes to advancing AI safety.},
keywords = {deepfake detection, multi-modal dispersion, audio-visual clues, cross-modal inconsistency, lip-sync analysis, AI forensics, transformer fusion},
issn = {3068-7403},
publisher = {Institute of Central Computation and Knowledge}
}