ICCK Journal of Image Analysis and Processing, Volume 1, Issue 2, 2025: 73-95

Open Access | Review Article | 30 June 2025
A Comprehensive Survey of DeepFake Generation and Detection Techniques in Audio-Visual Media
1 Department of Computer and Software Technology, University of Swat, Swat 19130, Pakistan
2 Faculty of Computing, Universiti Malaysia Pahang Al-Sultan Abdullah, 26600 Pekan, Malaysia
3 Department of Computer Software Engineering, Military College of Signals (MCS), National University of Sciences and Technology (NUST), Rawalpindi 46000, Pakistan
* Corresponding Author: Arshad Ahmad, [email protected]
Received: 07 May 2025, Accepted: 24 June 2025, Published: 30 June 2025  
Abstract
The rapid advancement in machine learning and artificial intelligence has significantly enhanced capabilities in multimedia content creation, particularly in the domain of deepfake generation. Deepfakes leverage complex neural networks to create hyper-realistic manipulated audio-visual content, raising profound ethical, societal, and security concerns. This paper presents a comprehensive survey of contemporary trends in deepfake video research, focusing on both generation and detection methodologies. The study categorizes deepfakes into three primary types: facial manipulation, lip-synchronization, and audio deepfakes, further subdividing them into face swapping, face generation, attribute manipulation, puppeteering, speech generation, and voice conversion. For each type, the paper reviews cutting-edge generation techniques, including StyleGANs, variational autoencoders, and various speech synthesis models. It also presents an in-depth analysis of detection methods, highlighting both traditional handcrafted feature-based approaches and modern deep learning frameworks utilizing CNNs, RNNs, attention mechanisms, and hybrid transformer models. The paper evaluates these methods in terms of performance, generalization, robustness, and limitations against evolving deepfake techniques. The survey identifies significant challenges such as vulnerability to adversarial attacks, lack of generalized models, and dependency on high-quality training data. The insights provided aim to aid researchers and practitioners in developing more robust detection mechanisms and understanding the landscape of deepfake threats and countermeasures. Ultimately, this study contributes to the growing body of literature by mapping current trends and suggesting potential avenues for future research in combating deepfake proliferation.
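
To make the detection side of the survey concrete, the sketch below shows the kind of frame-level CNN classifier that many of the reviewed deep learning detectors build on: a pretrained image backbone whose final layer is replaced with a single real-versus-fake logit. This is an illustrative example only, not a method proposed in this paper; it assumes PyTorch and torchvision are installed, and the class name FrameDeepfakeDetector is hypothetical.

    import torch
    import torch.nn as nn
    from torchvision import models, transforms

    class FrameDeepfakeDetector(nn.Module):
        """Hypothetical binary real/fake classifier over single video frames."""
        def __init__(self):
            super().__init__()
            # ImageNet-pretrained backbone; the classification head is
            # replaced with one logit for the real-vs-fake decision.
            self.backbone = models.resnet18(weights=models.ResNet18_Weights.DEFAULT)
            self.backbone.fc = nn.Linear(self.backbone.fc.in_features, 1)

        def forward(self, frames):           # frames: (batch, 3, 224, 224)
            return self.backbone(frames)     # logits: (batch, 1)

    # Standard ImageNet preprocessing applied to extracted face crops.
    preprocess = transforms.Compose([
        transforms.Resize((224, 224)),
        transforms.ToTensor(),
        transforms.Normalize(mean=[0.485, 0.456, 0.406],
                             std=[0.229, 0.224, 0.225]),
    ])

    model = FrameDeepfakeDetector().eval()
    with torch.no_grad():
        dummy_frames = torch.randn(4, 3, 224, 224)      # stand-in for real frames
        fake_prob = torch.sigmoid(model(dummy_frames))  # per-frame P(fake)
    print(fake_prob.shape)  # torch.Size([4, 1])

The systems reviewed in the survey typically extend such a per-frame baseline with temporal models (e.g., RNNs over frame features) or attention and transformer blocks, and train on benchmark datasets such as FaceForensics++ or DFDC.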

Graphical Abstract
A Comprehensive Survey of DeepFake Generation and Detection Techniques in Audio-Visual Media

Keywords
DeepFake
deep learning
facial manipulations
puppeteering
lip-synchronization
image processing

Data Availability Statement
Not applicable.

Funding
This work received no external funding.

Conflicts of Interest
The authors declare no conflicts of interest.

Ethical Approval and Consent to Participate
Not applicable.


Cite This Article
APA Style
Khan, I., Khan, K., & Ahmad, A. (2025). A Comprehensive Survey of DeepFake Generation and Detection Techniques in Audio-Visual Media. ICCK Journal of Image Analysis and Processing, 1(2), 73–95. https://doi.org/10.62762/JIAP.2025.431672

Publisher's Note
ICCK stays neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Rights and Permissions
Copyright © 2025 by the Author(s). Published by the Institute of Central Computation and Knowledge. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/), which permits use, sharing, adaptation, distribution, and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made.
ICCK Journal of Image Analysis and Processing

ISSN: 3068-6679 (Online)

Email: [email protected]

All published articles are preserved permanently in Portico:
https://www.portico.org/publishers/icck/