Abstract:
We propose detecting deepfake videos based on the dissimilarity
between the audio and visual modalities, which we term the Modality
Dissonance Score (MDS). We hypothesize that manipulating either
modality leads to disharmony between the two, e.g., loss of lip-sync or unnatural facial and lip movements.
MDS is computed as an aggregate of dissimilarity scores between
audio and visual segments in a video. Discriminative features are
learnt for the audio and visual channels in a chunk-wise manner,
employing the cross-entropy loss for individual modalities, and a
contrastive loss that models inter-modality similarity. Extensive
experiments on the DFDC and DeepFake-TIMIT datasets show
that our approach outperforms the state of the art by up to 7%. We
also demonstrate temporal forgery localization and show how our
technique identifies the manipulated video segments.
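
As an illustrative sketch only (the notation is ours and the exact formulation may differ from the one defined in the paper): with chunk-wise audio and visual embeddings $f_a^{(t)}$ and $f_v^{(t)}$ for chunks $t = 1, \dots, T$, an aggregate dissonance score and a margin-based contrastive objective could take the form

% illustrative sketch; symbols and form are assumptions, not the paper's exact equations
\begin{align}
  \mathrm{MDS} &= \sum_{t=1}^{T} d_t, \qquad
  d_t = \bigl\lVert f_a^{(t)} - f_v^{(t)} \bigr\rVert_2, \\
  \mathcal{L}_{\mathrm{contrast}} &= \frac{1}{T} \sum_{t=1}^{T}
  \Bigl[\, y \, d_t^{2} + (1 - y)\, \max\bigl(m - d_t,\, 0\bigr)^{2} \Bigr],
\end{align}

where $y = 1$ for pristine (matched) videos, $y = 0$ for fakes, and $m$ is a margin; a combined training loss would then add the per-modality cross-entropy terms, e.g. $\mathcal{L} = \mathcal{L}_{\mathrm{contrast}} + \lambda_a \mathcal{L}_{\mathrm{CE}}^{a} + \lambda_v \mathcal{L}_{\mathrm{CE}}^{v}$.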