dc.description.abstract |
Anomaly detection is an integral part of many surveillance applications. However, most existing anomaly detection models are statically trained on pre-recorded data from a single source and therefore make multiple assumptions about the surrounding environment. As a result, their usefulness is limited to controlled scenarios. In this paper, we fuse information from live streams of audio and video data to detect anomalies in the captured environment. We train a deep learning-based teacher-student network using video, image, and audio information. The pre-trained visual network in the teacher model distills its knowledge to the image and audio networks in the student model. Features from the image and audio networks are combined and compressed using principal component analysis. Thus, the teacher-student network produces an image-audio-based lightweight joint representation of the data. The data dynamics are learned with a multivariate adaptive Gaussian mixture model. Empirical results from two audio-visual datasets demonstrate the effectiveness of the joint representation over single modalities in the adaptive anomaly detection framework. The proposed framework outperforms the state-of-the-art methods by an average of 15.00% and 14.52% in AUC values for dataset 1 and dataset 2, respectively. |
en_US |
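
A minimal illustrative sketch of the pipeline described in the abstract is shown below: image and audio feature vectors are concatenated into a joint representation, compressed with principal component analysis, and scored with a Gaussian mixture model so that low-likelihood frames are flagged as anomalous. This is not the authors' implementation; the feature arrays, dimensionalities, mixture size, and threshold are hypothetical, and scikit-learn's static GaussianMixture stands in for the adaptive mixture model the paper describes.

```python
# Illustrative sketch only (assumed names and parameters), not the paper's code.
import numpy as np
from sklearn.decomposition import PCA
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(0)

# Hypothetical student-network outputs: one feature vector per video frame.
image_feats = rng.normal(size=(1000, 512))   # image-branch embeddings
audio_feats = rng.normal(size=(1000, 128))   # audio-branch embeddings

# Joint representation: concatenate modalities, then compress with PCA.
joint = np.hstack([image_feats, audio_feats])
pca = PCA(n_components=64).fit(joint)
joint_compressed = pca.transform(joint)

# Model the data with a multivariate Gaussian mixture
# (a static stand-in for the adaptive GMM in the paper).
gmm = GaussianMixture(n_components=5, covariance_type="full", random_state=0)
gmm.fit(joint_compressed)

# Frames with low likelihood under the mixture are flagged as anomalous.
scores = gmm.score_samples(joint_compressed)
threshold = np.percentile(scores, 5)          # e.g. bottom 5% of likelihoods
anomalies = np.where(scores < threshold)[0]
print(f"{len(anomalies)} frames flagged as anomalous")
```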