dc.description.abstract |
Surveillance is the process of continuously monitoring an area, a person, or a group
of people in order to gather meaningful information about ongoing activity. Video-based
surveillance (such as CCTV cameras) is the most prevalent form of surveillance, but CCTV
cameras are severely affected by occlusion and illumination. Moreover, video-based
surveillance applications require abundant storage, computing, and processing power. The
proliferation of microelectromechanical sensors has made computing operations
ubiquitous, mobile, and resilient. Consequently, sensors are an indispensable component
of security, surveillance, and reconnaissance applications.
Typically, surveillance is divided into two categories, viz., (i) active and (ii) passive
(device-free). In active surveillance, a device is attached to the target's body to collect
data. Intruders, however, cannot be expected to wear or carry devices that aid the
surveillance system, so active techniques are not suitable for security applications.
Device-free sensing techniques, on the other hand, are better suited for security
applications: they infer the changes caused by a target in the surrounding environment
using different intrinsic traits. A target's intrinsic traits are classified as either static
or dynamic. Static traits, such as weight, shape, scent, reflectivity, and attenuation, are
always present in a target regardless of its activities. Dynamic traits are generated when
a target engages in an activity; for example, footstep vibration and sound are generated
when a person walks or speaks.
Numerous device-free sensing techniques are available, but we use only seismic and audio
sensing in our work. Audio and seismic sensors are non-intrusive, inexpensive, and easy
to install. Moreover, they are immune to changes in temperature, wind, and lighting, and
applications based on the audio or seismic modality require less storage, computing, and
processing power. This thesis focuses on localization and activity recognition for a single
human target in an outdoor environment using audio and seismic sensors.
First, we localize a human target using only seismic sensors. The sensors used in this
work are identical. Moreover, seismic sensors are omnidirectional, and their sensing range
is assumed to be circular. The intersection point of three such circles can therefore be
treated as the estimated target location, with the required circle parameters computed
using either a regression model or an energy attenuation model. When we deployed this
approach in real-life settings, however, we found that the circles may or may not
intersect, so we proposed a heuristic for that case. The solutions offered by the
heuristic, though, showed high localization error.
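For illustration, the circle-intersection step can be posed as a small linear least-squares
problem: subtracting the circle equations pairwise eliminates the quadratic terms. The
sketch below is ours, not the thesis code, and the sensor layout and radii are
hypothetical; least squares also degrades gracefully when the circles do not intersect
exactly, which is the situation the heuristic had to handle.

    import numpy as np

    def trilaterate(centers, radii):
        """Estimate (x, y) from three sensor circles via linearized least squares.

        Subtracting circle i's equation from circle 0's removes the quadratic
        terms, leaving a linear system A @ [x, y] = b.
        """
        (x0, y0), r0 = centers[0], radii[0]
        A, b = [], []
        for (xi, yi), ri in zip(centers[1:], radii[1:]):
            A.append([2 * (xi - x0), 2 * (yi - y0)])
            b.append(r0**2 - ri**2 + xi**2 - x0**2 + yi**2 - y0**2)
        # Least squares tolerates circles that do not intersect exactly.
        est, *_ = np.linalg.lstsq(np.array(A), np.array(b), rcond=None)
        return est

    # Hypothetical sensor positions (meters) and regression-estimated radii.
    centers = [(0.0, 0.0), (18.0, 0.0), (9.0, 18.0)]
    radii = [10.0, 11.5, 12.2]
    print(trilaterate(centers, radii))  # approximate target location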
We therefore replace the proposed heuristic with audio direction information. In this
approach, the target distance is computed using regression, and audio direction
information is fused with it to localize the target. Because audio directions may be
missing in our experiments, we propose a mathematical model for estimating missing audio
directions, based on the assumption that at least one audio sensor has captured the target
direction. The missing-direction estimator uses the estimated target distance, so an
erroneous distance estimate makes the angle estimates erroneous as well.
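One plausible geometric reading of this estimator, sketched below with hypothetical sensor
positions and values: the sensor that did capture the direction, together with the
regression-estimated distance, fixes a candidate target position, and the missing bearing
is the angle from the other sensor to that position. This is an illustrative
interpretation, not the exact model from the thesis.

    import numpy as np

    def estimate_missing_bearing(known_sensor, known_bearing_deg, known_dist,
                                 missing_sensor):
        """Estimate the bearing at a sensor whose audio direction was missed.

        The known bearing and distance fix a candidate target position; the
        missing bearing is the angle from the other sensor to that position.
        """
        theta = np.radians(known_bearing_deg)
        target = np.array(known_sensor) + known_dist * np.array(
            [np.cos(theta), np.sin(theta)])
        dx, dy = target - np.array(missing_sensor)
        return np.degrees(np.arctan2(dy, dx)) % 360.0

    # Hypothetical values: the sensor at the origin heard the target at 40
    # degrees, seismic regression put it 12 m away; the sensor at (20, 0)
    # missed it. An erroneous distance shifts the candidate position, which
    # is why distance errors propagate into the angle estimate.
    print(estimate_missing_bearing((0.0, 0.0), 40.0, 12.0, (20.0, 0.0)))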
As an alternative, we perform an early fusion of the audio and seismic modalities for
location estimation. This approach employs multiple audio-seismic features and
multi-output regression to localize a target. Extensive experiments show promising
localization results, with an error of 0.735 in an area of 324 m².
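As a rough illustration of the early-fusion formulation, the sketch below trains a
multi-output regressor to map a fused feature vector to the (x, y) position. The feature
matrix is synthetic, the 24-feature dimensionality and 18 m x 18 m field are assumptions,
and the choice of a random forest (which supports multi-output targets natively in
scikit-learn) is ours; the thesis does not prescribe this particular regressor here.

    import numpy as np
    from sklearn.ensemble import RandomForestRegressor
    from sklearn.model_selection import train_test_split

    rng = np.random.default_rng(0)

    # Stand-in for the fused audio-seismic feature matrix: each row holds
    # features from both modalities, each label is the 2D target position.
    X = rng.normal(size=(500, 24))          # 24 hypothetical fused features
    y = rng.uniform(0, 18, size=(500, 2))   # (x, y) inside an 18 m x 18 m field

    X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

    # A random forest predicts both coordinates from one model, matching the
    # multi-output regression formulation.
    model = RandomForestRegressor(n_estimators=100, random_state=0).fit(X_tr, y_tr)
    pred = model.predict(X_te)
    err = np.linalg.norm(pred - y_te, axis=1).mean()
    print(f"mean Euclidean localization error: {err:.3f}")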
After localization, we focus on non-overlapping human activity recognition for a single
human target using the seismic, audio, and audio-seismic modalities. We first employ
seismic sensors to recognize six human activities, including running, jogging, walking,
jumping jacks, and inactivity. The proposed approach uses an autoencoder network to learn
a reduced-dimensionality deep representation of 16 different time- and frequency-domain
features, and an artificial neural network classifier is applied to this deep
representation for activity recognition. Extensive experiments demonstrated precision and
recall values of 0.72 and 0.68, respectively. We found that recognition accuracy is
affected by background noise and inter-activity misclassification.
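A minimal sketch of this pipeline, assuming a TensorFlow/Keras implementation with an
8-dimensional code and synthetic data; the actual layer sizes, code dimensionality, and
training setup in the thesis may differ.

    import numpy as np
    import tensorflow as tf

    rng = np.random.default_rng(0)
    X = rng.normal(size=(1000, 16)).astype("float32")  # 16 handcrafted features
    y = rng.integers(0, 6, size=1000)                  # six activity labels

    # Autoencoder: compress the 16 features into a lower-dimensional code.
    inp = tf.keras.Input(shape=(16,))
    code = tf.keras.layers.Dense(8, activation="relu")(inp)
    out = tf.keras.layers.Dense(16)(code)
    autoencoder = tf.keras.Model(inp, out)
    autoencoder.compile(optimizer="adam", loss="mse")
    autoencoder.fit(X, X, epochs=5, verbose=0)

    # Classifier: a small ANN applied to the learned deep representation.
    encoder = tf.keras.Model(inp, code)
    clf = tf.keras.Sequential([
        tf.keras.layers.Dense(32, activation="relu", input_shape=(8,)),
        tf.keras.layers.Dense(6, activation="softmax"),
    ])
    clf.compile(optimizer="adam", loss="sparse_categorical_crossentropy",
                metrics=["accuracy"])
    clf.fit(encoder.predict(X, verbose=0), y, epochs=5, verbose=0)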
To reduce this misclassification, we next employ the audio modality to recognize an
extended set of human activities. The proposed approach uses a 2D convolutional neural
network and reduces inter-activity misclassification; however, the effect of background
noise remains.
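A minimal 2D CNN sketch of this kind of audio classifier, assuming log-mel spectrogram
inputs of 64 bands by 128 frames and nine output classes; the input representation,
shapes, and network depth are assumptions, not the thesis architecture.

    import tensorflow as tf

    # A small 2D CNN over spectrogram patches (hypothetical shape:
    # 64 mel bands x 128 frames x 1 channel).
    model = tf.keras.Sequential([
        tf.keras.layers.Conv2D(16, 3, activation="relu",
                               input_shape=(64, 128, 1)),
        tf.keras.layers.MaxPooling2D(),
        tf.keras.layers.Conv2D(32, 3, activation="relu"),
        tf.keras.layers.MaxPooling2D(),
        tf.keras.layers.Flatten(),
        tf.keras.layers.Dense(64, activation="relu"),
        tf.keras.layers.Dense(9, activation="softmax"),  # assumed class count
    ])
    model.compile(optimizer="adam", loss="sparse_categorical_crossentropy",
                  metrics=["accuracy"])
    model.summary()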
Multi-modal fusion typically performs better than a single modality. With this motivation,
we fuse data from the audio and seismic modalities using a 1D convolutional neural
network.
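A minimal sketch of such an early fusion, assuming the two modalities are synchronized and
resampled to a common rate so that each window can be stacked as a two-channel 1D input;
the 1024-sample window length and layer configuration are assumptions.

    import numpy as np
    import tensorflow as tf

    # Early fusion: stack synchronized audio and seismic windows as two input
    # channels of a single 1D signal.
    audio = np.random.randn(8, 1024).astype("float32")
    seismic = np.random.randn(8, 1024).astype("float32")
    fused = np.stack([audio, seismic], axis=-1)  # shape: (batch, 1024, 2)

    model = tf.keras.Sequential([
        tf.keras.layers.Conv1D(32, 7, activation="relu",
                               input_shape=(1024, 2)),
        tf.keras.layers.MaxPooling1D(4),
        tf.keras.layers.Conv1D(64, 5, activation="relu"),
        tf.keras.layers.GlobalAveragePooling1D(),
        tf.keras.layers.Dense(9, activation="softmax"),  # nine activity classes
    ])
    model.compile(optimizer="adam", loss="sparse_categorical_crossentropy")
    print(model.predict(fused, verbose=0).shape)  # (8, 9) class probabilities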
We found that the resulting multi-modal human activity recognition framework reduces both
inter-activity misclassifications and the effect of background noise, achieving an
F1-score of 92.00% on a nine-class classification problem. Thus, we conclude that audio
and seismic sensors can be used for both localization and activity recognition with high
accuracy. |
en_US |