Abstract:
Understanding human behaviour in social contexts is key to building intelligent
systems that interact with people naturally and efficiently. While
AI has made substantial advancements, especially in vision and language, there
remains a gap in AI systems that can effectively understand and explain human
traits, emotions, and social interactions. Current state-of-the-art models struggle
to accurately predict human behaviours in the intricate and subjective contexts of
real-world environments. To address this gap, the thesis proposes novel methods
that not only predict human-centered behaviours but also provide interpretable
insights into the underlying reasons behind these predictions. To enhance AI’s ability
to understand and interact with humans more intuitively, we leverage multimodal
data sources such as head motion, facial expressions, speech, and gaze behaviour.
Ultimately, this work aims to develop AI systems that can effectively interpret and
respond to human behaviours in complex real-world social settings, with applications
spanning social robotics and personalized human-computer interaction.
This thesis presents a comprehensive exploration of human behaviour prediction
in social contexts with explanations, addressing key challenges in understanding
personality and behavioral traits, group behaviours, and social interactions. The
first contribution demonstrates the utility of elementary head-motion units, termed
kinemes, for behavioral analytics. By transforming head-motion patterns into
sequences of kinemes, we uncover latent temporal signatures that enable efficient and
explainable predictions of personality and interview traits. Building on individual
traits, our second contribution investigates the significance of body-language
cues in social interactions, focusing in particular on gestures and body
movements. We propose a multiview attention fusion method, MAGIC-TBR, which
combines features from videos and their discrete cosine transform coefficients to
capture fine-grained behaviours such as gesturing, grooming, and fumbling.
In analyzing the bodily behaviour of participants in group settings, we observe that
in every multiparty activity, one or more dominant individuals typically take the
lead; each is referred to as the “Most Important Person” (MIP). However, current
datasets are insufficient for training models to identify the MIP accurately:
existing “in-the-wild” datasets are either too small or do not cover a wide range
of social situations. To address this, the third contribution of this thesis is a
large-scale “in-the-wild” dataset designed to capture human perceptions of
importance in social images, together with MIP-CLIP, a novel approach for
estimating the MIP in group settings. Through extensive benchmarking with state-of-the-art
MIP localization methods, we highlight the need for more robust algorithms capable
of handling real-world scenarios. The dataset and approach aim to significantly
advance research in understanding social situations. Beyond identifying
the MIP in group settings, we find that social gaze behaviours,
such as mutual gaze and shared attention, are critical to
understanding social interactions. These gaze cues provide valuable insights into the
dynamics of communication and further inform the prediction of social behaviours
within group contexts. Finally, we extend the analysis to gaze behaviours during
dyadic communication (a conversation between two people). We propose
a network designed to recognize these gaze patterns in images, providing deeper
insights into the dynamics of social interaction.
This work advances the development of explainable models for predicting human
behaviour and lays the foundation for future progress in understanding social
interactions in both individual and group contexts. The findings of this thesis
enhance the understanding of human-centered social interaction dynamics, offering
insights into the success of both individuals and groups in human-to-human and
human-computer interactions.