ABSTRACT
This paper presents a framework for generating appropriate facial expressions for a listener engaged in a dyadic conversation. The ability to produce contextually suitable facial gestures in response to user interactions may enhance the user experience of interactions with avatars and social robots. We propose a Transformer- and Siamese-architecture-based approach for generating appropriate facial expressions. Positive and negative speaker-listener pairs are created, and a contrastive loss is applied to facilitate learning. Furthermore, an ensemble of reconstruction-quality-sensitive loss functions is added to the network to learn discriminative features. The listener's facial reactions are represented as a combination of 3D Morphable Model (3DMM) coefficients and affect-related attributes (facial action units). The inputs to the network are pre-trained Transformer-based MARLIN features and affect-related features. Experimental analysis demonstrates the effectiveness of the proposed method across various metrics, showing improved performance compared to a variational autoencoder-based baseline.
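To illustrate the pairwise training scheme described above, the following is a minimal PyTorch sketch of a standard margin-based contrastive loss applied to speaker-listener embedding pairs. The function name, margin value, and embedding shapes are illustrative assumptions, not the paper's exact formulation.

```python
import torch
import torch.nn.functional as F

def contrastive_loss(speaker_emb: torch.Tensor,
                     listener_emb: torch.Tensor,
                     label: torch.Tensor,
                     margin: float = 1.0) -> torch.Tensor:
    """Margin-based contrastive loss over speaker-listener pairs.

    label is 1.0 for a matched (positive) pair and 0.0 for a
    mismatched (negative) pair. Positive pairs are pulled together;
    negative pairs are pushed beyond the margin.
    """
    # Euclidean distance between each speaker/listener embedding pair.
    d = F.pairwise_distance(speaker_emb, listener_emb)
    # Positives: minimize squared distance.
    # Negatives: penalize only pairs closer than the margin.
    loss = label * d.pow(2) + (1.0 - label) * F.relu(margin - d).pow(2)
    return loss.mean()

# Hypothetical usage with random embeddings (batch of 4, dim 128):
speaker = torch.randn(4, 128)
listener = torch.randn(4, 128)
labels = torch.tensor([1.0, 0.0, 1.0, 0.0])
print(contrastive_loss(speaker, listener, labels))
```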