Abstract: In video-based emotion recognition, effective multi-modal fusion techniques are essential to leverage the complementary relationship between audio and visual modalities. Recent ...
Abstract: The widespread adoption of Transformers in deep learning, serving as the core framework for numerous large-scale language models, has sparked significant interest in understanding their ...