Video-Audio Emotion Recognition benefits from the complementary information carried by multiple modalities. Because the target features are time-related but not strictly aligned in time, the video-audio features reduce to separate video features and audio features. Toward this goal, we select the spectrogram as the vocal feature, since it allows the audio signal to benefit from convolutional filters in a neural network. Inspired by LSTM-based image captioning, in which embedded word information and image information are spatially aligned, we embed the audio spectrogram together with the image sequence, since the spectrogram converts temporal information into spatial information. We propose both an architecture and a framework that optimize the alignment of these temporal features, and we provide an analysis of the significant performance improvement along with a discussion of general Video-Audio Emotion Recognition tasks.
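To make the described pipeline concrete, the sketch below shows one plausible reading of the architecture: a convolutional encoder over the audio spectrogram (treated as an image), a per-frame CNN followed by an LSTM over the video frame sequence, and a joint embedding used for emotion classification. This is an illustrative sketch, not the authors' implementation; all module names, layer sizes, the fusion-by-concatenation choice, and the number of emotion classes are assumptions.

```python
# Minimal sketch (not the paper's released code) of a spectrogram + frame-sequence
# fusion model for emotion recognition. Layer sizes, names, and the 7-class output
# are illustrative assumptions.
import torch
import torch.nn as nn


class SpectrogramEncoder(nn.Module):
    """Treats the spectrogram as an image so convolutional filters can be applied."""

    def __init__(self, embed_dim=256):
        super().__init__()
        self.conv = nn.Sequential(
            nn.Conv2d(1, 32, kernel_size=3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
            nn.Conv2d(32, 64, kernel_size=3, padding=1), nn.ReLU(), nn.AdaptiveAvgPool2d(1),
        )
        self.fc = nn.Linear(64, embed_dim)

    def forward(self, spec):            # spec: (batch, 1, freq, time)
        h = self.conv(spec).flatten(1)  # (batch, 64)
        return self.fc(h)               # (batch, embed_dim)


class VideoEncoder(nn.Module):
    """Per-frame CNN features followed by an LSTM over the frame sequence."""

    def __init__(self, embed_dim=256):
        super().__init__()
        self.frame_cnn = nn.Sequential(
            nn.Conv2d(3, 32, kernel_size=3, padding=1), nn.ReLU(), nn.AdaptiveAvgPool2d(1),
        )
        self.lstm = nn.LSTM(32, embed_dim, batch_first=True)

    def forward(self, frames):          # frames: (batch, T, 3, H, W)
        b, t = frames.shape[:2]
        f = self.frame_cnn(frames.flatten(0, 1)).flatten(1)  # (batch*T, 32)
        _, (h_n, _) = self.lstm(f.view(b, t, -1))
        return h_n[-1]                  # (batch, embed_dim)


class AVEmotionNet(nn.Module):
    """Embeds both modalities into a shared space and classifies the emotion."""

    def __init__(self, embed_dim=256, num_classes=7):  # 7 classes is an assumption
        super().__init__()
        self.audio = SpectrogramEncoder(embed_dim)
        self.video = VideoEncoder(embed_dim)
        self.classifier = nn.Linear(2 * embed_dim, num_classes)

    def forward(self, spec, frames):
        z = torch.cat([self.audio(spec), self.video(frames)], dim=1)
        return self.classifier(z)


if __name__ == "__main__":
    model = AVEmotionNet()
    spec = torch.randn(2, 1, 128, 300)       # dummy log-mel spectrogram batch
    frames = torch.randn(2, 16, 3, 64, 64)   # dummy 16-frame video clips
    print(model(spec, frames).shape)         # torch.Size([2, 7])
```

Concatenation is used here only as the simplest fusion strategy; the paper's proposed alignment of the temporal features may differ.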