Food and Paper: Investigating Auditory-Visual Perception using Multimodal Neural Networks with the SoundActions Dataset

This week's Food and Paper will be given by Arthur Jinyue Guo

Arthur Jinyue Guo - RITMO Food & Paper presentation

Abstract

Musicologists, psychologists, and computer scientists study the relationships between auditory and visual stimuli from very different perspectives, and gaps remain between these disciplines because of differing terminologies, methodologies, and interests. This research aims to bridge phenomenological sound theory, auditory-visual perception theory, and audio-video processing and machine learning. We introduce the SoundActions dataset, a collection of 365 audio-video recordings of (primarily) short sound actions. Each recording has been human-labeled and annotated according to Pierre Schaeffer’s theory of reduced listening, which describes the properties of the sound itself (e.g., "an impulsive sound") rather than its source (e.g., "a bird sound"). With these reduced-type labels in the audio-video data, we investigate auditory-visual perception using multimodal neural networks.
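
To make the labeling scheme concrete, here is a minimal Python sketch of what a single annotated clip in a SoundActions-style dataset might look like. The field names and values are illustrative assumptions, not the dataset's actual schema; the point is only the contrast between a reduced-listening label and a source label.

```python
from dataclasses import dataclass


@dataclass
class SoundActionClip:
    """Hypothetical annotation record; field names are assumptions, not the real schema."""
    clip_id: str          # identifier for one of the 365 audio-video recordings
    video_path: str       # path to the audio-video file
    reduced_label: str    # property of the sound itself, e.g. "impulsive"
    source_label: str     # the sound's source, e.g. "bird" (shown for contrast)


# Example entry: reduced listening describes the sound, not its cause.
example = SoundActionClip(
    clip_id="sa_0001",
    video_path="clips/sa_0001.mp4",
    reduced_label="impulsive",
    source_label="bird",
)
print(example.reduced_label)  # -> "impulsive"
```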