Abstract
Musicologists, psychologists, and computer scientists study relationships between auditory and visual stimuli from very different perspectives. Gaps between these disciplines arise from differing terminologies, methodologies, and interests. This research aims to bridge phenomenological sound theory, auditory-visual perception theory, and audio-video processing and machine learning. We introduce the SoundActions dataset, a collection of 365 audio-video recordings of (primarily) short sound actions. Each recording has been human-labeled and annotated according to Pierre Schaeffer's theory of reduced listening, which describes properties of the sound itself (e.g., "an impulsive sound") rather than its source (e.g., "a bird sound"). With these reduced-type labels in the audio-video data