Abstract
Musicologists, psychologists, and computer scientists study relationships between auditory and visual stimuli from very different perspectives. Gaps between these disciplines arise from differing terminologies, methodologies, and interests. This research aims to bridge phenomenological sound theory, auditory-visual perception theory, and audio-video processing and machine learning. We introduce the SoundActions dataset, a collection of 365 audio-video recordings of (primarily) short sound actions. Each recording has been human-labeled and annotated according to Pierre Schaeffer's theory of reduced listening, which describes properties of the sound itself (e.g., "an impulsive sound") rather than its source (e.g., "a bird sound"). With these reduced-type labels in the audio-video data