When
Thematic Session 2: Software and Synthesis (Monday, 15:40)
Abstract
We present initial results of ongoing research to inform a framework for AI-based voice synthesis that generates context-dependent expression. We argue that a neural synthesis framework should leverage the voice persona to constrain synthesis parameters, and should include an interface that supports meaningful user-defined abstractions and multimodal interaction. Importantly, to imbue a synthesized voice with expressivity, we must first understand the contexts and modalities within which vocal expression takes shape.
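As a purely hypothetical illustration of the design principle above (not the authors' implementation), a persona object might define the admissible ranges for low-level synthesis parameters, with user-facing abstractions mapped into those ranges rather than setting raw values directly. All names, parameters, and ranges in this sketch are assumptions for illustration only.

```python
# Hypothetical sketch: a voice persona that constrains low-level synthesis
# parameters so that user-defined abstractions cannot push the voice outside
# the persona's expressive range. Names and ranges are illustrative only.
from dataclasses import dataclass


@dataclass
class VoicePersona:
    name: str
    f0_range_hz: tuple[float, float]        # admissible pitch range
    speech_rate_range: tuple[float, float]  # syllables per second
    breathiness_range: tuple[float, float]  # 0.0 (modal) .. 1.0 (breathy)

    def constrain(self, f0_hz: float, rate: float, breathiness: float) -> dict:
        """Clamp requested parameters into this persona's admissible ranges."""
        clamp = lambda x, lo, hi: max(lo, min(hi, x))
        return {
            "f0_hz": clamp(f0_hz, *self.f0_range_hz),
            "rate": clamp(rate, *self.speech_rate_range),
            "breathiness": clamp(breathiness, *self.breathiness_range),
        }


# A user-level abstraction such as "excited" would first be translated into
# raw parameter targets, then filtered through the active persona.
radio_host = VoicePersona("radio_host", (90.0, 180.0), (3.5, 6.0), (0.0, 0.3))
print(radio_host.constrain(f0_hz=240.0, rate=7.0, breathiness=0.5))
# -> {'f0_hz': 180.0, 'rate': 6.0, 'breathiness': 0.3}
```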
We conducted semi-structured interviews with ten professional vocal artists to understand how the performative voice functions as a communication device and a medium of identity, and how artists describe their physical and cognitive vocal embodiment, feedback, and manipulation. Iterative thematic analysis of the interviews identified the following concepts as strongly related to, or influencing, vocal persona when the voice is used as an expressive medium: Physical and Sociocultural Contexts, Technological Mediation, Embodiment, Self-Perception, Perception of Others, and Performativity.
We found three main organizing principles present across these themes: information hierarchy, type of mediation, and agency. Decisions around information hierarchy influenced vocal expression by determining the prioritization of semantic content, internal state, auditory/musical information, and shared referential concepts. The method and degree of mediation influenced the adoption of a vocal persona by prioritizing production choices deemed necessary to align with physical, sociocultural, and/or technological contexts. Finally, a speaker's agency to consciously adjust their voice to the surrounding context, both between and within personas, was both important and ubiquitous.
In our presentation, we will expand on these themes, organizing principles, and additional insights from our qualitative analysis, and discuss the details of our proposed framework for expressive voice synthesis and how it relates to them.
Bio
Camille is fascinated by the human voice as a means of expressive communication, and by the impact of a person's experiences and environments on their voice. Her interdisciplinary research combines digital signal processing (DSP), machine learning (ML), and human-computer interaction (HCI) techniques with psychoacoustics and vocal science. She seeks to understand how current technologies leverage and position the voice to augment expressive, humanistic, and equitable technological interaction.