Computer-Augmented Techniques for Musical Articulatory Synthesis

This is a project proposal I made as part of an application for a PhD program. I've slightly adapted it to work better as a wiki page. Ultimately, it got rejected, but I was very close to getting the position.

Included in this page is also a transcribed version of the 5-minute presentation I made to pitch this proposal during the interview. This can be found in the Prelude.

This work is placed under the CC BY-SA 4.0 Creative Commons license.

Prelude

The overall objective of this project is to explore and discover new ways to control artificial voice in a musical context.

To better understand the interest and perspectives of this project, consider this video of Sir Ian McKellen performing the opening lines of "The Merchant of Venice", with direction from John Barton.

What if computers could perform melodies like this?

In a performance, it often happens that the way something is said (known as prosody) has equal value to the words themselves. In other words, it's not what, but how you say something that matters. As a computer music composer, I am fascinated by this idea, and have a strong interest in exploring this concept in my work.

As it turns out, vocal-like instruments and sounds lend themselves really well to musical prosody. For this reason, I've been very interested in vocal synthesis techniques, and in particular, articulatory synthesis.

Articulatory synthesis is a unique approach for producing artificial voice that works by simulating the human vocal tract. By shaping this virtual vocal tract into different configurations, one is able to produce different phonemes, the building blocks of speech, out the other side.

It is a truly malleable method of speech synthesis, with much potential for musical expression unrelated to speech. To demonstrate this, I have created a little Android app that allows one to "sculpt" a simple vocal tract in an articulatory synthesizer. Doing so dramatically changes the timbre of the voice.

Abstract

Articulatory Synthesis is a branch of speech synthesis that uses physically based models of the human vocal tract for sound production. While these methods can yield high quality results, the large number of parametric inputs makes them difficult to control. This research aims to develop novel techniques that utilize AI to help musically manipulate and perform these models. Musical interfaces will be constructed with AI-assistance to explore sound spaces. Design principles involving so-called anthropomorphic sensitivity to the disembodied artificial human voice will be formally investigated.

Objectives and Research Questions

The Main Objective

The proposed research efforts outlined below aim to enable individuals to puppeteer computers and get them to "sing". It is the hope that the results of these investigations will uncover new musical interactions using the computer medium.

The choice of imitating the human voice was deliberate. Arguably our oldest companion to music, the voice taps into profound parts of the human experience, and can be used to establish a strong link between the music that brought us here and the music that awaits us.

Research Questions

This research proposal aims to directly address the following questions:

What are the ways that AI can be used to help musically manipulate and perform physically based vocal tract models?

Broadly speaking, it is expected that AI techniques will be used in the problem domain of dimensionality reduction. At the lowest level, this means using AI to find vocal tract parameter spaces that approximate an ideal vowel shape. Higher abstraction layers will be built on top of this work that explore AI in the context of gesture and ensemble.
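As a rough illustration of the lowest level, the following Python sketch shows one way such dimensionality reduction could look: principal component analysis over a corpus of vocal tract area functions, yielding a small control space that maps back to full tract shapes. The corpus here is entirely hypothetical (random placeholder data), and the section count and number of retained components are assumptions for illustration only.

    import numpy as np

    # Hypothetical corpus: each row is a vocal tract area function (44 tube
    # sections), hand-tuned or measured for a known vowel. Random placeholder
    # data stands in for real shapes.
    rng = np.random.default_rng(0)
    vowel_shapes = rng.uniform(0.1, 3.0, size=(20, 44))

    # Principal component analysis via SVD: the strongest "deformation modes"
    # of the corpus become the axes of a low-dimensional articulation space.
    mean_shape = vowel_shapes.mean(axis=0)
    _, _, Vt = np.linalg.svd(vowel_shapes - mean_shape, full_matrices=False)
    components = Vt[:2]

    def shape_from_controls(c1, c2):
        """Map a 2D controller position back to a full area function."""
        area = mean_shape + c1 * components[0] + c2 * components[1]
        return np.clip(area, 0.01, None)  # keep every tube section open

A two-dimensional space like this could be driven directly from a touch surface or joystick; more sophisticated alternatives (autoencoders, gestural regression) would slot into the same place.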

What are the ideal interfaces for articulating disembodied artificial voices?

Because the sounds produced relate so closely to the human voice, interfaces require a degree of anthropomorphic sensitivity, or an awareness of how humans generally respond to the uncanny nature of disembodied artificial human voices.

How can these new techniques be utilized to push the sonic boundaries of these models?

This research question grounds itself in the aspirations of computer music composers. These articulatory synthesis models, while purpose-built for synthesizing speech, need not be limited to producing spectra associated with the human voice. Hopefully, the control techniques developed will serendipitously uncover new musically compelling timbres and inspire new interfaces to control them. New algorithms lead to new controllers lead to new algorithms.

State of the Art and Background

Research in Articulatory Synthesis for speech has been relatively stagnant in recent decades, with deep learning approaches being favored instead. Musical applications of Articulatory Synthesis, such as singing, are even rarer. We are well overdue for a renaissance.

The Voder

In the late 30s, Bell Labs created the Voder, an interface for controlling an electronic voice. Despite being synthesized using rudimentary electrical components, its interface gave it a surprising range of speech prosody. The Voder was notoriously difficult to control, and very few people were capable of effectively performing with it. This is often the trade-off with artificial voice control: it is difficult to build interfaces with both high ceilings and low floors.

The 60s: Bell Labs and the Golden Age of Vocal Synthesis

Physically-based computer models of the vocal tract have been around since the 60s, and singing computers have existed for almost as long. In 1962, John L. Kelly and Carol C. Lochbaum published one of the first software implementations of a physical model of the vocal tract. The previous year, it was used to produce the singing voice in "Daisy Bell" by Max Mathews, the first time a computer was taught to sing, and perhaps one of the earliest significant works of computer music. This work went on to influence the creation of HAL in 2001: A Space Odyssey, and set up expectations for the disembodied computer voice.

The Rise of the Personal Computer

In the 70s and 80s, computer hardware began to change with the rise of the personal computer. Faster but lower-fidelity speech techniques such as LPC, concatenative synthesis, and formant synthesis were better able to leverage the new hardware.

In 1991, Perry Cook published a seminal work on articulatory singing synthesis. In addition to creating novel ways for analyzing and discovering vocal tract parameters, Cook also built an interactive GUI for realtime singing control of the DSP model. This was perhaps the earliest time such models could be performed in realtime, thanks to hardware improvements.

Vocaloid and Virtual Pop Stars

In the early 2000s, a commercial singing synthesizer known as Vocaloid was born. Under the hood, Vocaloid implements a proprietary form of concatenative synthesis. Voice sounds for Vocaloid are created by meticulously sampling the performances of live singers. Still in development today, Vocaloid has a rich community and is considered "cutting-edge" for singing synthesis in the industry.

One of the interesting things about Vocaloid is how it addresses the uncanny-valley issues that come up in vocal synthesis. Each voice preset, or "performer", is paired with an anime-style cartoon character with a personality and backstory. Making them cartoons steers them away from the uncanny valley. Unlike most efforts in speech synthesis, fidelity and even intelligibility are less important. As a result, Vocaloid has a distinct signature sound that is artificial yet familiar.

Musical Singing Interfaces on the Interactive Web

Developments in the web browser over the last ten years have yielded very interesting musical interfaces for synthesized voice.

In the mid 2010s, Neil Thapen developed the web app Pink Trombone, touted as a low-level speech synthesizer. The interface is an anatomical split view of a vocal tract that can be manipulated in realtime using the mouse. The underlying model is a variation of the Kelly-Lochbaum physical model, utilizing an analytical LF glottal model. Pink Trombone served as the basis of Voc, a port I made of the DSP layer to ANSI C using a literate programming style.
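For readers unfamiliar with the source-filter split in models like these, the Python sketch below generates a glottal excitation signal. It deliberately uses the classic Rosenberg pulse rather than the LF model Pink Trombone employs, purely because it is simpler to write down; the parameter values are illustrative guesses, not values taken from Pink Trombone or Voc.

    import numpy as np

    def rosenberg_pulse_train(f0, dur, sr=44100, open_frac=0.4, close_frac=0.16):
        """Train of Rosenberg glottal pulses (a simpler stand-in for LF)."""
        period = int(sr / f0)                 # samples per glottal cycle
        tp = int(open_frac * period)          # opening phase length
        tn = int(close_frac * period)         # closing phase length
        t = np.arange(period)
        pulse = np.zeros(period)
        rising = t < tp
        falling = (t >= tp) & (t < tp + tn)
        pulse[rising] = 0.5 * (1.0 - np.cos(np.pi * t[rising] / tp))
        pulse[falling] = np.cos(np.pi * (t[falling] - tp) / (2.0 * tn))
        n = int(dur * sr)
        return np.tile(pulse, n // period + 1)[:n]

On its own this buzzes like a larynx with no mouth; the vocal tract model downstream is what turns it into vowels.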

Much of Neil Thapen's work in Pink Trombone can be traced back to Jack Mullen's DSP dissertation on using 2D waveguides for vocal tract modelling and control.

Around 2018, Adult Swim released Choir, a Web Audio-powered virtual singing quartet with interactive visuals by David Li and sound design by Chris Heinrichs. Chords are allegedly found using machine learning. In 2020, Li and Google Research teamed up to release Blob Opera, essentially a second iteration of Choir. Blob Opera and Choir both sound physically based, but I can't confirm this, as the source code is not public.

Postlude: Deep Learning

Recently, there have been early attempts at using deep learning to synthesize singing. While the output results are impressive, these are still speech synthesis studies in musicians' clothing, as they tend to focus on fidelity rather than expression.

Research Methodology

The research involved in investigating novel computer-augmented techniques for musical articulatory synthesis can be broken down into the following: developing mental models and frameworks, demo-driven development, and validation studies.

Mental Models and Frameworks

Articulating a disembodied artificial voice requires developing mental models and frameworks that break up the problem into smaller components. The first proposed structure is what I will refer to as the Instrument Pipeline. The Instrument Pipeline maps the high-level components for a hypothetical musical performance interface. It is divided into four layers: interface, mapping, model, and sound.

The Interface Layer concerns itself with human-computer interactions. Interfaces include peripherals like keyboard and mouse, gamepads, MIDI controllers, or other homegrown sensors built using Arduino or similar maker components.

Previous projects such as the Contrenot, Eyejam, or Ethersurface, as well as my work with the Soli and Leap Motion, provide some insight into how I approach physical interfaces in computer music. These interfaces are built from simple electronics or off-the-shelf devices, and their control schemes are always developed by building on their natural affordances.

The Sound Layer is responsible for emitting sound. For the purpose of this research, the scope of sound transmission sources will be limited to conventional speakers and headphones.

The Model Layer is the DSP algorithm that contains the physical model of the human vocal tract. It is the layer that synthesizes PCM data, which is then converted to analogue sound via the DAC.
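To make the Model Layer concrete, here is a minimal Python sketch of a Kelly-Lochbaum style ladder: the tract is divided into cylindrical sections, and scattering junctions are defined by the area ratios of neighbouring sections. This is a simplified illustration, not the Voc or Pink Trombone code; the boundary reflection coefficients and damping value are assumptions chosen only so the sketch stays stable.

    import numpy as np

    def kl_tract(areas, source, k_glottis=0.75, k_lips=-0.85, damping=0.999):
        """Filter a glottal source through a Kelly-Lochbaum tube ladder.

        areas:  cross-sectional area of each cylindrical tube section
        source: excitation signal (e.g. the pulse train sketched earlier)
        """
        areas = np.asarray(areas, dtype=float)
        n = len(areas)
        # reflection coefficients at the n-1 interior junctions
        k = (areas[:-1] - areas[1:]) / (areas[:-1] + areas[1:])
        fwd = np.zeros(n)   # right-going wave per section
        bwd = np.zeros(n)   # left-going wave per section
        out = np.zeros(len(source))
        for t, x in enumerate(source):
            new_fwd = np.zeros(n)
            new_bwd = np.zeros(n)
            # glottis end: inject the source, partially reflect the return wave
            new_fwd[0] = x + k_glottis * bwd[0]
            # interior scattering junctions between adjacent sections
            for i in range(n - 1):
                new_fwd[i + 1] = (1 + k[i]) * fwd[i] - k[i] * bwd[i + 1]
                new_bwd[i] = k[i] * fwd[i] + (1 - k[i]) * bwd[i + 1]
            # lip end: partial reflection back in, the remainder radiates out
            new_bwd[n - 1] = k_lips * fwd[n - 1]
            out[t] = (1 + k_lips) * fwd[n - 1]
            fwd, bwd = damping * new_fwd, damping * new_bwd
        return out

Driving kl_tract with the Rosenberg pulse train from the earlier sketch and an area function from shape_from_controls yields a crude but audible vowel; the real Implementation would be far more careful about boundary conditions, nasal coupling, and fractional delays.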

The Mapping Layer sits between the Interface and the Model, and is in charge of converting musical vectors of expression produced by the interface into input parameters for the model. This mapping layer is anticipated to be the core area of research, and where most of the AI-based techniques will be applied. To fully address the micro and macro concerns of musically meaningful instruments, mapping must be elaborated on further with another framework, which will be called the Hierarchy of Control. It considers three scales of control: timbre, gesture, and ensemble:

The Timbre scale is the lowest level of control, and concerns itself with manipulating the parameter space of vocal tract models in question. Research will go into using AI techniques to find meaningful vocal tract shapes.

The Gesture scale of control abstracts everything into gestures: continuous trajectories. Gestures are used to navigate the timbre space, and it is here that the perceptual event of a musical note is formed within a phrase. Within this framework, gestures can be synthesized, or analyzed from continuous controller interface events; AI intervention will be used to assist both processes. A minimal sketch of a gesture trajectory follows these three scales.

At the Ensemble scale of control, the paradigm shifts from manipulating one voice to many. The role of the human performer becomes similar to that of a conductor, with a focus on macro structure rather than individual sound events. Voices take on more self-described behavior, moving independently with an awareness of the other voices. No era quite captures the beauty of vocal counterpoint like the Renaissance sacred choral music written in the two centuries prior to Fux's Gradus ad Parnassum. Inspiration for control and rules will be found by studying sacred choral works from Renaissance composers such as Palestrina, as well as late-medieval works like those of the Ars Nova.
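Returning to the timbre and gesture scales, the Python sketch below shows the kind of minimal mapping being imagined: controller breakpoints are expanded into a dense, continuous trajectory through a low-dimensional timbre space (here, the hypothetical two PCA weights from the earlier sketch). The breakpoint values and control rate are arbitrary placeholders.

    import numpy as np

    def gesture(points, times, rate=200):
        """Expand controller breakpoints into a dense control trajectory.

        points: list of (c1, c2) timbre-space coordinates
        times:  time of each breakpoint, in seconds
        rate:   control frames per second
        """
        pts = np.asarray(points, dtype=float)
        t = np.arange(0.0, times[-1], 1.0 / rate)
        c1 = np.interp(t, times, pts[:, 0])
        c2 = np.interp(t, times, pts[:, 1])
        return np.stack([c1, c2], axis=1)

    # A hypothetical one-second glide between two vowel-like shapes, held
    # briefly at the start; each frame could be passed to shape_from_controls().
    trajectory = gesture([(1.2, -0.3), (1.2, -0.3), (-0.8, 0.9)],
                         [0.0, 0.25, 1.0])

A learned gesture model, for instance one trained on recorded controller performances, would replace the piecewise-linear interpolation here, but would plug into the same slot between interface events and tract parameters.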

Demo-Driven Development

In my work I employ an iterative process that I call Demo-Driven Development: the creation of tightly scoped works designed to investigate a particular idea. This typically takes the form of some kind of compositional etude with varying degrees of technicality, or an interactive sound toy. An effective demo will inspire conversations that bring momentum for the next iteration. This kind of process is important for grounding research in things that are within the bounds of the "musically meaningful".

Consider my musical DSP library Soundpipe. In my initial attempts at composing with Soundpipe, I found it too low-level for creative thought. This problem led to Sporth, a stack-based language built on top of Soundpipe that could tersely build modular patches. After using Sporth to compose music, it was adapted to work as a live-coding environment. This tightened the creative feedback loop. Performance became an issue, so the whole tool was rewritten to be faster. Because more could be done in realtime, the complexity of the patches grew, and new abstraction layers were desired. High-level languages like Scheme were then introduced, with tight integration with the Monome Arc and Grid; this is the current iteration.

Validation Studies

Validation studies will attempt to quantify ideal characteristics for interfaces articulating disembodied artificial voice.

These studies attempt to measure three qualities that make up a musically meaningful interface: expressiveness, intuitiveness, and anthropomorphic sensitivity.

The structure of these studies will consist of surveys and experiments utilizing generated stimuli that are both interactive and non-interactive. Participants will be split into those with and without formal musical training.

Ethical Considerations

Any research project involved in synthesizing human-like sounds or visuals should proceed with great caution. While this project is indeed related to realistic human speech synthesis, the ethical concerns as they relate to issues like deepfakes are anticipated to be minimal. Rather than setting out to make a musical speech engine, this research intends to approach the voice as a musical instrument and a template for exploring new complex timbres and sound structures.

Indicative Timeline

Deliverables

The deliverables of this research project come in three forms: the Dissertation, the Implementation, and the Demonstration. These will be developed over the allocated three year period. While much may change over that time, it is anticipated that these will remain consistent pillars.

The Implementation will be software based on the research discussed in the Dissertation. This will most likely be written as a literate program, a programming paradigm invented by Donald Knuth, which has seen success in many large-scale projects.

The Demonstrations, built on top of the Implementation, will be crafted to convey a core idea found in the novel research. It is anticipated that at least two interfaces will be built. One interface will have a musical instrument form factor exploring timbre and gesture scales of control. The other interface will explore ensemble and gesture scales.

Timeline over 3-year period

It is sensible to align the timeline events with the academic calendar of the University, which breaks up the 3-year period into 6 semesters between Autumn 2021 and Spring 2024.

The first and last semesters will be for general orientation and final revisions, respectively.

The first 3 semesters will have a heavy focus on the Demonstration and Implementation. After that, the focus will shift towards Dissertation and Implementation.

The initial research period of the first year is an important time for scoping and planning. A large emphasis will be placed on demo-driven development. This will help build up an intuition of the domain that will propel into the work done in the second year.

The second year aims to be a busy one. Any kind of fabrication plans or experiment design needs to begin happening by early Autumn 2022 at the latest, as these have a considerable number of moving parts. Spring 2023 will be a crunch to complete as much as possible in order to gracefully meet the deadline in the following year.

The final year is wrap-up. By winter break, all three deliverables should feel comfortably close to completion. The final Demonstrations should be done at this point. By Spring 2024, work should wind down to final revisions and tweaks.
