Auditory Scene Analysis

Auditory Scene Analysis describes how our perceptual system parses incoming complex vibrations (sound) to produce a meaningful representation of the environment. It involves grouping or separating sound events in time, a process called auditory streaming. Elements can be grouped together (integration), separated into simultaneous layers (segregation), or separated into successive events (segmentation). Several principles serve as “guides” to auditory streaming, including harmonicity, onset synchrony, frequency comodulation, amplitude comodulation, and source location.

[Spectrogram of a 15-second excerpt from Robert Normandeau’s Clair de Terre]

This spectrogram represents a 15-second excerpt from Robert Normandeau’s large-scale, multimovement electroacoustic composition Clair de Terre (1999). The movement from which this excerpt is taken is called “Micro-montage” and the music lives up to its title: many brief sound events are juxtaposed or superimposed in a short amount of time. Listen to this clip and you will hear a large, almost overwhelming number of sound events from many different sources, deployed so rapidly that it can be hard to keep track of them all.

This raises an interesting question: how do we keep track of them all? How does the auditory system make sense of the amazingly complex and ever-changing air vibrations that reach the ear? In this excerpt, I can hear crashing chords, whistling wind, chirping birds, a revving motorcycle, and many more sound sources that I vaguely recognize but am hard-pressed to name. Although this particular panoply occurs in the context of a piece of electroacoustic music, the experience of being bombarded with many different sounds is familiar from what the American psychologist William James called the ‘blooming, buzzing confusion’ of everyday life. How does the auditory system separate all of these sources into discrete perceptual units?

The process of parsing the incoming sound signal into a meaningful representation of the environment is called auditory scene analysis. A complete explanation of auditory scene analysis is beyond the scope of any blog post—the book in which Albert Bregman gave the idea its fullest exposition is nearly 800 pages long!—but we can introduce some basic concepts here.

As the spectrogram above makes clear, we often process many incoming frequencies at the same time, and the auditory system must decide which ones go together (integration) and which ones should be separated (segregation). For example, in the noisy scene of a city street at any given time, some of the sound components reaching your ears may belong to a motorcycle driving by, others to ambient traffic noise, and still others to voices of people on the sidewalk next to you: your auditory system deciphers which is which. Additionally, the auditory system must group incoming sound components into units that are delimited in time (segmentation), for example musical notes, and decide which ones to group together into extended sequences such as melodies. This is called auditory streaming.
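
To make streaming concrete, here is a minimal synthesis sketch, assuming NumPy and SciPy are installed, of the classic “ABA” galloping sequence often used to demonstrate it: two alternating pure tones are heard as a single galloping rhythm when their frequencies are close, but split into two separate streams when they are far apart. The tone durations, frequencies, and file names here are illustrative choices, not values from this post.

```python
# Classic "ABA-" streaming demonstration: alternating pure tones A and B.
import numpy as np
from scipy.io import wavfile

SR = 44100  # sample rate in Hz

def tone(freq, dur=0.08, sr=SR):
    """Pure tone with short raised-cosine ramps to avoid clicks."""
    t = np.arange(int(sr * dur)) / sr
    y = np.sin(2 * np.pi * freq * t)
    ramp = int(0.005 * sr)  # 5 ms fade in/out
    env = np.ones_like(y)
    env[:ramp] = 0.5 * (1 - np.cos(np.pi * np.arange(ramp) / ramp))
    env[-ramp:] = env[:ramp][::-1]
    return y * env

def aba_sequence(f_a, f_b, repeats=10, sr=SR):
    """Repeat the pattern A-B-A-rest (the 'galloping' rhythm)."""
    gap = np.zeros(int(sr * 0.02))
    rest = np.zeros(int(sr * 0.12))
    cycle = np.concatenate([tone(f_a), gap, tone(f_b), gap, tone(f_a), rest])
    return np.tile(cycle, repeats)

# Small frequency separation: heard as one galloping stream (integration).
wavfile.write("aba_close.wav", SR, (0.8 * aba_sequence(500, 550)).astype(np.float32))
# Large separation: splits into two streams, A-A-A... and B-B-B... (segregation).
wavfile.write("aba_far.wav", SR, (0.8 * aba_sequence(500, 1400)).astype(np.float32))
```

The 5-millisecond ramps matter: without them, the onset clicks themselves become salient events that confound the demonstration.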

Complicated though this task may be, fortunately only a relatively small number of principles guide the auditory system through it. They are:

  • Harmonicity: Frequencies (or partials) related by simple integer ratios tend to group together. For example, if the auditory scene contains frequencies at 110 Hz, 220 Hz, and 330 Hz (n, 2n, 3n), the auditory system will tend to fuse them into a single complex sound, whereas frequencies at 110 Hz, 201 Hz, and 350 Hz, which are not related by simple ratios, are less likely to fuse; the sketch after this list synthesizes both cases. For further insight into the harmonicity principle, see this example from Albert Bregman’s website.

  • Onset synchrony: Sound components that begin within a very short time window of about 30 milliseconds tend to group together (the sketch after this list staggers one partial’s onset to demonstrate the effect).

  • Frequency comodulation: Sound components that rise or fall in frequency in parallel tend to group together. For instance, you can hear in this example how our perceptual system tends to group the components of a complex tone based on their frequency comodulation.

  • Amplitude comodulation: Sound components that get louder or softer in parallel tend to group together.

  • Source location: Sound components that originate from the same physical location in space tend to group together. As an example, this demonstration shows how we tend to integrate or segregate auditory streams based on their perceived source location (panning). 
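
The first two principles also lend themselves to a quick synthesis sketch. Below, again assuming NumPy and SciPy, a harmonic set of partials (n, 2n, 3n) fuses into a single complex tone, an inharmonic set is more likely heard as separate components, and delaying one partial’s onset far beyond the roughly 30-millisecond window makes it pop out of an otherwise harmonic complex. The frequencies follow the examples above; durations and file names are illustrative.

```python
# Harmonicity and onset synchrony: fuse or segregate partials of a complex tone.
import numpy as np
from scipy.io import wavfile

SR = 44100  # sample rate in Hz

def complex_tone(freqs, dur=2.0, onsets=None, sr=SR):
    """Sum sine partials at `freqs`; `onsets` optionally delays each
    partial's start time, in seconds (default: all start together)."""
    onsets = onsets if onsets is not None else [0.0] * len(freqs)
    t = np.arange(int(sr * dur)) / sr
    y = np.zeros_like(t)
    for f, start in zip(freqs, onsets):
        partial = np.sin(2 * np.pi * f * t)
        partial[t < start] = 0.0  # silence this partial before its onset
        y += partial
    return (0.9 * y / len(freqs)).astype(np.float32)

# Harmonic partials (n, 2n, 3n): fuse into one tone with a 110 Hz pitch.
wavfile.write("harmonic.wav", SR, complex_tone([110, 220, 330]))
# Partials not related by simple ratios: more likely heard as separate sounds.
wavfile.write("inharmonic.wav", SR, complex_tone([110, 201, 350]))
# Breaking onset synchrony: the 220 Hz partial, delayed by 400 ms (far beyond
# the ~30 ms window), tends to segregate from the rest of the complex.
wavfile.write("staggered.wav", SR, complex_tone([110, 220, 330], onsets=[0.0, 0.4, 0.0]))
```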

For the most part, we are unaware that this process is happening, and take it for granted. But before the auditory scene makes it into your conscious awareness, an amazing feat of pre-attentive analysis has already converted the dizzying complexity of air vibrations around you into a coherent picture of the world.

REFERENCE

Bregman, A. S. (1990). Auditory Scene Analysis: The Perceptual Organization of Sound. The MIT Press. https://doi.org/10.7551/mitpress/1486.001.0001
