Speech as a Model for Orchestration

Individual Project Report

Author

Louis-Michel Tougas (McGill University)

Published: May 27th, 2025

The general objective of this project was to conduct research on speech as a timbre model for orchestration. Drawing on computer analysis of speech, I aimed to explore how language and its acoustic properties could serve as a compositional model. This model can be approached from two perspectives: the generation of musical material based on phonetic sound structures, and the organization of that material according to rules analogous to those found in natural languages.

1. Starting Point

In 2021, one year before beginning the present project, I took part in the Composer-Performer Research Ensemble [CORE] Project at McGill University. The first part of this project took the form of a seminar in which a specific terminology was presented so that composers and performers could share a common vocabulary regarding timbre and orchestration. This terminology was then used to analyze various seminal works by composers such as György Ligeti and Kaija Saariaho, with a focus on timbre as a form-bearing parameter. Preliminary collaborative work between composers and performers was conducted, especially regarding the production of particular timbres through extended techniques. Each performer was asked to give an overview of the common techniques for their instrument, while composers were asked to develop a seed idea or a specific orchestration problem that would later serve as the starting point for writing a piece for the ensemble.

My preliminary goal was to work with motives whose identities would be defined by their timbral properties rather than by characteristics such as pitch contour or rhythmic proportions. While a motive possesses a certain perceptual unity that allows the listener to recognize it as such, it is, from a parametric point of view, already a compound unit made up of smaller components. The question that then arose was: if traditional musical motives are made up of chords, notes, and rhythmic cells, what would “timbral motives” be made of?

That question was to be explored in the specific context of a piece for seven acoustic instruments, where access to the internal properties of sound is limited—unlike in computer-based composition, where any sound can be manipulated in virtually infinite ways “from the inside.” The latter approach has been adopted in the context of the present project.

 

Recording of Etude, by Louis-Michel Tougas. CORE Round 2, April 2022.

 

I later had the opportunity to participate in an exchange and to present the resulting piece at the University of British Columbia, where it was performed by the UBC Contemporary Players in Vancouver. While at UBC, I met composer Darren Xu, who was using the Cantonese language as inspiration for his compositional work. For both of us, the interest in speech as a model emerged from a shared search for a more formalized way of organizing timbre-based material in our music. The project described below benefited greatly from my exchange with Darren and his approach to composition inspired by speech.

2. Initial Objectives: Timbral Imitation of Speech

My initial objective was to use excerpts of spoken voice in Québécois French as acoustic models for orchestration. By analyzing these excerpts using various computer techniques, such as formant analysis, my general aim was to generate short-span instrumental combinations that would resemble spoken syllables, words, or short sentences.

I began by examining several recent examples of this kind of approach to orchestration. Notable differences among them included the type and complexity of the technological means involved, the degree of formalism each composer adopted, and the extent of resemblance between the spoken voice and the musical realization.

Recognizing that speech has served as a model in the music of many cultures across various eras, I chose to focus on more recent examples. One such case is Jonathan Harvey’s Speakings, for orchestra and electronics, which is perhaps among the most formalized attempts at directly imitating the speaking voice through orchestration. For this piece, Harvey made use of the computer-assisted orchestration software Orchidée, developed at IRCAM.

With these examples in mind, I set out to develop my own personal approach to using speech as a model for orchestration.

I quickly came to understand the complexity and inherent dynamism of speech as a phenomenon. As a result, treating phonemes as static entities defined by fixed acoustic characteristics would not yield an accurate model—even within the bounds of my relatively flexible approach to speech imitation. One important aspect, which will be discussed further, is that phonemes undergo significant acoustic modifications depending on their position within a word and the surrounding phonemes. While this complicates their characterization using simple spectral or formant models, it also offers a compelling analogy with orchestration conceived as a temporally dynamic process rather than a combination of static elements.

This quote by the early twentieth-century linguist Edward Sapir nicely sums up this realization:

These [phonemes] are in actual behavior individually modifiable; but the essential point is that through the unconscious selection of sounds as phonemes definite psychological barriers are erected between various phonetic stations, so that speech ceases to be an expressive flow of sound and becomes a symbolic composition with limited materials or units. The analogy with musical theory seems quite fair. Even the most resplendent and dynamic symphony is built up of tangibly distinct musical entities or notes which in the physical world flow into each other in an indefinite continuum but which in the world of aesthetic composition and appreciation are definitely bounded off against each other, so that they may enter into an intricate mathematics of significant relationships. The phonemes of a language are in principle a distinct system peculiar to the given language, and its words must be made up, in unconscious theory if not always in actualized behavior, of these phonemes.
— Sapir, 1933

3. Compositional Approach

The approach I initially decided to adopt was to start with what I understood to be the most basic sound unit of natural languages and build higher-order units from there. With my limited knowledge of linguistics, I understood that the most basic units of natural languages were phonemes, which could then be combined to form syllables, words, and so on. I also assumed that phonemes were acoustically defined units.

With these assumptions in mind, my general objective was to record sound files of instrumental techniques that would roughly resemble phonemes in my own language, Québécois French. I chose to limit myself to one instrument—the piano—and to write an electronic part composed solely of recorded sound files featuring various piano techniques.

I therefore started by establishing a list of the phonemes present in Québécois French. A complete list, reproduced below, can be found on the website of the Office québécois de la langue française. With the help of pianist Rosane Lajoie, who also specializes in coaching singers in French pronunciation, I recorded all the phonemes in the list, first out of context and then in various positions within a set of words. Since I thought phonemes were acoustically defined units, the aim at this point was to gather sound files that could be analyzed in a straightforward manner and characterized according to, for example, their formant structures.
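As an illustration of this intended “static” characterization, the following sketch shows how a single recorded vowel could be summarized by average formant values. It is a minimal example rather than the exact procedure used in the project: it assumes the parselmouth Python interface to Praat and a hypothetical file name for one of the recorded phonemes.

```python
import math
import parselmouth  # Python interface to the Praat analysis engine

# Hypothetical recording of an isolated vowel from the Québécois French set.
snd = parselmouth.Sound("phoneme_i_quebecois.wav")
formant = snd.to_formant_burg()  # Praat defaults: 5 formants, 5500 Hz ceiling

# Sample F1 and F2 at regular points inside the file (avoiding the very edges)
# and average them, treating the recording as a roughly steady vowel.
times = [snd.duration * i / 10 for i in range(1, 10)]
f1 = [formant.get_value_at_time(1, t) for t in times]
f2 = [formant.get_value_at_time(2, t) for t in times]
f1 = [v for v in f1 if not math.isnan(v)]
f2 = [v for v in f2 if not math.isnan(v)]
print(f"mean F1 ≈ {sum(f1) / len(f1):.0f} Hz, mean F2 ≈ {sum(f2) / len(f2):.0f} Hz")
```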

I also tried to get a better understanding of the essential acoustical differences between Québécois and normative French, so we recorded another set of phonemes and words with both normative and Québécois French pronunciation.

Three important differences in pronunciation between normative and Québécois French stood out: the presence in Québécois of phonemes that are now obsolete in normative French, the use of affricates, and the use of diphthongs.

The following sound files illustrate the difference in pronunciation between normative and Québécois French for the vowel /i/:

 
 

Affrication is the addition of a /s/ or /z/ between the consonants /t/ or /d/ and the vowels /i/ or /y/. For example, the French words “bandit” and “diphtongue” are pronounced as if there were an added /z/: “bandzit” and “dziphtongue”.

 
 

A diphthong, from the Latin diphthongus (“double sound”), is a “vowel composed of two successive timbres, caused by a modification of the articulation during its emission”[1]. While diphthongs were common in older forms of French, they have completely disappeared from modern French as commonly spoken in France. Québécois French, however, retains a certain number of diphthongs, which is one of its most prominent particularities in comparison with normative modern French. For example, a typical way of pronouncing the word cinq (five) in Québécois French is to diphthongize the vowel, turning it into two distinct sounds. “Strong diphthongization today is socially stigmatized”[2], sometimes as a marker of rurality or lesser education.

The following examples illustrate the morphing between two vowel sounds in the word “cinq”. The first one is a “standard” version, while the second one is somewhat exaggerated.

 
 

Reading further about phonology (the branch of linguistics that studies how languages organize sounds), I soon realized that my initial assumptions were either very incomplete or more or less wrong.

For example, it appears that phonemes are not actually defined by fixed acoustic properties:

The concept of the “phoneme” (a functionally significant unit in the rigidly defined pattern or configuration of sounds peculiar to a language), as distinct from that of the “sound” or “phonetic element” as such (an objectively definable entity in the articulated and perceived totality of speech), is becoming more and more familiar to linguists. The difficulty that many still seem to feel in distinguishing between the two must eventually disappear as the realization grows that no entity in human experience can be adequately defined as the mechanical sum or product of its physical properties.
— Sapir, 1933

This realization, however, led to an interesting point of comparison with music composition: meaningful units within a piece of music are likewise not defined “as the mechanical sum […] of [their] physical properties”, regardless of the nature of the unit itself. Even the clearest motive, whether characterized by pitch contour, rhythmic proportions, or timbre-based perceptual criteria, presents a certain degree of ambiguity: it cannot be identified by precise acoustic measurements alone, and therefore exists only as a psychological entity.

Another consequence of this realization was that my goal of musically imitating phonemes taken out of context would be impossible to achieve, since the acoustic behavior of the phonemes themselves changes depending on the surrounding context and on their position inside a syllable. These limitations, along with the choice of relying only on the piano, led me toward a type of speech imitation based on analogies with the organization of phonemes inside words, rather than a more direct imitation aimed at producing intelligible words from instrumental sounds.

Since phonemes could not be defined or analyzed out of a specific context, the next step was to analyze complete words. For example, I recorded the word “cuir” (leather) and examined how the waveform changed over time in a more or less continuous fashion. This contrasted with my initial conception that words would be made of a series of compounded steady states, and it later influenced how I thought about the compounding of instrumental sounds. While the red lines in Figure 1 indicate clear boundaries between phonemes, one can observe that the waveform changes continuously, with varying degrees of smoothness, depending on the position inside the word and on the phonemes themselves. The transition between the semivowel /ɥ/ and the vowel /i/, for example, is almost imperceptible just by looking at the waveform, while the difference between the /i/ and the fricative consonant /R/ is much more obvious. Even between the “toned” vowel and the “noisy” consonant, however, a quick but continuous transition from one waveform shape to the other can be observed.

 

Figure 1. Waveform representation of the word “cuir” in Québécois French.

 

Using the speech-analysis software Praat, I could get a much clearer representation of the behavior of each phoneme and of the interactions between them:

 

Figure 2. Spectral representation of the word “cuir” in Québécois French.

 

One important characteristic that can be noticed in this spectrogram is the continuous aspect of every discrete phoneme. The semivowel /ɥ/, for example, does not present a steady state at any moment during its emission; it is instead characterized by a rapid change in the distribution of energy across the spectrum. There is also a short burst of noise, called “aspiration”, between the initial plosive /k/ and the semivowel /ɥ/. It was quite interesting to learn that every consonant presents a complex temporal behavior, and not only a fixed frequential structure. Plosives such as /k/, for example, can be characterized according to four acoustic phases: the stop gap (closure), the voicing bar, the release burst, and the aspiration. All these properties could eventually be integrated as refinements of a model for the generation of speech-inspired, timbre-based musical objects.
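To make this temporal behavior more concrete, the following sketch prints a coarse formant track for the whole word at 10 ms intervals, so that the continuous drift between /ɥ/ and /i/, and the unstable or undefined values around the noisy /R/, can be read directly from the numbers. As in the earlier sketch, it assumes the parselmouth library and a hypothetical file name for the recording of “cuir”; the analysis settings are simply Praat’s defaults.

```python
import math
import parselmouth

snd = parselmouth.Sound("cuir_quebecois.wav")  # hypothetical file name
formant = snd.to_formant_burg()                # Praat default analysis settings

step = 0.01                                    # read the track every 10 ms
t = step
while t < snd.duration:
    f1 = formant.get_value_at_time(1, t)
    f2 = formant.get_value_at_time(2, t)
    if math.isnan(f1) or math.isnan(f2):
        print(f"{t:5.2f} s   (no stable formant estimate)")
    else:
        print(f"{t:5.2f} s   F1 = {f1:5.0f} Hz   F2 = {f2:5.0f} Hz")
    t += step
```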

3.1 Phonotactics

To get a better understanding of how phonemes are compounded into words in natural languages, I started reading about phonotactics, “the area of phonology concerned with the analysis and description of the permitted sound sequences of a language”. My idea at this point was that the way timbral units are organized in speech could be imitated in music without referring to any specific word, or even without trying to acoustically imitate the phonemes of a given language, as long as the phoneme-like units of the composition could be aurally differentiated from one another.

I decided to preserve the distinction between vowels and consonants: the former present a high degree of “toneness”, while the latter are largely aperiodic in frequency and therefore carry little audible pitch.

Several rules could then be retained for the compounding of phoneme-like basic instrumental sounds. The following table gives the most frequent combinations of phonemes in French, Spanish, English, and German. An obvious difference between Germanic and Romance languages appears in the most frequent syllable type: consonant-vowel in French and Spanish, and consonant-vowel-consonant in English and German.

 

Table 1. Most frequent syllabic types in French, Spanish, English, and German (Léon 1992)

 

These syllable structures could then be adapted to my own set of recorded piano sounds. As a preliminary experiment, I decided to use the basic model of syllable formation: a mandatory nucleus, which is always a vowel (a toned sound, in my case), surrounded by an optional onset and coda, which must be consonants. In natural languages, these general rules are summarized as follows, but they can be freely adapted to one’s musical needs [3]:

  1. The basic model is: Onset (optional) + Rhyme [Nucleus (obligatory) + Coda (optional)]

  2. The Nucleus must be a vowel (toned sound)

  3. Onset and Coda have to be consonants (toneless sound)

  4. Only certain combinations of consonants are allowed in onset and coda position

  5. These rules change from one language to another

The fourth and fifth rules can be defined according to compositional needs, and can even be changed during the course of a work. They nevertheless provide an effective way of ordering timbre-based sound events in a complex, hierarchical fashion; a minimal sketch of such a rule system is given below.
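The sketch below shows one way these rules could be encoded. It is a minimal Python illustration rather than the OpenMusic-based automation envisaged later in the project; the file names and the allowed onset and coda clusters are hypothetical stand-ins for the recorded piano sounds and for rules 4 and 5.

```python
import random

# Hypothetical inventories: toned ("vowel-like") and noisy ("consonant-like")
# piano recordings, named freely for the sake of the example.
VOWELS = ["res_low.wav", "harmonic_12.wav", "pluck_mid.wav"]
CONSONANTS = ["frame_brush.wav", "key_click.wav", "damp_thud.wav", "knock_wood.wav"]

# Rules 4 and 5: which consonant clusters are allowed in onset / coda position.
# These are compositional choices and can change from piece to piece (or section to section).
ALLOWED_ONSETS = [[], ["key_click.wav"], ["frame_brush.wav", "key_click.wav"]]
ALLOWED_CODAS = [[], ["damp_thud.wav"], ["knock_wood.wav"]]

def make_syllable():
    """Onset (optional) + Nucleus (obligatory, toned) + Coda (optional)."""
    onset = random.choice(ALLOWED_ONSETS)
    nucleus = [random.choice(VOWELS)]
    coda = random.choice(ALLOWED_CODAS)
    return onset + nucleus + coda

def make_word(n_syllables=2):
    """A 'word' is simply a sequence of syllables."""
    word = []
    for _ in range(n_syllables):
        word.extend(make_syllable())
    return word

if __name__ == "__main__":
    for _ in range(3):
        print(" + ".join(make_word(random.randint(1, 3))))
```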

3.2 In Practice

The following examples illustrate my approach to the structural imitation of words as a way of generating timbre-based motives. All basic sounds are sound files recorded from a piano, and no processing is applied other than transposition and cutting.

 

Figure 3. Sonogram representation of motives modelled after speech

 

The first sound to be determined according to the syllabic model is the nucleus, which should present some degree of “toneness”.

 
 

An onset, which can be made of either one or two consonants (in this case, two), can then be added. For the time being, I did not apply any rule prohibiting certain consonants from being combined, but this could be a further refinement.

 
 

Finally, a single consonant coda was added to form a complete syllable.

 
 

A more convincing result could eventually be obtained by using cross-synthesis between the phoneme-like units. This would allow a better imitation of the behavior of speech: since phonemes rarely consist of a single steady state, it would yield a more natural-sounding result. For the electronic part of my piece for piano and electronics, a partial automation of the process of compounding the sound files could be achieved using OMChroma, a library for the computer-assisted composition software OpenMusic. It should prove feasible to work with rule-based compounding algorithms and to generate “words” and sentences algorithmically. In the context of this first attempt, however, all the compounding of my “phonemes” was done by hand in Reaper.
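As an indication of what such a partial automation could look like outside of OpenMusic, the sketch below simply overlaps consecutive sound files with a short linear crossfade, approximating the continuous transitions discussed above without being true cross-synthesis. It assumes the numpy and soundfile Python libraries, hypothetical mono files sharing a single sample rate, and the hypothetical file names introduced in the earlier syllable sketch.

```python
import numpy as np
import soundfile as sf

def crossfade_concat(paths, fade_s=0.05, sr_expected=48000):
    """Concatenate mono sound files with a short linear crossfade between them."""
    out = np.zeros(0)
    for path in paths:
        data, sr = sf.read(path)
        assert sr == sr_expected, f"{path}: unexpected sample rate {sr}"
        if data.ndim > 1:                      # fold stereo down to mono if needed
            data = data.mean(axis=1)
        fade = min(int(fade_s * sr), len(out), len(data))
        if fade == 0:                          # first file (or a fade too short to apply)
            out = np.concatenate([out, data])
            continue
        ramp = np.linspace(0.0, 1.0, fade)
        overlap = out[-fade:] * (1.0 - ramp) + data[:fade] * ramp
        out = np.concatenate([out[:-fade], overlap, data[fade:]])
    return out

# Hypothetical "word": onset + nucleus + coda, as produced by the syllable rules above.
word = ["frame_brush.wav", "key_click.wav", "res_low.wav", "damp_thud.wav"]
sf.write("word_01.wav", crossfade_concat(word), 48000)
```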

This strategy was generalized to produce a few dozen short sound files, to be triggered at exact moments in relation to the piano part. Different combinations of the basic consonant-like or vowel-like sounds were produced, in varying orders, but always with the intention of producing word-like units. Figure 4 shows an example of the notated electronics and piano parts together. The notation of the electronics occasionally uses the International Phonetic Alphabet as an indicator of the type of sonority, as if the part were sung. The vaguely sibilant /s/ here, however, is produced from a recording of the piano frame being brushed rather than from the voice.

 

Figure 4. Phonotactiques, mm. 14-15

 

The /o/ part, on the other hand, is made by mixing different piano sounds with their attacks cut and applying formant filtering corresponding to an /o/. The resulting sounds have varying degrees of realism, but in any case the structural approach to compounding independent sound files remained of compositional interest to me.
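As an indication of how such a filter could be built, the sketch below sums a few parallel band-pass filters centered on rough, textbook-style formant values for /o/ and applies them to a piano sample whose attack has been removed. The center frequencies, bandwidths, file names, and the use of the scipy and soundfile libraries are all assumptions made for the sake of the example; in practice such values would be adjusted by ear or taken from a recorded voice.

```python
import numpy as np
import soundfile as sf
from scipy.signal import butter, sosfilt

# Approximate formant centres for /o/ (Hz); rough values that would be tuned by ear.
FORMANTS_O = [(400, 100), (800, 150), (2500, 300)]   # (centre, bandwidth)

def formant_filter(x, sr, formants):
    """Sum of parallel band-pass filters, one per (centre, bandwidth) pair."""
    y = np.zeros_like(x)
    for f0, bw in formants:
        low, high = (f0 - bw / 2) / (sr / 2), (f0 + bw / 2) / (sr / 2)
        sos = butter(2, [low, high], btype="bandpass", output="sos")
        y += sosfilt(sos, x)
    return y / len(formants)

data, sr = sf.read("piano_no_attack.wav")             # hypothetical piano sample, attack removed
if data.ndim > 1:                                     # fold stereo down to mono if needed
    data = data.mean(axis=1)
sf.write("piano_o.wav", formant_filter(data, sr, FORMANTS_O), sr)
```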

The resulting piece, Phonotactiques, for piano and electronics, was performed by pianist Rosane Lajoie and myself at the live@CIRMMT event at McGill University in October 2023. In the piece, the electronics and the instrumental part act as a duo. The electronic part is made solely of sound files and behaves autonomously rather than as an extension of the piano part. All the sound files were produced according to the rules described above, compounding phoneme-like sounds that originate from piano recordings. This strategy allowed the electronics and the instrumental part to remain in a similar timbral space, even though their behaviors are completely different.

 

Video of the premiere with Rosane Lajoie, piano, and Louis-Michel Tougas, electronics. live@CIRMMT, October 2023.

 

4. Conclusion

In this study, I presented a personal way of dealing with speech as a timbre model for orchestration, which I believe opens a promising path for a rich timbre-based compositional approach. While the preliminary research on phonemic analysis and speech synthesis has proven of great interest, the final approach I adopted draws more on the structural properties of speech and how individual sounds are aggregated than on imitating these sounds.

As mentioned in the introduction, many different characteristics of natural languages can serve as a model for orchestration, whether they are specifically related to their timbral components, or more general temporal structures. However, as I have briefly shown above, the frequency domain and the time domain are in fact completely interrelated in the context of speech, as the acoustic realization of every phoneme is in reality dependent on its position inside a syllable, and vice-versa.

Using a very broad definition of orchestration such as the one proposed by McAdams (“Orchestration involves the choice, combination or juxtaposition of sounds to achieve a musical end.”), it could even be said that orchestration inspired by, or based on, structures of natural languages dates as far back as Greek Antiquity, where instrumentation choices were made according to the expressive needs and syllabic structure of a given poem, for example. Therefore, as timbre is in itself a temporally dynamic phenomenon, even the lowest-level attempt at using speech as a model for orchestration relies, consciously or not, on a complex network of temporal and frequential interactions.

A second essential point concerns the degree to which one attempts to imitate speech literally through instrumental or electroacoustic means. For example, using advanced technology and extensive instrumental and electronic resources, Jonathan Harvey’s orchestral piece Speakings achieves the objective of making the orchestra “speak” with a remarkable degree of realism. In my approach, by contrast, a single aspect of speech has been taken as a model rather than its general acoustic characteristics. In that sense, speech serves as a more indirect model for material generation than in Harvey’s case.

After multiple initial experiments, I decided to rely on a broader analogy inspired by perceptual and structural characteristics of Québécois French, rather than attempting to directly imitate words from the language itself. Specifically, phonotactics, the rules governing the permissible combination and ordering of phonemes, became the main point of interest. This approach was adopted with the awareness that it would lead to a less realistic imitation, but it could potentially open other possibilities and freer ways of thinking about orchestration.

After starting the research process with general music-oriented analysis software such as Spear, I moved to Praat, a program specifically aimed at research and analysis in the field of phonology, in order to perform formant analysis. However, given the realizations mentioned above concerning the limitations of translating speech directly into orchestration, I adopted a more liberal approach and used various techniques, such as simply zooming in closely on the time domain to examine the temporal relations between phonemes, or using more general-purpose spectrograms such as those provided by the software Partiels.

A secondary aspect concerns the temporal span of the musical objects generated through the models developed during this project. While the strict recognition of speech depends heavily on the relative onsets of the phonemes within a word, my use of instrumental sounds requires, for the time being, a more intuitive, case-specific approach. For future research, the relation between phonetic position and temporal onset would be of great interest, as it also greatly affects the way listeners fuse or segregate the various acoustic units that form the speech-like musical motives I generate.

5. Acknowledgements

I would like to thank the ACTOR project for supporting this research project through the Collaborative Student Grant, as well as composer Darren Xu, with whom I interacted constantly during the research process. Many thanks to pianist Rosane Lajoie, who participated in the recording of the piano sound files, as well as the recording of all the phonemes of Québécois French. I would also like to thank Stephen McAdams and Roger Reynolds for their insight. Lastly, thank you to reviewers Andres Guttierez Martinez and Christopher Soden for their additional comments and suggestions.

6. References

  • Pritchard, N., Wang, R., & Fels, J. (2011). Ubiquitous voice synthesis: Interactive manipulation of speech and singing on mobile distributed platforms. In Proceedings of CHI 2011: Extended Abstracts on Human Factors in Computing Systems (pp. 335–240). Vancouver, BC: Vancouver Convention Centre.

  • Barlow, C. (1998). On the spectral analysis of speech for subsequent resynthesis by acoustic instruments. Forum Phoneticum, 66, 183–190.

  • Carlson, R. (1995). Models of Speech Synthesis. Proceedings of the National Academy of Sciences of the United States of America, 92(22), 9932–9937.

  • Carpentier, G., & Bresson, J. (2009). Interacting with symbolic, sound and feature spaces in Orchidée, a computer-aided orchestration environment. Computer Music Journal.

  • Cook, P. R. (1996). Singing voice synthesis: History, current work, and future directions. Computer Music Journal, 20(3), 38. doi:10.2307/3680822

  • Martin, P. (2002). Le système vocalique du français du Québec. De l'acoustique à la phonologie. La linguistique, 38(2), 71–88.

  • Gilbert, N., Cont, A., Carpentier, G., & Harvey, J. (2009). Making an orchestra speak. Sound and Music Computing. Porto, Portugal.

  • O’Callaghan, J. (2015). Mimetic instrumental resynthesis. Organised Sound, 20(2).

  • Rodet, X., Potard, Y., & Barriere, J.-B. (1984). The CHANT project: From the synthesis of the singing voice to synthesis in general. Computer Music Journal, 8(3), 15. doi:10.2307/3679810

  • Ruwet, N. (1959). Contradictions du langage sériel. Revue belge de musicologie, 13, 83–97.

  • Sapir, E. (1933). Language. Encyclopaedia of the Social Sciences, 9, 155–169.

  • Smalley, D. (1997). Spectromorphology: Explaining sound-shapes. Organised Sound, 2(2), 107–126.

    Additional sources

  • Diphtongue: https://usito.usherbrooke.ca/d%C3%A9finitions/diphtongue

  • A corpus-aided English pronunciation teaching and learning system and teacher training: https://corpus.eduhk.hk/english_pronunciation/

  • Phonotactics : https://www.merriam-webster.com/dictionary/phonotactics

  • Reconnaissance de phonèmes par analyse formantique dans le cas de transitions voyelle-consonne: https://www.chireux.fr/mp/TIPE/ADS/Reconnaissance%20vocale.pdf

  • Syllabic structure: https://www.sfu.ca/fren270/phonologie/page4_7.html#start

  • IPA in French:  https://vitrinelinguistique.oqlf.gouv.qc.ca/22137/la-prononciation/notions-de-base-en-phonetique/les-symboles-de-lalphabet-phonetique-internationa

  • La prononciation du Français québécois: https://usito.usherbrooke.ca/articles/th%C3%A9matiques/dumas_1

  • John McCarthy - Timbral analysis: https://ccrma.stanford.edu/~jmccarty/formant.htm

  • UCLA Phonological Segment Inventory Database. (2019).
