ESR9: Acoustic-phonetic alignment in synthetic speech

PhD Fellow: Johannah O’Mahony

My name is Johannah O’Mahony and I’m originally from Dublin, Ireland. I studied General Linguistics at the University of Amsterdam, where I became increasingly interested in phonetics and psycholinguistics. After my bachelor’s degree, I moved to Saarbrücken to do an M.Sc. in Computational Linguistics, where I continued to study phonetics and psychophonetics. During this time, I also focused on speech processing and speech technology such as Text-to-Speech (TTS) and Automatic Speech Recognition (ASR). After my degree, I worked in industry in Germany and Austria for two years with a focus on both TTS and ASR.

My current project as part of COBRA is concerned with phonetic convergence and how we can implement convergence behaviour in synthetic speech using the latest speech synthesis models. I will also test whether implementing convergence can have a positive effect on human-computer interaction.


The aim of the project is to discover whether there is a benefit from employing acoustic-phonetic alignment in synthetic speech. Here, alignment refers to changes in pronunciation (e.g., vowel space) or prosody (e.g., pitch range, speaking rate). It is already possible to control these and related factors, such as vocal tract length, in Hidden Markov Model (HMM) speech synthesis, using machine learning methods previously developed by UEDIN and its collaborators. Changing vocal tract length and pitch range, for example, gives control over perceived gender. Possible applications are human-machine dialogue and human-human interaction where one of the humans is using a voice-output communication aid.

Expected results:

  • extension of control techniques from HMM to the latest Deep Neural Network synthesis, providing controllable high-quality synthetic speech for use in experiments;
  • control of this system using vocal tract length, vowel space and prosody measured (using already-available signal processing methods) from the human interlocutor’s speech;
  • initial results regarding the effects on listeners of controlling the above factors, in a non-interactive situation;
  • final results regarding the effectiveness of acoustic-phonetic alignment on interlocutor behaviour and overall task success / user satisfaction in both simulated human-machine and real human-human interaction scenarios.

Based in Edinburgh, UK

Full-time three-year contract, starting September 2020

PhD enrolment at: University of Edinburgh

Main supervisor’s institution: University of Edinburgh

Main supervisor: Prof Simon King


Planned secondments:

  • ReadSpeaker, Uppsala: to apply the developed methods to commercial-quality synthetic speech (5.5 months);
  • University of Helsinki: application to user-adaptive systems for the improvement of second-language pronunciation, in which the teacher (i.e., machine) aligns with particular aspects of the student’s speech: first matching gender, then prosody, then more subtle effects (5 months).

Co-supervisors’ institutions:

  • ReadSpeaker, Uppsala, Sweden
  • University of Helsinki, Finland
