By now, most people have probably interacted with a conversational agent through a device, a chatbot, or an interactive voice response system, such as the automated voice that asks you to hold when you call your bank. Conversational agents, also known as dialogue systems, have grown in popularity over the past few years with the rise of the “Smart” era: smartphones, smart speakers, smart cars, smart homes, and the list goes on. These AI-powered systems often interact with human users through natural language, such as text and speech, and through subtler signals, such as facial movements and gestures, which together shape how humans and computers interact.
Unlike most mechanical human activities, human dialogue can be complicated to model, as it reflects the diverse make-up of who we are as humans. Think about the interactions you’ve had this week; maybe you had a meeting with your boss, had a little chat with the cashier at the grocery store, spoke to your sibling in a different language, or talked to your pet. The way you interact with each of them depends on the context and on your relationship with them, which shapes the conversations you have. All these differences, displayed by just one person, make it difficult to personalize our interactions with machines in a way that feels natural to us.
To understand how a natural conversation with a computer works, let us look at the components that current spoken dialogue systems use to make interaction possible. As an example, we will start with a constrained, task-oriented use case for our dialogue system. Imagine the following situation:
The ESRs are meeting in Edinburgh for a weeklong workshop. We have just finished a day filled with fun lectures, and we are looking forward to unwinding with some food and drinks. To get some recommendations, one of the ESRs speaks with our virtual assistant, Venus.
A Spoken Dialogue System Diagram (Serban et al., 2015)
An Overview of the Components of a Spoken Dialogue System
- Automatic Speech Recogniser (ASR): Using a smart device of your choice, you open the conversation with a request to the system: “Hi Venus, we need a restaurant recommendation in Edinburgh for tonight.” To understand your spoken input, the system first passes it to the ASR, which converts your speech into text.
- Natural Language Understanding (NLU): We could also have expressed the request differently. For example, we could have asked Venus: “Hi Venus, any place to eat and chill around Edinburgh?” or “Where can we dine in Edinburgh this evening?” All three utterances, though different, have loosely the same meaning. This may be intuitive to human listeners, but the computer needs to extract the intention and key information from the utterance. The NLU identifies the Intent (a request for restaurant recommendations) and the Entities (Restaurants and Edinburgh).
- Dialogue State Tracker and Response Selection: This component manages the state of the interaction given the intents and entities from the NLU. It decides what the best course of action is. Perhaps it already has some recommendations, or it needs to ask some follow-up questions about cuisine choice or atmosphere to make an even better recommendation. These selections will come out in the form of building blocks like [cuisine-choice] or [size-of-group].
- Natural Language Generation (NLG): This component takes the building blocks from the response selection and generates meaningful and coherent sentences in text. A shortcut in this step could be to have predefined templates in the system, such as “What type of [cuisine, atmosphere] would you like to have?” or “I am sorry, but there is no availability for [fifteen] people.”
- Text-to-Speech (TTS): Finally, the system generates speech from the generated text as a follow-up to the user utterance. Ideally, these systems should generate speech in a way that is understood correctly by the user.
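The text-based middle of this pipeline (NLU, dialogue state tracking with response selection, and template-based NLG) can be sketched in a few lines of Python. This is only a toy illustration under stated assumptions, not how Venus or any production system is built: the keyword lists, slot names, templates, and function names below are all hypothetical, and the ASR and TTS stages are left out, so the sketch takes text in and gives text out.

```python
import re
from dataclasses import dataclass, field
from typing import Optional

# Hypothetical keyword lists; a real NLU would usually be a trained model.
INTENT_KEYWORDS = {"request_restaurant": ("restaurant", "eat", "dine")}
KNOWN_CITIES = ("Edinburgh",)

# Slots the system wants filled before recommending (assumed for this sketch).
REQUIRED_SLOTS = ("city", "cuisine", "group_size")

# Predefined NLG templates, keyed by the slot the system still needs.
TEMPLATES = {
    "city": "Which city are you in?",
    "cuisine": "What type of cuisine would you like to have?",
    "group_size": "How many people will be joining?",
}


@dataclass
class DialogueState:
    """What the system knows so far about the conversation."""
    intent: Optional[str] = None
    slots: dict = field(default_factory=dict)


def understand(text):
    """Toy NLU: find the intent and entities by keyword matching."""
    words = set(re.findall(r"[a-z]+", text.lower()))
    intent = next(
        (name for name, keys in INTENT_KEYWORDS.items() if words & set(keys)),
        None,
    )
    entities = {}
    for city in KNOWN_CITIES:
        if city.lower() in words:
            entities["city"] = city
    return intent, entities


def track_and_select(state, intent, entities):
    """Update the state, then return the next missing slot (or None)."""
    if intent is not None:
        state.intent = intent
    state.slots.update(entities)
    for slot in REQUIRED_SLOTS:
        if slot not in state.slots:
            return slot
    return None


def generate(action):
    """Template-based NLG: turn the selected building block into a sentence."""
    if action is None:
        return "I have a recommendation for you."
    return TEMPLATES[action]


def respond(state, user_text):
    """Run one dialogue turn: NLU, then state tracking, then NLG."""
    intent, entities = understand(user_text)
    return generate(track_and_select(state, intent, entities))


state = DialogueState()
print(respond(state, "Hi Venus, we need a restaurant recommendation in Edinburgh for tonight."))
# prints: What type of cuisine would you like to have?
```

With this keyword-based NLU, all three phrasings of the request map to the same intent and the same city entity, which is the behaviour the NLU step describes. Keyword matching is brittle, though, and is used here only to keep the sketch short.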
Conversational AI is a fascinating field of research and a melting pot of many academic disciplines. The opportunities are enormous, and it is an interesting way to apply what we know about human-to-human interaction to improve our interaction with computers. Although the technology is already widely used, it raises ethical and moral issues that have not yet been fully addressed, including bias, privacy, accessibility, and misuse for malicious purposes. We hope that there will be more interdisciplinary efforts to address these issues.
If you want to know more about our projects and the ESRs working on them, please look under the Training tab.
Serban, I. V., Lowe, R., Charlin, L., and Pineau, J. (2015). A survey of available corpora for building data-driven dialogue systems. arXiv preprint arXiv:1512.05742.
Featured image by vectorjuice via freepik