Spoken Language Understanding of Human to Human Conversations

Human-to-human conversations are a dynamic and interactive exchange of information that tends to be informal, verbose, and repetitive. This leads to lower information density and more topic diffusion, since the spoken content of a conversation is shaped by the speakers, each with their own thought process and potentially distracting, parallel streams of thought.

The field of computational analysis for understanding such conversations is popularly called spoken language understanding (SLU). The majority of existing approaches for SLU operate in two stages:

  • The first stage of speech metadata extraction (SME) involves identifying all the speakers in the conversation, their respective start and end times, and the content of their speech in the form of text.

  • The second stage of natural language understanding (NLU) involves the analysis of the text with the context of the conversation to extract relevant details of the conversation.

Some recent approaches explore SLU with just one stage, i.e., given the conversational audio, all the key insights are extracted by a single model. This approach avoids the error-cascading problems of a multi-stage system.
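
To make the two-stage structure concrete, here is a minimal sketch of how the stages could be chained. Every function it calls (detect_language, diarize, transcribe, and the NLU helpers) is a hypothetical placeholder for a real model, not a reference to any specific toolkit.

    # A minimal, hypothetical sketch of a two-stage SLU pipeline.
    # Each helper (detect_language, diarize, transcribe, ...) stands in for
    # a real model and is assumed, not part of any specific library.

    def extract_speech_metadata(audio_path):
        """Stage 1 (SME): who spoke, when, and what they said."""
        language = detect_language(audio_path)            # assumed helper
        segments = diarize(audio_path)                    # [(start, end, speaker), ...]
        turns = [
            {
                "speaker": speaker,
                "start": start,
                "end": end,
                "text": transcribe(audio_path, start, end, language),  # assumed helper
            }
            for (start, end, speaker) in segments
        ]
        return {"language": language, "turns": turns}

    def run_nlu(metadata):
        """Stage 2 (NLU): extract insights from the transcribed conversation."""
        turns = metadata["turns"]
        return {
            "dialog_acts": classify_dialog_acts(turns),   # assumed helper
            "entities": extract_entities(turns),          # assumed helper
            "summary": summarize(turns),                  # assumed helper
        }

    insights = run_nlu(extract_speech_metadata("call_recording.wav"))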

In this blog, we will go through the different blocks of a conversation analysis (speech analytics) system. Such systems can be used to analyse and summarize multi-speaker meetings. We can also use speech analytics to monitor call-centre conversations and derive insights such as: are the agents treating customers well, which agents need training, and what are the customers complaining about?

Speech metadata extraction


The sequence of steps involved in speech metadata extraction is visualized in the above figure. As an example, let us choose a call-centre conversation between an agent and a customer. The details of each step are discussed below.

Call recording

The only input to a speech analytics system is the call recording. Some of the common challenges in call recordings are

  • Acoustic challenges

    • Single vs. stereo channels: Some telephony recordings have the agent and customer voices on separate channels. This is the ideal scenario, as it makes it easy to identify where the agent and the customer are speaking. However, the majority of telephony recordings have both voices on a single channel. For such single-channel recordings, you will need a diarization system to identify the different speakers and where they speak within the recording (a short channel-splitting sketch follows this list).

    • Noisy backgrounds

      • There can be background noise on both the customer and agent sides, depending on where they are calling from. Agents generally have high side-chatter noise on their end because multiple agents sit in a single room and speak to customers simultaneously. Such noise can degrade ASR performance and, in turn, NLU performance.

    • Multiple speakers

      • Some conversations involve multiple agents or customers, depending on the complaint. Irrespective of whether the recording is single-channel or stereo, we will need a diarization system to figure out how many speakers are present in a call and when they speak.

  • Transmission and storage challenges: To enable faster transmission and reduce storage requirements, call recordings might be compressed, resulting in codec-related artefacts that can further affect SLU performance.
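
As a concrete illustration of the stereo case mentioned above, the following sketch splits a two-channel recording into separate agent and customer files using the soundfile library. The file names and the assumption that channel 0 carries the agent are made up for the example; in practice this depends on the telephony setup.

    import soundfile as sf

    # Read a call recording; for stereo files `audio` has shape (num_frames, num_channels).
    audio, sample_rate = sf.read("stereo_call.wav")

    if audio.ndim == 2 and audio.shape[1] == 2:
        # Which channel holds the agent is an assumption of this example.
        sf.write("agent_channel.wav", audio[:, 0], sample_rate)
        sf.write("customer_channel.wav", audio[:, 1], sample_rate)
    else:
        # Mono recording: both speakers share one channel, so a
        # diarization system is needed to separate them (see below).
        print("Single-channel recording; diarization required.")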

Language identification

The first step of SME is to identify the language of the call. This block is especially important when the call centre receives calls in multiple languages. Identifying the language of each call lets us measure the volume of calls in each language, and further helps us choose the language-specific downstream models for speech-to-text (ASR) and SLU.
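
One way to prototype this block is with a pretrained spoken language-identification model. The sketch below uses SpeechBrain's VoxLingua107 classifier as an example; the model name, the return format of classify_file, and the routing table of downstream models are assumptions to adapt to your own setup.

    from speechbrain.pretrained import EncoderClassifier

    # Pretrained spoken language-ID model (VoxLingua107). The model name is an
    # example, and newer SpeechBrain versions expose it under speechbrain.inference.
    lang_id = EncoderClassifier.from_hparams(
        source="speechbrain/lang-id-voxlingua107-ecapa",
        savedir="pretrained_models/lang-id",
    )

    # classify_file returns (posteriors, score, index, predicted labels).
    _, _, _, labels = lang_id.classify_file("call_recording.wav")
    language = labels[0].split(":")[0]   # e.g. "en"

    # Route to language-specific downstream models (names are hypothetical).
    asr_models = {"en": "asr_en_model", "hi": "asr_hi_model"}
    asr_model_name = asr_models.get(language, "asr_en_model")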

Diarization and voice activity detection (VAD)

As discussed in the call recording section above, a diarization system is required whenever multiple speakers are recorded on a single channel. The combined VAD + diarization system tells us how many speakers are present, at what timestamps they speak, and where the non-speech or silence regions are. The speech regions are processed by downstream models to identify the spoken content. The non-speech regions are used to evaluate different aspects of the call: How noisy is the call? Is the noise from the agent's side or the customer's? What kind of noise is prominent? This information gives us additional insights for improving the customer experience.
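
A common open-source starting point for this block is the pyannote.audio diarization pipeline. The sketch below assumes you have accepted the model's terms on Hugging Face and hold an access token; the exact pipeline name and arguments may differ across pyannote.audio versions.

    from pyannote.audio import Pipeline

    # The pretrained pipeline performs VAD + speaker diarization together.
    # Model name and token handling are assumptions tied to the installed version.
    pipeline = Pipeline.from_pretrained(
        "pyannote/speaker-diarization-3.1",
        use_auth_token="YOUR_HF_TOKEN",
    )

    diarization = pipeline("call_recording.wav")

    # Each track tells us which speaker talks during which time span.
    for segment, _, speaker in diarization.itertracks(yield_label=True):
        print(f"{speaker}: {segment.start:.1f}s - {segment.end:.1f}s")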

Gender identification

For each speaker in the call recording, we can identify their gender. This is additional metadata that can be used to derive business insights from the call centre.

Speech-to-text or automatic speech recognition (ASR)

Finally, based on the detected language, the corresponding ASR system is chosen to recognize the spoken text in each of the speech regions of the call recording.
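
As an illustration, an open-source Whisper model can transcribe each speech region once the language is known. The file name and the 'small' checkpoint below are arbitrary choices, and in a real pipeline you would transcribe the per-speaker segments produced by diarization.

    import whisper

    # Load a Whisper checkpoint; the model size is an arbitrary choice here.
    model = whisper.load_model("small")

    # Transcribe one speech region, forcing the language detected earlier
    # so Whisper does not re-run its own language detection.
    result = model.transcribe("speaker_segment.wav", language="en")
    print(result["text"])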

Natural language understanding


The above figure visualizes some of the NLU items over the conversational output of SME. We will discuss each of these NLU items in detail below.

Speech turn

A speech turn is one instance of a speaker's dialogue. A conversation is made of multiple speech turns from all the participating speakers. For example, in the above figure, 'Good morning, how may I help you?' is the first speech turn, and 'Hello. I am calling regarding my internet connection.' is the second speech turn. There is no duration criterion for a speech turn; it can be as short as a single word or multiple sentences long.
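
A speech turn can be represented as a simple record carrying the speaker label, the timestamps from diarization, and the ASR text. The dataclass below is just an illustrative structure, populated with the example turns above.

    from dataclasses import dataclass

    @dataclass
    class SpeechTurn:
        speaker: str       # e.g. "agent" or "customer"
        start: float       # start time in seconds, from diarization
        end: float         # end time in seconds, from diarization
        text: str          # transcribed content from ASR

    turns = [
        SpeechTurn("agent", 0.0, 2.4, "Good morning, how may I help you?"),
        SpeechTurn("customer", 2.6, 6.1,
                   "Hello. I am calling regarding my internet connection."),
    ]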

Speech dialog act detection

Segmenting a conversation into speech turns is not, by itself, very helpful for applications; we need a more meaningful way of segmenting the conversation. Multiple proposals exist in the literature, and one of the popular approaches uses speech dialog acts. Under this scheme, speech turns are split into meaningful segments based on the following four broad speech acts, and each dialog act can be further classified into sub-speech acts.

    1. Constatives: Sentences that are making a statement

        • Answering, Claiming, Confirming, Denying, Disagreeing, Stating

    2. Directives: Sentences attempting to get the addressee to do something

        • Advising, Asking, Forbidding, Inviting, Ordering, Requesting

    3. Commissives: Sentences committing the speaker to future action

        • Promising, Planning, Vowing, Betting, Opposing

    4. Acknowledgements: Sentences expressing the speaker’s attitude regarding some social action

        • Apologizing, Greeting, Thanking, Accepting an acknowledgement


After speech dialog act detection, every speech turn is split into meaningful sentences, each categorized by its main speech act and sub-speech act.
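
If labelled data is scarce, one way to prototype this block is zero-shot classification over the four broad acts. The sketch below uses the Hugging Face transformers zero-shot pipeline with the bart-large-mnli checkpoint, which is just one convenient public choice, and the example sentence is made up.

    from transformers import pipeline

    # Zero-shot classifier; the checkpoint is one common public choice.
    classifier = pipeline("zero-shot-classification",
                          model="facebook/bart-large-mnli")

    acts = ["constative", "directive", "commissive", "acknowledgement"]

    sentence = "Could you please restart your router?"
    result = classifier(sentence, candidate_labels=acts)
    print(result["labels"][0])   # highest-scoring act, e.g. "directive"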

Speech chapters, or keypoints identification

In a conversation, multiple consecutive speech turns can be about the same topic. For example, in the above figure, we see that the complete conversation can be split into five broad topics - opening, information verification, conflict situation, problem resolution, and closing. These topics can vary across calls; however, the total number of such topics is typically a small, finite set. In the literature, this grouping of consecutive turns on the same topic is also referred to as speech chapters or speech keypoint identification. This kind of topic identification helps the NLU layer understand the context of the speech acts. For example, the sentence 'app is not opening' can mean different things depending on which chapter it is detected under: if the detected chapter is 'conflict situation', the sentence is describing the conflict; if the detected chapter is 'closing', it may mean the customer is simply enquiring about possibilities beyond the main problem.
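
One simple way to sketch chapter boundaries is to embed consecutive turns and start a new chapter whenever the similarity between neighbouring turns drops below a threshold. In the sketch below, the sentence-transformers checkpoint, the example turns, and the 0.3 threshold are all assumptions for illustration; production systems usually use a trained segmentation model instead.

    from sentence_transformers import SentenceTransformer, util

    # Small general-purpose sentence encoder; the checkpoint is an example choice.
    encoder = SentenceTransformer("all-MiniLM-L6-v2")

    turn_texts = [
        "Good morning, how may I help you?",
        "Hello. I am calling regarding my internet connection.",
        "It has been down since yesterday evening.",
        "I am sorry to hear that. Let me check your account details.",
    ]
    embeddings = encoder.encode(turn_texts, convert_to_tensor=True)

    # Start a new chapter whenever consecutive turns drift apart in meaning.
    chapters, current = [], [0]
    for i in range(1, len(turn_texts)):
        similarity = util.cos_sim(embeddings[i - 1], embeddings[i]).item()
        if similarity < 0.3:          # threshold is an assumption, tune per dataset
            chapters.append(current)
            current = []
        current.append(i)
    chapters.append(current)          # each chapter is a list of turn indices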

Call topics or themes identification

The call topic or theme is call-level metadata. For a given call, we identify its overall topic, which helps us understand the volume of calls for different call-level topics.

Sentiment or emotion recognition in conversation

Recognizing emotion from a standalone speech act does not work well for conversations, because context plays a huge role; see the above figure for an example. The ideal way to recognize emotion is to use the text from consecutive turns of the conversation, and additionally include the acoustic information.
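
A minimal way to give a text classifier some conversational context is to concatenate the previous few turns with the current one before classification. The emotion checkpoint below is a public text-only model used as an example, the turns are made up, and adding acoustic information would require a separate audio model.

    from transformers import pipeline

    # Text-only emotion classifier; the checkpoint is one public example.
    emotion = pipeline("text-classification",
                       model="j-hartmann/emotion-english-distilroberta-base")

    turn_texts = [
        "I have called three times already and nothing has changed.",
        "I understand, and I apologize for the inconvenience.",
        "This is really frustrating.",
    ]

    # Classify the last turn together with a window of the previous turns.
    context_window = " ".join(turn_texts[-3:])
    print(emotion(context_window)[0])   # e.g. {'label': 'anger', 'score': ...}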

Named entity recognition (NER)

As shown in the above figure, given a sentence we want to identify the words that fall into the entity types of interest. Detecting these entities and their values is critical for any speech analytics system. Both the ASR and NER systems have to be tuned to get the best recall and precision on these words.
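
As a quick illustration of this step, spaCy's pretrained English pipeline extracts a general-purpose set of entities out of the box, assuming the en_core_web_sm model has been downloaded. In a real deployment you would typically fine-tune it (or a comparable model) on the entities specific to your business; the sentence below is invented.

    import spacy

    # Small English pipeline with a general-purpose NER component.
    nlp = spacy.load("en_core_web_sm")

    doc = nlp("I am calling about the order placed on 3rd March with Acme Broadband.")
    for ent in doc.ents:
        print(ent.text, ent.label_)   # e.g. "3rd March" DATE, "Acme Broadband" ORG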

Intent recognition

Intents are speech acts that carry business insights. For example, for the business insight 'did the agent introduce the company while greeting?', you would look for a 'constative' speech act in the 'opening' chapter and check whether the 'company name' entity was detected by the NER module. This rule-based approach can become messy for products at scale; alternatively, if you have sufficient examples for each intent, you can train a standalone model for intent recognition.
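
To make the rule-based approach concrete, the following sketch checks the example insight 'did the agent introduce the company while greeting?'. The segment structure and the company name are hypothetical and would come from the earlier dialog-act, chapter, and NER modules.

    # Each segment is assumed to carry outputs from earlier modules:
    # speaker, dialog act, chapter label, and the entities found by NER.
    segments = [
        {"speaker": "agent", "act": "constative", "chapter": "opening",
         "entities": {"ORG": ["Acme Broadband"]}},
        {"speaker": "customer", "act": "constative", "chapter": "opening",
         "entities": {}},
    ]

    def agent_introduced_company(segments, company="Acme Broadband"):
        """Rule: a constative agent segment in the opening chapter mentions the company."""
        return any(
            s["speaker"] == "agent"
            and s["act"] == "constative"
            and s["chapter"] == "opening"
            and company in s["entities"].get("ORG", [])
            for s in segments
        )

    print(agent_introduced_company(segments))   # True for this example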

Coreference resolution


Coreference resolution is the task of automatically identifying expressions in the conversation that refer to the same entity.


Text: I still remember [our teacher], [Mrs. Jackson]. [She] was [our first grade teacher] and [she] always motivated me.

Coreferences: [our teacher] - [Mrs. Jackson] - [she] - [our first grade teacher] - [she]

In the above example, we see that 'Mrs. Jackson' was referred to in more than one way. The coreference task helps identify what the speaker is referring to. This module is critical for the summarization and comprehension tasks.

Dialogue or conversation summarization

This is the task of summarizing the entire conversation in a couple of sentences. There are two varieties of algorithms: the extractive variety simply identifies important sentences in the conversation, while the abstractive variety summarizes the conversation into newly generated text. The abstractive variety is harder to model than the extractive one.
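
For a quick abstractive baseline, the Hugging Face summarization pipeline can be run over the concatenated transcript. The checkpoint below was trained on news articles, so a dialogue-tuned checkpoint would usually be a better fit; the transcript is invented, and long calls would need to be chunked to fit the model's input limit.

    from transformers import pipeline

    # Abstractive summarizer; this checkpoint is a generic example trained on news.
    summarizer = pipeline("summarization", model="facebook/bart-large-cnn")

    transcript = (
        "Agent: Good morning, how may I help you? "
        "Customer: Hello. I am calling regarding my internet connection. "
        "It has been down since yesterday evening. "
        "Agent: I am sorry to hear that. I have raised a ticket and a "
        "technician will visit you tomorrow morning."
    )

    print(summarizer(transcript, max_length=60, min_length=15)[0]["summary_text"])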

Comprehension on conversation

The comprehension task involves answering free-form questions about the conversation. The answers to some of these questions may be spread across multiple speech turns, and your model should be able to handle such questions. This is a fairly complex task, even more so than summarization.
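
As a simple extractive baseline, the Hugging Face question-answering pipeline can answer questions against the concatenated transcript. The checkpoint and the transcript below are examples; questions whose answers span multiple turns usually require a stronger generative model.

    from transformers import pipeline

    # Extractive question answering; the checkpoint is one common public example.
    qa = pipeline("question-answering",
                  model="distilbert-base-cased-distilled-squad")

    context = (
        "Agent: Good morning, how may I help you? "
        "Customer: Hello. I am calling regarding my internet connection. "
        "It has been down since yesterday evening. "
        "Agent: I have raised a ticket and a technician will visit tomorrow morning."
    )

    answer = qa(question="When will the technician visit?", context=context)
    print(answer["answer"])   # e.g. "tomorrow morning"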