The Role of Voice Activity Detection in Low-Latency AI Conversations

Voice Activity Detection

We are now firmly in the era of AI agents and conversational AI, and with that rise, voice activity detection (VAD) has become even more crucial. It helps the system quickly understand when a person starts speaking and when they stop. In other words, we need systems that can quickly spot the SOS (start of sentence) and EOS (end of sentence), with low computational load and a small library size.

A good VAD system also has to keep both false negatives and false positives low. In other words, it should perform well and be genuinely agent-friendly, and it should be able to detect the short non-speech gaps between two separate sentences in order to lower end-to-end latency. Existing VAD solutions just aren't cutting it in some of these aspects.

Why Traditional VAD Solutions Struggle With Latency

Traditional energy-based VADs, as well as the pitch-based and GMM-based VAD from WebRTC, are not robust against noise, which results in a high false alarm rate, so they cannot be used in a voice AI agent. And state-of-the-art neural network-based VADs, such as Silero VAD, have high latency, and that latency is simply not tolerable in real-time voice AI systems.
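To see why the energy-based approach in particular is so fragile, here is a minimal sketch of it (the frame size and threshold are arbitrary illustrative values, not taken from any specific library):

```python
import numpy as np

def energy_vad(audio: np.ndarray, sample_rate: int = 16000,
               frame_ms: int = 30, threshold: float = 1e-3) -> list:
    """Classic energy-based VAD: a frame counts as speech if its mean
    energy exceeds a fixed threshold. Loud background noise crosses the
    same threshold, which is where the false alarms come from."""
    frame_len = int(sample_rate * frame_ms / 1000)
    decisions = []
    for start in range(0, len(audio) - frame_len + 1, frame_len):
        frame = audio[start:start + frame_len]
        decisions.append(float(np.mean(frame ** 2)) > threshold)
    return decisions
```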

For example, if the end of a sentence cannot be detected quickly, then a user on a call with the AI agent is left waiting for a response. After one or two seconds without hearing anything back, they get annoyed and start talking again.

How VAD Directly Impacts Conversational Latency

VAD essentially acts as the traffic controller here by detecting both the start of sentence (SOS) and end of sentence (EOS), as mentioned before. It is what tells the system exactly when to begin processing a response and when to move to the next step in real time.

For example, VAD might treat a pause of around 200 milliseconds of silence as the end of a sentence. This threshold can be adjusted depending on how responsive or how cautious you want the AI to be. Once VAD detects EOS, it signals the speech-to-text system to finalize the transcription. That transcription is then sent to the LLM, and the LLM output triggers TTS to synthesize a response. This is why low VAD latency is so important in conversational AI applications.
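As a rough sketch of that control flow (the vad, stt, llm, and tts callables and the 200 ms threshold below are placeholders, not any particular product's API):

```python
def run_turn(frames, vad, stt, llm, tts, frame_ms=20, eos_silence_ms=200):
    """Toy turn loop: buffer speech frames until VAD sees enough trailing
    silence to call it the end of a sentence, then run the rest of the
    pipeline. All five callables are placeholders."""
    speech_frames = []
    silence_ms = 0
    heard_speech = False
    for frame in frames:
        if vad(frame):                        # VAD says this frame is speech
            heard_speech = True
            silence_ms = 0
            speech_frames.append(frame)
        elif heard_speech:
            silence_ms += frame_ms
            if silence_ms >= eos_silence_ms:  # EOS: pause is long enough
                text = stt(speech_frames)     # finalize the transcription
                reply = llm(text)             # generate the agent's answer
                return tts(reply)             # synthesize speech for playback
    return None                               # no complete sentence heard
```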

Detecting the start of a sentence determines the interaction latency, while detecting the end of a sentence drives the overall end-to-end, or response, latency. Both matter a great deal.

Design Choices That Enable Low-Latency VAD

High-Quality Training Data

First of all, let's talk about the training data. Use precisely, manually labeled data for training rather than open-source data: open-source datasets do provide labels, but those labels are often imprecise, inaccurate, or rough. Poor-quality labels teach the model the wrong timing and patterns, which leads to delayed and badly timed responses. With such low-precision, cheap data, the VAD model effectively learns latency, and the user experience suffers as a result.

The Role of Pitch Features

Pitch plays a very important role in the model design. When training VAD, we can use pitch, or the fundamental frequency, as an input feature.

Pitch is a characteristic of the vocal cords: only when the vocal cords vibrate do they generate a pitch, or fundamental frequency. Voiced human speech contains pitch while most noise does not, so we can train the model to learn this distinction. If the model detects a pitch of, say, 100 Hz, it concludes that someone is probably speaking; if there is no pitch at all, it concludes that whatever audible sound is present is likely noise rather than speech.
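As a small illustration of how such a pitch feature could be computed (a sketch using librosa's pyin estimator; the 50–400 Hz search range is an assumption, not the actual training recipe):

```python
import librosa
import numpy as np

def pitch_feature(audio: np.ndarray, sample_rate: int = 16000) -> np.ndarray:
    """Frame-level fundamental frequency (F0). Unvoiced or silent frames
    come back as NaN, which we map to 0 Hz so the model explicitly sees
    'no pitch here'."""
    f0, voiced_flag, voiced_prob = librosa.pyin(
        audio, fmin=50, fmax=400, sr=sample_rate)
    return np.nan_to_num(f0)
```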

Diverse Real-World Training Data

To train VAD for both low latency and high accuracy, we can collect data from many different scenarios and train on all of it: recordings from different room configurations, with many different kinds of microphones, at different distances between the speaker and the microphone, and with varying amounts of reverberation, all combined with the pitch feature described above.

Multitask Training

Lastly, we use multitask learning to train the model. While its main job is voice activity detection, the model also learns other supporting tasks that are closely related to audio and speech.

By learning these extra tasks together, the model understands speech patterns better, separates speech from noise more accurately, and becomes more reliable overall.
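As a hedged sketch of the idea (the auxiliary task and layer sizes below are assumptions, not the actual architecture), a shared encoder can feed both a VAD head and a supporting speech-related head:

```python
import torch
import torch.nn as nn

class MultitaskVAD(nn.Module):
    """Shared encoder feeding two heads: the main speech/non-speech
    decision and an auxiliary speech-related target (here, frame-level
    SNR regression as an illustrative example)."""

    def __init__(self, n_features: int = 40, hidden: int = 64):
        super().__init__()
        self.encoder = nn.GRU(n_features, hidden, batch_first=True)
        self.vad_head = nn.Linear(hidden, 1)  # speech vs. non-speech logit
        self.aux_head = nn.Linear(hidden, 1)  # auxiliary target, e.g. SNR

    def forward(self, features: torch.Tensor):
        # features: (batch, time, n_features)
        encoded, _ = self.encoder(features)
        return self.vad_head(encoded), self.aux_head(encoded)

# Training would combine both losses so the auxiliary task shapes the
# shared encoder, e.g.:
#   loss = bce(vad_logits, vad_labels) + 0.3 * mse(aux_pred, aux_labels)
```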

These design choices make VAD faster, but real-world performance also depends on where the model is deployed: in the cloud or at the edge.

Edge Deployment and Its Impact on Latency

The VAD model is very small, only a few hundred kilobytes, so it can typically run on edge devices to ensure low-latency detection of voice activity.

Users hate delays when interacting with agents. Running VAD on the edge reduces that lag because the model sits right there on the device, detecting speech starts and stops almost instantly.

Another benefit is bandwidth and cost savings. If the VAD runs on the edge, it only sends audio frames that actually contain speech to the speech-to-text (ASR) system. This saves bandwidth, lowers speech-to-text costs, and keeps things efficient.
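As a simple sketch of that gating logic (vad and send_to_asr are placeholder callables, not a specific SDK's API):

```python
def stream_speech_only(frames, vad, send_to_asr):
    """Run VAD on-device and upload only frames flagged as speech."""
    sent = 0
    for frame in frames:
        if vad(frame):          # speech detected locally on the edge device
            send_to_asr(frame)  # only these frames cost bandwidth and ASR time
            sent += 1
    return sent                 # number of frames actually uploaded
```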

Edge deployment reduces latency, but it also raises an important question: can VAD run efficiently on mobile devices without consuming too much CPU?

Can VAD run efficiently on mobile devices without consuming too much CPU?

Voice activity detection is basically a simple binary task: classifying each audio frame as speech or non-speech. It just needs to tell whether there is speech or not. That simplicity lets engineers keep the model small, with only a few hundred thousand parameters, so even on mobile it barely uses any CPU.

The task’s simplicity lets us avoid that tough choice between efficiency and accuracy. It runs light on mobile but still catches speech reliably.
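To make that scale concrete, here is a purely illustrative model of roughly that size (the layer widths are assumptions chosen only to show the parameter count, not an actual VAD architecture):

```python
import torch.nn as nn

tiny_vad = nn.Sequential(
    nn.Linear(40, 256), nn.ReLU(),
    nn.Linear(256, 256), nn.ReLU(),
    nn.Linear(256, 256), nn.ReLU(),
    nn.Linear(256, 1),  # single logit: speech vs. non-speech
)
print(sum(p.numel() for p in tiny_vad.parameters()))  # ~142k parameters
```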

Now we know VAD is lightweight, fast, and reliable, but let's discuss where it still struggles!

Known Limitations and Frontier Challenges

VAD is only responsible for detecting whether there is speech in an audio frame or not, while the voice agent itself is multilingual. VAD can be used across languages because core human voice characteristics, such as pitch, exist in all human speech. Although its frequency varies from person to person, the underlying vocal properties are independent of the spoken language.

Handling overlapping speakers is not really VAD's job. VAD just detects whether there is speech; it does not know whether there are two, three, or many speakers. It doesn't care about the number of speakers at all. It simply says "this is speech" and passes the audio to the next step.

Whispered or weak speech detection is a big challenge. When someone speaks softly, like whispering, the audio signal is extremely low in energy. This makes it easy for VAD to mistake it for non-speech and miss it entirely.

With the development of deep learning, the model itself can learn the characteristics of whispered speech. Even at low volume, the frequency structure of speech, its spectral structure, is still present, so the model can extract these features regardless of volume.
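One way to see why this is learnable (a generic sketch, not the author's actual recipe): normalizing a log-mel spectrogram removes the overall level, so the spectral structure of a whisper stays visible to the model even when the energy is very low.

```python
import librosa
import numpy as np

def normalized_log_mel(audio: np.ndarray, sample_rate: int = 16000) -> np.ndarray:
    """Log-mel features with per-utterance normalization. Turning the
    volume down (as in whispering) mostly shifts the log-mel values by a
    constant, which the normalization removes, so the spectral structure
    the model relies on is preserved."""
    mel = librosa.feature.melspectrogram(y=audio, sr=sample_rate, n_mels=40)
    log_mel = np.log(mel + 1e-6)
    return (log_mel - log_mel.mean()) / (log_mel.std() + 1e-6)
```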

FAQs about Voice Activity Detection

1. What is Voice Activity Detection (VAD) in conversational AI?

Voice Activity Detection (VAD) is a system that detects when a person starts speaking and when they stop speaking. A conversational AI system needs it to quickly spot the SOS (start of sentence) and EOS (end of sentence), with low computational load and a small library size.

2. Why is VAD critical for low-latency AI conversations?

VAD treats a short pause, for example around 200 milliseconds of silence, as the end of a sentence. Once VAD detects EOS, it signals the speech-to-text system to finalize the transcription. The transcription is then sent to the LLM, and the LLM output triggers TTS to synthesize a response. This is why low VAD latency is so important in conversational AI applications.

3. Why do traditional VAD solutions struggle in real-time voice AI systems?

Traditional energy-based VADs, as well as the pitch-based and GMM-based VAD from WebRTC, are not robust against noise, which results in a high false alarm rate, so they are not suitable for a voice AI agent. Neural network-based VADs such as Silero VAD are more accurate, but their latency is still too high for real-time voice AI systems.

4. Can VAD run efficiently on edge and mobile devices?

Yes. Voice activity detection is basically a simple binary task: classifying each frame as speech or non-speech. A small model with only a few hundred thousand parameters is enough, and it barely uses any CPU at all.
