What Is Voice Biometrics and How Does It Work in Call Centers?


What is Voice Biometrics?

Voice biometrics uses the unique properties of a speaker’s voice either to confirm their identity (authentication) or to identify them from a group of known speakers (identification).

Authentication is a one-to-one comparison: does the speaker I’m listening to now sound like the speaker they claim to be in my database? That one-to-one check is a relatively easy computational task.

Identification is figuring out which of the potentially many speakers in the database this particular user is. That is a far more challenging and error-prone task than authentication.

In most processes, the user’s identity is claimed by some other means before the system attempts to use voice biometrics to authenticate that individual. In the vast majority of situations, people first make a claim to an identity.
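The structural difference between the two tasks can be sketched in a few lines. This is an illustrative toy, assuming voiceprints are plain vectors, similarity is cosine, and 0.8 is an arbitrary placeholder threshold: `authenticate` makes one comparison against the claimed identity, while `identify` must score the sample against every enrolled speaker, which is why its cost and error rate grow with the database.

```python
import numpy as np

def score(sample, voiceprint):
    """Hypothetical similarity score; higher = more alike (cosine here)."""
    return float(np.dot(sample, voiceprint) /
                 (np.linalg.norm(sample) * np.linalg.norm(voiceprint)))

def authenticate(sample, claimed_voiceprint, threshold=0.8):
    """One-to-one: a single comparison against the claimed identity."""
    return score(sample, claimed_voiceprint) >= threshold

def identify(sample, enrolled):
    """One-to-many: one comparison per enrolled speaker in the database."""
    return max(enrolled, key=lambda name: score(sample, enrolled[name]))
```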


How Does Voice Biometrics Work?

Voice recognition via voice biometrics consists of four core steps:

  • Observe
  • Extract
  • Compare
  • Decide

Let’s understand them in detail!
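Before diving in, the four steps can be strung together as a toy end-to-end pipeline. Everything here is illustrative, not a real implementation: the audio is faked with random noise, the "voiceprint" is just a normalized magnitude spectrum, and the 0.8 threshold is an arbitrary placeholder rather than any system's real operating point.

```python
import numpy as np

rng = np.random.default_rng(7)

def observe():
    """Step 1: capture audio from the microphone (faked here with noise)."""
    return rng.standard_normal(8_000)           # 1 second at 8 kHz

def extract(audio):
    """Step 2: reduce the audio to a fixed-length feature vector."""
    spectrum = np.abs(np.fft.rfft(audio))
    return spectrum / np.linalg.norm(spectrum)  # unit-length "voiceprint"

def compare(features, enrolled):
    """Step 3: score the live features against the enrolled voiceprint."""
    return float(np.dot(features, enrolled))    # cosine similarity, 0..1 here

def decide(score, threshold=0.8):
    """Step 4: turn the probability-like score into an accept/reject call."""
    return "accept" if score >= threshold else "reject"

enrolled = extract(observe())                   # enrollment call
live_score = compare(extract(observe()), enrolled)  # later call
print(decide(live_score))
```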

1. Observe

In practice, when the speaker is speaking, that voice passes through a microphone. It’s then transmitted over the cell network and finally received in your corporate network, your phone system, or by your contact center provider. Before the voice reaches the feature extractor, it is processed and cleaned up at multiple stages.

The voice goes through some signal processing carried out within the device that will remove background noise and try to give you a very clean, high-quality signal.

The signal the smartphone captures is better than the signal transmitted across the telephone network, because the telephone network reduces the sampling rate to 8 kHz (limiting the usable bandwidth to under 4 kHz), whereas audio taken straight from the source, the mobile phone, is much higher fidelity.
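The bandwidth loss is easy to demonstrate. The toy snippet below samples a 6 kHz tone at the telephone-network rate of 8 kHz: nothing above the Nyquist limit of 4 kHz can be represented, so the tone shows up at the wrong frequency (in real telephony an anti-aliasing filter simply removes that content instead of letting it alias).

```python
import numpy as np

fs = 8_000                                   # telephone-network sampling rate
t = np.arange(fs) / fs                       # one second of sample times
tone = np.cos(2 * np.pi * 6_000 * t)         # a 6 kHz component of the voice

spectrum = np.abs(np.fft.rfft(tone))
freqs = np.fft.rfftfreq(len(tone), 1 / fs)   # representable range: 0..4000 Hz
peak = freqs[np.argmax(spectrum)]
print(peak)  # 2000.0 — the 6 kHz content aliases; everything above fs/2 is lost
```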

For voice biometrics, the audio capture device is the microphone. Its ubiquity and low cost make it a great choice for voice biometrics.

2. Extract

At this stage, the system extracts the key features that make each voice unique and usable for matching.

There are two main drivers of variation in human speech:

  • Physical characteristics
  • Behavioral characteristics

Physical characteristics

Physical characteristics come from the build of our body, which is very much driven by genetics. They include the length of your vocal tract, the size of your chest cavity, your lung capacity, and the position of your teeth. Thousands of permutations and combinations influence the sounds people make when they say different words and the rate of change of frequencies between those sounds.

Physical characteristics are important because people with whom you share genetics will share similar traits with you. They are not identical to you, but they will be more similar to you than the population as a whole.

Behavioral characteristics

Behavioral characteristics are things that you learn, and they are more likely to change than your physical characteristics; some change quickly and some take longer. People with whom you share demographics and other similarities will have similar behavioral attributes, but the sheer multiplicity of factors that drive how your voice actually sounds keeps the overall combination highly distinctive.

Voice biometric systems are designed to pick out the features that are most relevant for that calling population to allow you to identify particular individuals. The system is not measuring lung capacity or mouth cavity directly. It’s looking for the knock-on effects of all of those different physical and behavioral characteristics.
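A toy illustration of that last point: the extractor never measures anatomy directly, it works on the frequency content of the signal. The numpy-only sketch below (hypothetical, not any vendor's method) slices the audio into short frames and takes log-magnitude spectra; production systems derive richer representations such as MFCCs or neural speaker embeddings, but the principle of capturing the downstream acoustic effects is the same.

```python
import numpy as np

def extract_features(signal, frame_len=256, hop=128):
    """Toy spectral extractor: one log-magnitude spectrum per frame.

    The physical and behavioral traits described above leave their
    fingerprints in these frequency patterns; a real system would
    compress them further into a compact voiceprint.
    """
    window = np.hanning(frame_len)
    frames = [signal[i:i + frame_len]
              for i in range(0, len(signal) - frame_len + 1, hop)]
    spectra = [np.abs(np.fft.rfft(f * window)) for f in frames]
    return np.log1p(np.array(spectra))   # shape: (num_frames, frame_len//2 + 1)

rng = np.random.default_rng(0)
features = extract_features(rng.standard_normal(8_000))  # 1 s of fake audio
print(features.shape)
```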

3. Compare 

Now that we have the features, we need to compare them. For comparison, the important thing is a reference voice against which the extracted features can be compared.

For this purpose, voices are enrolled to form a knowledge base for comparison. Companies can do this in two ways:

  • Enrolling a genuine customer: the company records the customer’s real voice and creates a voiceprint that contains the person’s unique voice features.
  • Enrolling a known fraudster: the company creates a voiceprint for a known fraudster and adds them to a watch list. If someone calls and their voice matches a stored fraudster voiceprint, the system can flag it.
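The two enrollment paths above can be sketched as a small store. This is a hypothetical structure, not a vendor API: voiceprints are plain vectors, similarity is cosine, and the 0.9 watch-list threshold is an arbitrary placeholder.

```python
import numpy as np

def similarity(a, b):
    """Cosine similarity between voiceprint vectors (illustrative only)."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

class VoiceprintStore:
    """Toy enrollment store: genuine customers plus a fraudster watch list."""

    def __init__(self):
        self.customers = {}   # customer_id -> voiceprint vector
        self.watchlist = []   # voiceprints of known fraudsters

    def enroll_customer(self, customer_id, voiceprint):
        self.customers[customer_id] = np.asarray(voiceprint, dtype=float)

    def enroll_fraudster(self, voiceprint):
        self.watchlist.append(np.asarray(voiceprint, dtype=float))

    def flag_if_fraudster(self, caller_voiceprint, threshold=0.9):
        """Flag the call if the caller matches any watch-listed voiceprint."""
        caller = np.asarray(caller_voiceprint, dtype=float)
        return any(similarity(caller, vp) >= threshold for vp in self.watchlist)
```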

Looking at the mechanism for comparison, there are two different methods.

  • Text dependent
  • Text independent

Text Dependent

In text-dependent comparison, the system compares the way people say the same static passphrase or a random number sequence. With a static passphrase, the system assigns a fixed phrase to the caller and asks them to repeat that exact phrase whenever they call. The system already has a voiceprint of the passphrase from a previous interaction and compares how the person says the phrase now with how they said it before. With a random number, the system asks the caller to repeat a random number sequence and compares it with the stored voiceprint.

Text Independent Method

In the text-independent method, the system analyses how a caller speaks during normal conversation, comparing the speech patterns, frequency patterns, pronunciation style, and acoustic characteristics of the caller’s speech.

4. Decide 

After the comparison is made, the outcome is not a yes or no but a probability.

This is because my voice sounds different now than it did half an hour ago, and it might sound quite a bit more different than it did five years ago. The result of that comparison is a probability, not a deterministic outcome.

As an organization, we need to decide how confident we need to be. The comparison score lies between 0 and 100, and it shows how likely it is that the person speaking really is who they say they are.

A score of 100 would mean the audio in the database and this speaker sound identical, and we never really sound identical twice. For genuine speakers, the score is close to 100, but there is a long tail down toward zero, because callers may have a cold, be tired, be in a noisy place, or be on a poor phone connection.

Most imposters trying to access an account on somebody else’s behalf sound nothing like the real person and tend to score very low. But the closer those imposters are in genetics, accent, upbringing, and locale, the more like the real person they start to sound.

That is where the two score distributions overlap. That overlap represents the point where organizations must decide how confident they need to be in order to take action.

What are False Rejects and False Accepts and why do they occur?

At some point, call centers, especially those in highly regulated fields like finance, insurance, and health, have to decide how confident they need to be in order to do the thing the customer is asking for in the case of authentication.

For this purpose, they establish a threshold score above which agents are comfortable considering the person to be who they claim to be, and below which they are not.

And that creates these two error types. 

  • False acceptance: the system incorrectly accepts a person who isn’t who they claim to be
  • False rejection: the system incorrectly rejects a person who is who they claim to be

Managing these errors in the voice biometric process, and driving them down to the lowest possible level, is very important.

False accepts and false rejects occur because no authentication system produces a perfect yes-or-no outcome. All ID&V systems have a false accept rate and a false reject rate.

Why False Accept Occurs

A false acceptance occurs when the system incorrectly accepts the person who isn’t who they claim to be. If somebody calls a telephone system linked to a PIN, claims to be the original person, and enters the right PIN, they would get access, and that would actually be a false accept because the system believed they were genuine when they were not.

Why False Reject Occurs

A false rejection occurs when the system rejects a genuine customer. If a customer calls and forgets their password, that’s a false rejection, because you’re rejecting a genuine customer. In voice biometrics, more than 5%, sometimes more than 10%, of callers are restricted from service because we haven’t managed to get them through an authentication process.

Why Setting a Threshold Automatically Creates Errors

When we think about performance in the context of voice biometrics, what we mean is:

  • minimizing the level of false acceptance to the lowest possible level (so imposters don’t get in).
  • minimizing the level of false rejection as low as possible (so we don’t block real customers).

But call centers can never reduce both to zero at the same time.

For operations, an organization must decide its appetite for false acceptance and false rejection by setting a threshold score, which makes it easier for agents to decide when to accept or reject a caller’s access to services. But once that line is set, two types of mistakes automatically happen:

  • Sometimes a real customer will fall below the line (false reject).
  • Sometimes an imposter will score above the line (false accept).

These errors occur because the system is making a probability-based comparison and must draw a line somewhere. In practice, very few of your callers are imposters, and even those imposters may not be malicious in their intent. Often, it is far easier for a loved one or someone with caring responsibilities to pretend to be someone else because there is no other way of accessing the service.

When you add those buckets together, of all the people rejected by a biometric system, the vast majority will be genuine customers. That is why these should be referred to as mismatches rather than failures. They occur because any system that sets a threshold to balance risk and access will inevitably produce both false accept and false reject outcomes.
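The trade-off can be made concrete with a few hypothetical scores. The sketch below sweeps the threshold over invented genuine and impostor score samples (on the article's 0–100 scale) and shows that raising the line lowers false accepts but raises false rejects, and vice versa; neither can reach zero without hurting the other.

```python
def error_rates(genuine_scores, impostor_scores, threshold):
    """False-reject rate (genuine callers below the line) and false-accept
    rate (impostors at or above it) for a given threshold on a 0-100 scale."""
    frr = sum(s < threshold for s in genuine_scores) / len(genuine_scores)
    far = sum(s >= threshold for s in impostor_scores) / len(impostor_scores)
    return frr, far

# Invented samples: genuine callers cluster high, impostors low,
# with a small overlap in the middle (the "cold, tired, noisy line" tail).
genuine = [95, 92, 88, 85, 70, 62]
impostors = [10, 15, 22, 30, 66]

for threshold in (50, 65, 80):
    frr, far = error_rates(genuine, impostors, threshold)
    print(f"threshold={threshold}: FRR={frr:.2f}, FAR={far:.2f}")
```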

How is Voice Biometrics more secure than traditional call center security processes?

Let’s discuss security with respect to the risk versus the ability to impersonate for different actors.

First, imagine a close family member, maybe even an identical twin sister. With traditional knowledge-based security mechanisms, a twin sister or sibling is likely to know things like your mother’s maiden name, your date of birth, and what school you went to. Their ability to impersonate you is high. The good news is that family members represent a low risk. The chances of your brother or sister stealing from you are pretty low, so they are low-risk but have a high ability to impersonate under knowledge-based questions.

Next, consider a random stranger who finds a purse in the street. His ability to impersonate via knowledge-based questions is dictated by what he can get from the purse, such as an address, a date of birth, or details on a driving license. His desire to commit fraud might be low, but it is higher than a family member’s.

Then consider a professional fraudster. These are the people you really have to worry about because their business model is to steal from others. They research their victims, scour LinkedIn and Facebook, and can know everything about someone. Their ability to impersonate at the knowledge-based-question level can be quite high. So you have family members on one side at low risk but high ability to impersonate, and fraudsters on the other side at high risk and fairly high ability to impersonate.

When you introduce voice biometrics, the ability to impersonate drops across the board. Even an identical twin’s ability to impersonate you drops dramatically. It may still be higher than a random stranger’s, but it is low, and impersonation is not a trivial exercise even for an identical twin.

The risk levels remain the same: family members are still unlikely to steal from you, and professional fraudsters still want to steal from you. But their ability to impersonate you when voice biometrics is in play drops dramatically. That is why you get much stronger security.

There are no perfectly secure security systems. But voice biometrics is more secure than conventional knowledge-based traditional authentication methods because it significantly reduces the ability to impersonate.

What are the main ways Voice Biometrics is used in the Contact Centre?

Voice Biometrics is used in the Contact Centre for Fraud prevention and Authentication.

The vast majority of use cases focus on authentication. The best form of fraud prevention is strong authentication. With strong authentication you also get usability and efficiency benefits that you might not be able to derive from fraud prevention alone.

But there is a range of situations where fraud prevention use cases are valuable: particularly where callers are infrequent and fraudsters are very frequent, or where the type of use case makes enrollment challenging.

In those cases, fraud prevention technologies with identification features, checking watch lists or comparing speakers to find bad actors, can be really valuable.

Implementation Approaches of Voice Biometrics

There are different implementation approaches of Voice Biometrics. Let’s explore three of them:

  • Automated Authentication Voice Biometrics
  • Traditional Passive or Agent Authentication Voice Biometrics
  • Hybrid or Passive Everywhere Voice Biometrics

Automated Authentication Voice Biometrics

In automated voice biometrics authentication, callers are asked to repeat a passphrase to an automated system. It all takes place in automation, so the customer can be retained in the automated system and go through to self-service.

This approach also has psychological disadvantages. Saying the same passphrase each time can signal that a security process is happening, and customers often dislike being asked to repeat branded phrases. Saying something even as innocuous as “My voice is my password” can make people uncomfortable, especially in open-plan environments.

It is also predictable from a fraudster’s perspective. While there are mechanisms to mitigate that, the response remains predictable.

What the data shows is that accounts protected by voice biometrics, whether text dependent or text independent, suffer significantly less fraud. Fraudsters often skip those accounts and move to a weaker link. But compared to traditional knowledge-based authentication, this active passphrase-based system introduces both advantages and some usability and psychological challenges.

Traditional Passive or Agent Authentication Voice Biometrics

The other traditional use case is the passive use case. Authentication takes place in parallel with the agent conversation.

It is called passive because neither the customer nor the agent needs to do anything different. They just have the conversation. The customer explains what they are calling for, completes the normal identification step, or is identified using caller ID or ANI lookup. Authentication then takes place in the background while the agent continues servicing the customer.

It has tremendous advantages in terms of usability. It creates a strong customer experience because the customer does not have to do anything extra.

The challenge historically has been that the customer must already be speaking to an agent. The opportunity to automate or handle the customer’s needs through self-service gets lost. This is driven by the amount of audio required for text independent comparisons. Historically this meant 10 to 12 seconds of customer speech before enough audio was available to make the comparison.

Hybrid or Passive Everywhere Voice Biometrics

While enrollment audio is still typically captured through longer utterances, often with an agent, the technology has advanced to a point where two or three second utterances are sufficient to authenticate the customer. The short statements customers naturally provide when explaining why they are calling or during identification in an IVR are often enough.

In Passive everywhere voice biometrics, the implementation pattern is to enroll customers with agents and then use those voiceprints passively in IVRs, natural language understanding systems, and conversational AI.

The biggest change has been the amount of audio needed for verification. Historically 12 to 15 seconds were required, while recent live data shows that most customers can authenticate after two seconds of net audio. That is approximately the same amount of audio required to say a short passphrase.

This means the passive algorithm can be applied everywhere. There is no longer a need to decide upfront whether authentication will be done with automation or with an agent. The technology can be applied consistently across channels wherever audio can be captured.

The passive everywhere use case brings together the advantages of passive enrollment, which is easier for customers and leads to higher adoption. If customers are not enrolled and voiceprints are not created, the value of the technology cannot be realized. That is why the hybrid passive everywhere approach is seen as the future, combining stronger security with better usability and broader channel coverage.

Why shouldn’t you forget your digital channels when implementing Voice Biometrics in the call center?

A huge number of customers are now digital first. They prefer to use the mobile app, interact via the website, or use online services for their day-to-day activities.

Those digital channels are convenience channels. People go to the website or mobile app to complete tasks quickly. If they cannot complete something digitally and have to call an agent, they are already slightly frustrated. If, on top of that, they must go through a torturous ID&V process, frustration increases further.

If customers have never spoken before and have never had the opportunity to enroll, then when they move from their digital “happy place” to a phone interaction, security becomes friction.

If enrollment can happen through the mobile app or website, then customers can later transfer to an agent and pass security seamlessly.

Digital channels often provide higher-quality audio because they skip the traditional phone network. That higher audio quality improves enrollment and authentication performance.

The goal is to make it easier to get customers enrolled, so that when voice biometrics is needed later, it works smoothly across channels.

FAQs about Voice Biometrics in Call Centers

Why is short utterance enrolment the final frontier for Voice Biometrics in the call center?

In short utterance enrolment, two seconds of audio is good enough for authentication. If enrollment can happen with short utterances across different channels, then voice biometrics becomes much more powerful. The most obvious channel is the IVR. Many customers deal entirely with automation and never speak to an agent. If we can use the audio captured in the IVR to enroll customers, either in a single call or by aggregating audio over two or three calls, then enrollment no longer depends on an agent conversation.

What is the difference between Voice ID and Speaker ID?

Speaker ID is effectively identification. It refers to identifying who the speaker is, typically in a one-to-many comparison. You have a sample of the voice and multiple potential people it could be, and the system determines which one it is.

Voice ID tends to refer to the one-to-one comparison. The person is claiming to be someone specific, and the system checks, “Is this person who they claim to be?” That is authentication rather than identification.

How does Voice Biometrics handle authentication for trans users?

The system is gender agnostic. It does not care about gender. It only checks whether the current voice matches the voice that enrolled. For the vast majority of trans users, voice biometrics works well. There may be edge cases where medical transition involves changes to the vocal tract that fundamentally alter the voiceprint. In those rare cases, the user would need to re-enroll.
