
The difference between an AI moderator and a chatbot (and why it's bigger than you think)

A chatbot trades typed turns. A voice moderator handles turn-taking, endpointing, latency, and barge-in, and the gap decides whether your data is any good.

Mattias Sjölunder
Co-Founder & CTO · May 9, 2026 · 7 min read

If you have used a customer-service chatbot, you have a mental model for what "AI talking to a person" feels like, and that model is doing you a disservice when you think about AI interviews. A chatbot and an AI moderator look related on a slide. Under the hood they are solving different problems, and the gap between them is the difference between a transcript you can build a decision on and one you cannot.

I am the CTO at Nava Insights, so let me take you under the hood. The interesting part is not the language model. It is everything that has to happen in the half-second around it for a spoken conversation to feel like a conversation at all.

A chatbot trades turns; a voice moderator runs a conversation

A text chatbot has the easiest possible version of the timing problem. You type, you hit send, and the bot knows with certainty that your turn is over because you told it so. There is no ambiguity about when to respond, no penalty for thinking for a second, no risk of talking over you. The turn boundary is a button press.

Spoken conversation has none of that structure, and that is where the real engineering lives. People do not press send when they finish a thought. They trail off, they pause mid-sentence to gather a thought and then keep going, they say "um" and continue, they interrupt. A voice moderator has to handle all of that in real time, while also interviewing well. Those are two hard jobs at once, and a chatbot only ever has to do the second one in its easiest form.

Here are the pieces that make spoken interaction work, in plain terms:

  • Turn-taking is the basic rhythm of who speaks when. In conversation it is governed by subtle cues, not a send button, and getting it wrong makes the whole exchange feel broken.
  • Endpointing is knowing when someone has actually finished speaking versus just pausing to think. Cut in too early and you talk over them and lose the end of their thought. Wait too long and the silence gets awkward and the moment dies.
  • Latency is the gap between someone finishing and the moderator responding. In a real conversation that gap is short. If it stretches, the participant feels the lag, gets self-conscious, and starts performing for the machine instead of just talking.
  • Barge-in is letting the participant interrupt. When someone jumps in to correct or add something, the moderator has to stop talking and listen, the way a person would, instead of plowing through its scripted line.
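To make the endpointing trade-off concrete, here is a minimal Python sketch of the simplest possible approach: declare the turn over once silence has lasted longer than a fixed threshold. It is an illustration under stated assumptions, not a description of any production system. The class name, the 20 ms frame size, and the 700 ms default are all hypothetical, and the per-frame voiced/unvoiced flags are assumed to come from some upstream voice-activity detector that is not shown.

```python
from dataclasses import dataclass
from typing import Iterable, Optional


@dataclass
class SilenceEndpointer:
    """Declares end of turn after a continuous run of silence (hypothetical helper)."""

    frame_ms: int = 20          # assumed duration of one audio frame
    min_silence_ms: int = 700   # assumed pause length that counts as "finished"

    def end_of_turn_ms(self, voiced: Iterable[bool]) -> Optional[int]:
        """Return the time in ms at which the turn is declared over, or None.

        `voiced` holds one flag per frame from an assumed upstream
        voice-activity detector, which is not implemented here.
        """
        heard_speech = False
        silence_ms = 0
        for index, is_voiced in enumerate(voiced):
            if is_voiced:
                heard_speech = True
                silence_ms = 0  # any speech resets the pause timer
                continue
            if not heard_speech:
                continue  # leading silence is not a finished turn
            silence_ms += self.frame_ms
            if silence_ms >= self.min_silence_ms:
                return (index + 1) * self.frame_ms
        return None  # the speaker never paused long enough


# 200 ms of speech, a 500 ms thinking pause, 200 ms more speech, then silence.
frames = [True] * 10 + [False] * 25 + [True] * 10 + [False] * 50
early = SilenceEndpointer(min_silence_ms=300).end_of_turn_ms(frames)
patient = SilenceEndpointer(min_silence_ms=700).end_of_turn_ms(frames)
```

With the 300 ms threshold the endpoint fires at 500 ms, inside the thinking pause, so the second stretch of speech would be talked over. With 700 ms it waits until 1600 ms, after the participant has really finished, at the cost of a longer silence before the reply. A fixed threshold cannot win both ways, which is why a single timeout is only a starting point.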

None of these exist for a chatbot, because typing removes them. All of them have to be solved, and solved well, for voice to feel human.

The hard part of a voice interview is not what the AI says. It is the timing of when it listens, when it speaks, and when it gets out of the way.

Why the timing decides the data

It would be easy to read all of that as polish. It is not polish. Each of those pieces directly determines whether a person opens up, and whether you can trust what they said.

Get endpointing wrong and you clip people. The moderator cuts in before someone reaches the real point, and the most valuable half of the answer (the part after the pause, where they were about to say the honest thing) never gets spoken. You do not see the data you lost. It simply is not there.

Let latency stretch and people stop being candid. A laggy, stilted exchange constantly reminds the participant that they are talking to a machine, and that awareness makes them shorter, more guarded, more careful. The interview turns into a form-filling exercise in spoken form, which defeats the entire reason to do a voice interview.
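One way to reason about that lag is to timestamp each exchange and split the gap the participant hears into its parts. The Python sketch below is a hedged illustration: the record type, its field names, and the budget value used in the example are assumptions made for this post, not measurements or an actual logging schema.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass(frozen=True)
class TurnTiming:
    """Timestamps in milliseconds for one exchange (hypothetical record)."""

    user_speech_end_ms: int    # when the participant actually stopped speaking
    endpoint_decided_ms: int   # when the system decided the turn was over
    reply_audio_start_ms: int  # when the moderator's first audio began playing


def response_gap_ms(t: TurnTiming) -> int:
    """Silence the participant sits through before hearing a reply."""
    return t.reply_audio_start_ms - t.user_speech_end_ms


def breakdown_ms(t: TurnTiming) -> Dict[str, int]:
    """Split the gap into waiting for the endpoint and producing the reply."""
    return {
        "endpoint_wait": t.endpoint_decided_ms - t.user_speech_end_ms,
        "reply_preparation": t.reply_audio_start_ms - t.endpoint_decided_ms,
    }


def turns_over_budget(timings: List[TurnTiming], budget_ms: int) -> List[int]:
    """Indices of the turns whose gap exceeded an assumed latency budget."""
    return [i for i, t in enumerate(timings) if response_gap_ms(t) > budget_ms]


timings = [
    TurnTiming(user_speech_end_ms=1000, endpoint_decided_ms=1700, reply_audio_start_ms=2100),
    TurnTiming(user_speech_end_ms=5000, endpoint_decided_ms=5300, reply_audio_start_ms=5600),
]
slow_turns = turns_over_budget(timings, budget_ms=800)  # 800 ms is a placeholder budget
```

The breakdown makes one design tension visible: the time spent waiting to be sure the participant has finished counts against the latency budget, so a more patient endpointer leaves less time to prepare the reply.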

Ignore barge-in and you train people to stop trying. When someone interrupts to add something important and the moderator talks over them, they learn the system is not really listening, and they stop volunteering. The richest, most spontaneous contributions are exactly the ones that arrive as interruptions, and those are the ones a poor handler throws away.
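Barge-in can be pictured as a small state machine: while the moderator is speaking, sustained participant speech should stop playback and hand the floor back. The Python sketch below is a minimal illustration under assumptions. The class, the 20 ms frame size, and the 200 ms minimum before yielding are hypothetical, and the voiced flag is again assumed to come from an upstream voice-activity detector.

```python
from enum import Enum, auto


class State(Enum):
    LISTENING = auto()
    SPEAKING = auto()


class BargeInController:
    """Tracks who holds the floor and yields when the participant interrupts."""

    def __init__(self, min_interrupt_ms: int = 200, frame_ms: int = 20) -> None:
        self.min_interrupt_ms = min_interrupt_ms  # assumed guard against coughs and clicks
        self.frame_ms = frame_ms                  # assumed duration of one audio frame
        self.state = State.LISTENING
        self._voiced_ms = 0

    def start_speaking(self) -> None:
        """Call when the moderator's audio starts playing."""
        self.state = State.SPEAKING
        self._voiced_ms = 0

    def on_frame(self, participant_voiced: bool) -> bool:
        """Feed one microphone frame; return True if playback should stop now."""
        if self.state is not State.SPEAKING:
            return False  # nothing is playing, so there is nothing to interrupt
        if not participant_voiced:
            self._voiced_ms = 0  # a short burst of noise does not count
            return False
        self._voiced_ms += self.frame_ms
        if self._voiced_ms >= self.min_interrupt_ms:
            self.state = State.LISTENING  # yield the floor to the participant
            self._voiced_ms = 0
            return True
        return False


controller = BargeInController()
controller.start_speaking()
# A 60 ms noise burst, one silent frame, then sustained participant speech.
signals = [controller.on_frame(v) for v in [True] * 3 + [False] + [True] * 10]
```

In this run the short burst is ignored and the stop signal is raised on the last frame, once the participant has been speaking continuously for 200 ms. The minimum duration is a trade-off: too short and every cough stops the moderator, too long and the participant feels talked over.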

So the conversational mechanics are not cosmetic. They are the gate that everything else passes through. The quality of the conversation drives the quality of the data, and the conversation is made of these small real-time decisions.

Interviewing well is the other half

Suppose you solve all of that and the turn-taking feels natural. You have still only built a system that can talk smoothly. Talking smoothly is not interviewing.

A good moderator does the harder, quieter job on top of the timing. It listens to what was actually said and probes that, rather than running down a fixed list. It asks adaptive follow-ups, because the planned question gets the surface and the follow-up gets the reason. It stays neutral, asking "what was that like" instead of "so that was frustrating, right," because a leading question puts words in the participant's mouth and quietly contaminates the data. And it follows the thread, recognizing when an answer has opened something worth pursuing and when a topic has run its course.

That is the part most people do not picture when they hear "chatbot." A chatbot answers your questions. A moderator asks the right next one, neutrally, in the moment, based on what you just said. One is built to resolve a request and end the exchange. The other is built to keep a person talking and draw out the why behind what they think.

Why the gap is bigger than it looks

Put the two halves together and the distance is large. A chatbot has the easy timing problem and the easy conversational goal: resolve a request and end the exchange. An AI moderator has the hard timing problem (turn-taking, endpointing, latency, barge-in, all in real time) and the hard conversational goal (probe, stay neutral, follow the thread, and keep someone genuinely talking). Nava is built for the interviewing job, end to end, and none of it works unless the timing job is also solved.

It matters because everything downstream inherits the quality of that conversation. The insights, the sentiment, the behavioral archetypes, the answers NavaGPT gives when you ask why a segment behaved the way it did: all of it sits on top of the transcript, and the transcript is only as good as the interview that produced it. A conversation that felt like a real one, where the person was relaxed and candid and never talked over, produces data worth analyzing. A clumsy one does not, and no analysis recovers what was never said.

So the next time something is described as an AI interviewer, ask the right question. Not "what model does it use," but "how does it handle the moment someone pauses, or interrupts, or trails off into the thing they actually mean." That is where the real work is, and it is the difference between a chatbot with a microphone and a moderator you would trust with your research.

Written by
Mattias Sjölunder
Co-Founder & CTO

Mattias is Co-Founder and CTO of Nava Insights, where he leads the engineering behind the real-time voice AI that powers every interview.
