Voice, video, or text: choosing the right format for your interview study

The honest case for choosing voice, video, or text in qualitative research, and why Nava bet on voice-first. In our own work, voice yields about 4 to 5x more words per answer than text.

Mattias Sjölunder

Co-Founder & CTO · March 12, 2026 · 6 min read

Every interview study starts with a choice you make before you write a single question: what modality you ask in. Voice, video, or text. It feels like a logistics decision, the kind you settle in a planning meeting and never revisit. It is not. The modality shapes how much people tell you, how honestly they say it, who can take part at all, and what your data is actually worth when it lands on your desk.

I am the CTO at Nava Insights, and I spend most of my time on the engineering behind voice interviews. So I want to be straight with you about the tradeoffs, including the ones that do not favor the bet we made. Then I will tell you why we went voice-first anyway.

What each format actually gets you

The three modalities are not just three input methods. They sit at different points on a curve that trades richness against effort.

Text is the lowest effort to administer and the easiest to analyze with a machine, because the words arrive already transcribed. It is also the thinnest signal. People type less than they speak, they edit themselves as they go, and the back-and-forth that surfaces a real reason gets compressed into short, careful replies. In our own work, voice answers run about 4 to 5x more words per answer than text. That is not a rounding difference. That is the difference between "the checkout was confusing" and a ninety-second account of where exactly someone got stuck, what they assumed, and what they did next.
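If you want to check a ratio like this on your own data, the arithmetic is simple: average the word count per answer in each modality and divide. The Python sketch below is one minimal way to do that. The helper names `mean_words` and `richness_ratio` are hypothetical, the sample answers are invented placeholders rather than Nava data, and splitting on whitespace is only a rough stand-in for a real word count.

```python
from statistics import mean


def mean_words(answers: list[str]) -> float:
    """Average number of whitespace-separated words per answer.

    Raises statistics.StatisticsError if `answers` is empty.
    """
    return mean(len(answer.split()) for answer in answers)


def richness_ratio(voice_answers: list[str], text_answers: list[str]) -> float:
    """How many times longer the average voice answer is than the average text answer."""
    return mean_words(voice_answers) / mean_words(text_answers)


# Invented placeholder answers, for illustration only.
text_answers = [
    "The checkout was confusing",
    "Too many steps",
]
voice_answers = [
    "So I got to the checkout and I could not find the pay button, "
    "because the promo code box had pushed it off the screen",
    "I assumed my card had been declined, so I closed the tab "
    "and tried again later on my laptop",
]

print(f"voice answers are {richness_ratio(voice_answers, text_answers):.1f}x longer")
```

A comparison like this is only meaningful when both sets of answers respond to the same questions from comparable participants.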

Video sits at the other end. You get tone, face, hesitation, the pause before someone admits something. It is the richest channel on paper. It also asks the most of the participant. They have to be somewhere presentable, on camera, at a scheduled time, willing to be seen. That filters your sample toward people who are comfortable being watched, and it filters out a lot of honest, ordinary participants who would have talked freely if no one were looking at them.

Voice sits in between, and it is a more interesting middle than it first appears. You keep almost everything that makes spoken answers rich: the length, the spontaneity, the hesitations and emphasis that tell you where the real feeling is. You drop the one thing that makes people perform, which is the camera.

Voice keeps the part of an interview that carries meaning and removes the part that makes people self-conscious.

Candor, effort, and who gets to take part

There are three things I weigh on every study, and modality moves all three.

The first is candor. People say different things when they are not being watched. A camera introduces a small, constant awareness of being on display, and that awareness leaks into the answers, especially on anything sensitive, embarrassing, or socially loaded. Voice lets someone speak plainly about a frustrating product, a failed habit, or an awkward purchase without managing their face while they do it. You hear the truth in the voice. You do not put them on a stage to get it.

The second is participant effort, which decides who actually finishes. Every requirement you add (be on camera, be presentable, be at a scheduled slot) costs you completions and skews who remains. Voice asks for very little. Nava runs voice-only interviews with no camera required, so a participant can join from a kitchen, a commute, or a parked car, on their own schedule, in their own words.

The third is accessibility, which is partly an ethics question and partly a data-quality one. Text excludes people who do not type quickly or comfortably, including plenty who would speak fluently and at length. Video excludes people without a private, presentable space at the right moment. A narrower channel quietly narrows your sample, and a narrow sample is a biased one. Speaking is the most universal of the three. Almost everyone can do it well.

Here is the honest cost of voice, because I promised tradeoffs. Voice does not capture facial expressions, and in some studies, such as usability sessions where you need to see what someone is looking at, or body-language research, seeing the person genuinely matters. If your research question is fundamentally visual, voice alone will not give you everything. I would rather tell you that than oversell.

Analysis: turning a conversation into evidence

Modality also decides how hard it is to get from raw responses to something you can act on.

Text is trivial to process and thin to mine. Video is rich and genuinely hard to analyze at scale, because you are working with hours of footage and the signal is spread across words, tone, and image at once. Voice is the practical sweet spot. Spoken language transcribes cleanly, and the result is a high-richness record that a machine can actually reason over.

That is the part of the pipeline I work on most. At Nava, every voice interview becomes an evidence-based set of insights: themes, sentiment, and behavioral archetypes, each one traceable back to the specific quote and the exact position in the transcript where it came from. Nothing is a black box. When the system tells you a frustration is widespread, you can click straight to the eleven people who voiced it and read their words. You get the richness of a real conversation with the rigor of something you can audit, line by line.
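To make "traceable back to the specific quote and the exact position in the transcript" concrete, here is a minimal Python sketch of one way such a record could be shaped. It is an illustration under stated assumptions, not Nava's implementation: the names `Quote`, `Theme`, `find_quote`, and `resolve` are hypothetical, positions are assumed to be character offsets, and the transcripts are invented.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class Quote:
    """A pointer into a transcript (hypothetical shape, not Nava's schema)."""
    interview_id: str
    start: int  # character offset where the quote begins (inclusive)
    end: int    # character offset where the quote ends (exclusive)


@dataclass
class Theme:
    """A theme that is only as strong as the quotes attached to it."""
    label: str
    evidence: list[Quote] = field(default_factory=list)

    def participants(self) -> set[str]:
        """Distinct interviews in which this theme was voiced."""
        return {quote.interview_id for quote in self.evidence}


def find_quote(interview_id: str, transcript: str, text: str) -> Quote:
    """Locate the first occurrence of `text` in a transcript and record its position."""
    start = transcript.find(text)
    if start == -1:
        raise ValueError(f"quote not found in transcript {interview_id!r}")
    return Quote(interview_id, start, start + len(text))


def resolve(quote: Quote, transcripts: dict[str, str]) -> str:
    """Read the exact words a quote points to, straight from the source transcript."""
    return transcripts[quote.interview_id][quote.start:quote.end]


# Invented transcripts, for illustration only.
transcripts = {
    "p01": "I got stuck at checkout because the promo field hid the pay button.",
    "p02": "Honestly the checkout confused me, so I gave up.",
}

theme = Theme("checkout friction")
theme.evidence.append(find_quote("p01", transcripts["p01"], "stuck at checkout"))
theme.evidence.append(find_quote("p02", transcripts["p02"], "checkout confused me"))

for quote in theme.evidence:
    print(quote.interview_id, quote.start, resolve(quote, transcripts))
```

The design choice worth noting is that a theme stores pointers rather than copied text, so any claim about how widespread it is can be checked against the original transcript.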

Why we bet on voice-first

We could have built text interviews. They are simpler to engineer, and we would have shipped sooner. We did not, and the reasoning was the same set of tradeoffs I just walked you through.

Voice carries far more than text, by 4 to 5x in words alone and even more in the things words do not capture, like emphasis and hesitation. It lowers the barrier compared to video, with no camera anxiety and no need to be anywhere in particular. And it still feels human, which is the part that matters most, because people open up to a voice that listens and responds in a way they never open up to a blank text box. The quality of the conversation drives the quality of the data. Voice gave us the best conversation we could offer at scale.

To be clear about where we stand today: Nava is voice-first. We run voice-only interviews, and we do not offer video or text interview modes. That is a deliberate focus, not an accident, and I would rather do one modality genuinely well than three modalities adequately.

So when you plan your next study, start with the modality, not the questions. If your question is visual at its core, account for that. For most qualitative work (the why behind a behavior, the story behind a number, the reason someone really churned) you want people talking, in their own words, with as little standing between them and the truth as possible. That is the bet we made, and I am comfortable telling you exactly why.

Mattias is Co-Founder and CTO of Nava Insights, where he leads the engineering behind the real-time voice AI that powers every interview.
