HAAM Systems

July 1, 2026 · 15 min read · By Kris Haamer

The Pause Is Part of the Interface

Mariana Lin's work on Siri's personality, Apple's speech engineering, and OpenAI's real-time infrastructure reveal the same thing: interaction design does not sit on top of engineering. It is produced across the whole system.

A humanlike voice needs pauses. A humanlike conversation cannot tolerate the wrong pauses.

The character gives silence meaning. The infrastructure makes sure delay does not replace it.

Field media / AI for Good Global Summit

The Geneva panel where AI voice became character work.

These photos and the video excerpt come from the 2019 AI for Good opening panel on storytelling and AI systems. The clip centers on Mariana Lin discussing AI speech, responses, and the difficulty of designing dialogue for machines that meet unpredictable human language.

Opening panel at AI for Good Global Summit 2019: "Can Storytelling Build Capacity in AI Systems?"
Mariana Lin speaking during the opening panel on storytelling and AI systems.
The panel connected AI character, cultural storytelling, and how machines respond to people.
Video excerpt: Mariana Lin on AI speech, language, responses, and machine dialogue during the AI for Good storytelling panel.

Designed pause

Communicates intention

It gives speech rhythm, expresses character, signals thought and tells the listener whether a turn is ending or continuing.

System delay

Communicates failure

It makes the assistant appear absent, clips interruptions and turns a conversation into a sequence of submitted requests.

A memory from Geneva, and the woman behind Siri's character

At the AI for Good Global Summit in Geneva on May 31, 2019, I watched the opening panel "Can Storytelling Build Capacity in AI Systems?" One remark from Mariana Lin stayed with me: the reason Apple's Siri had begun to feel more human was not only a matter of vocabulary or intelligence. It was also about the spaces between words.

The session photos now make the memory specific. Lin's own biography says she served as a creative director at Apple and the principal writer shaping Siri's voice and personality. The AI for Good programme around that summit likewise framed storytelling as a way to make AI more culturally aware and inclusive.

The documented record now reveals two complementary forms of authorship. A WIRED account describes Alex Acero and Apple's speech team engineering pauses, pitch and phonetic variation. Lin describes the character, beliefs, dialogue and silences that make an artificial voice feel like a particular presence. One explains how the voice is produced. The other asks who is speaking.

That distinction matters. A firsthand memory can begin an article without becoming permission to collapse several people's work into one convenient attribution. The more interesting story is that Siri's apparent humanity came from writers, speech researchers and engineers shaping the same conversational moment.

The voice had a writer

Lin did not approach Siri as a database with jokes added to it. She approached artificial intelligence as character design. She spent years scripting lines for Siri and the humanoid robot Sophia, knowing that the human participant could say almost anything and that the character would have to remain recognizable across thousands of unfinished scenes.

Her central point is easy to miss in product work: an AI has a personality even when nobody deliberately designs one. People infer a point of view from what the system answers, refuses, misunderstands, remembers and repeats. A neutral assistant is still performing a character, usually the character of an efficient institution.

This expands the meaning of voice. Voice is not only timbre, accent or synthetic speech quality. It is also a pattern of attention. It includes what the system finds funny, how it reacts to abuse, whether it admits uncertainty and whether it treats an unexpected question as an error or an invitation.

TED describes Lin as one of Siri's principal voice writers. Her work makes the microphone, language model and speech synthesizer look less like a stack of separate features and more like the stage machinery supporting one continuing performance.

Speech is the last layer

Lin argues that speech should be designed last. Before dialogue comes an origin story, a defined function, a belief system and what she calls the AI's telos: the stable purpose that organizes its behavior.

This is more rigorous than giving a chatbot three adjectives and calling the prompt a personality. An origin explains who made the assistant and what culture shaped it. A function defines why it deserves attention. Beliefs guide decisions when no script covers the situation. Telos keeps those behaviors coherent as the system encounters new data.

Lin offers the example of a personal finance AI that believes time is more precious than money. That belief sits slightly sideways to the product's obvious function, which is exactly why it can generate more distinctive advice. A financial companion built around sustainable consumption could likewise believe that money is stored possibility, or that a good choice should remain possible without moral perfection.

Once those foundations exist, the pause acquires character. Silence before a difficult answer might communicate care. Silence after a factual question might communicate uncertainty. The timing is no longer decoration around the words. It is one way the system reveals what it values.

The pause was not empty

A synthetic voice that moves at a perfectly uniform speed does not sound precise. It sounds assembled. Human speech stretches, accelerates, hesitates and settles. We lengthen a syllable before stopping. We raise our pitch when asking a question. We use silence to show that a thought has ended, or a filler to show that it has not.

Consider the word "watch" in two sentences: "You want to watch this?" and "I like your watch." The written word is identical, but its conversational function changes. In the first sentence the pitch tends to rise. In the second it falls. Reusing the same sound would be linguistically correct and socially wrong.

Apple's work treated phonemes as context-sensitive material. The system needed versions that could trail at the end of a word, arrive more firmly at the beginning, rise inside a question or become longer before a pause. The voice became more natural because engineering could represent the variation that writers and designers wanted people to feel.

Silence can create presence

In a graphical interface, space separates elements. In a voice interface, time separates intentions.

Lin adapts Martin Buber's relational framework into three modes for AI. An assistant can be useful, entertaining or capable of creating a moment of connection. She argues that function without delight feels cold, while facial expressions, gestures and silences can help a person experience mutual presence.

A short silence might therefore mean that I am thinking, that my sentence has ended, that I am waiting for you, that I have not finished, or that the connection has failed. The acoustic event can be almost identical while the relationship it communicates is completely different.

Research on filled pauses shows that sounds such as "uh" and "um" can help a speaker hold the conversational turn. Their position and prosody influence whether a listener expects the speaker to continue. What looks like verbal noise can carry coordination data.

Human conversation contains hundreds of these tiny signals. Most disappear from conscious attention precisely because they work. A voice assistant has to reproduce enough of that structure to feel attentive while avoiding pauses that communicate nothing except system delay.

The interesting path is also an interface

Most conversational products are designed around a happy path: the user expresses a recognizable intent, the system completes a task and the exchange ends. Lin questions whether that model can contain real conversation at all.

Ideal dialogue differs across cultures, languages, genders and identities. People ramble, change subjects, speak metaphorically and ask questions without wanting a transaction. Lin compares writing for AI to writing an absurdist play in which the writer knows one character's goals but cannot control what anybody else will say.

Her fear is not only that machines will fail to understand us. It is that successful machines will train us to reduce speech to commands. An assistant that always drags dialogue toward completion may be efficient while quietly flattening the way people speak.

The design goal is therefore not to eliminate every detour. It is to distinguish an interesting pause from a broken one, an open-ended exchange from a failed intent and cultural variation from noise. Low latency should protect conversational possibility, not turn every conversation into a faster checkout flow.

A designed pause and a delayed packet are opposites

This creates the central contradiction of voice interaction. A humanlike voice needs pauses. A humanlike conversation cannot tolerate the wrong pauses.

The system must preserve intentional silence while removing accidental silence. Intentional silence can make a voice sound thoughtful. Accidental silence makes it feel absent.
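One way to make that requirement concrete is to treat the designed pause as a budget and subtract whatever delay the system has already spent. The sketch below is a toy model under that assumption; the function names and the millisecond values are invented for illustration and do not describe any shipping assistant.

```python
def silence_to_add_ms(designed_pause_ms, system_delay_ms):
    """Extra silence to insert so the listener hears roughly the designed pause.

    If the system has already been slow, no further silence is added.
    """
    return max(0, designed_pause_ms - system_delay_ms)


def perceived_pause_ms(designed_pause_ms, system_delay_ms):
    """Total gap the listener experiences before the voice starts."""
    return system_delay_ms + silence_to_add_ms(designed_pause_ms, system_delay_ms)


# A 400 ms pause written into the character, with two different system delays.
print(perceived_pause_ms(400, 150))  # 400: the delay hides inside the designed pause
print(perceived_pause_ms(400, 900))  # 900: the delay has replaced the designed pause
```

In the first case the author of the pause still controls what the listener hears. In the second, the infrastructure does.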

OpenAI describes network performance in explicitly experiential terms. Latency and jitter appear to users as awkward gaps, clipped interruptions and delayed barge-in. The model may understand every word and produce a strong answer, yet the rhythm tells the user that the interaction is broken.
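For readers who want a sense of what "jitter" measures, the sketch below applies the running interarrival jitter estimator defined in RFC 3550 (the RTP specification) to a handful of timestamps. The function name and the sample numbers are invented for this example, and nothing here describes how OpenAI measures its own network.

```python
def interarrival_jitter(send_ms, recv_ms):
    """Running jitter estimate in the style of RFC 3550, section 6.4.1.

    send_ms and recv_ms hold one timestamp per packet, in milliseconds.
    """
    jitter = 0.0
    for i in range(1, len(send_ms)):
        # Change in transit time between two consecutive packets.
        d = (recv_ms[i] - recv_ms[i - 1]) - (send_ms[i] - send_ms[i - 1])
        # Exponential smoothing with gain 1/16, as in the RFC.
        jitter += (abs(d) - jitter) / 16.0
    return jitter


# Packets sent every 20 ms (illustrative numbers).
send = [0, 20, 40, 60, 80]
steady = [50, 70, 90, 110, 130]  # constant 50 ms transit time
uneven = [50, 95, 92, 140, 131]  # packets bunch up and arrive out of rhythm

print(interarrival_jitter(send, steady))  # 0.0
print(interarrival_jitter(send, uneven))  # greater than zero
```

Both streams deliver every packet. Only the second one would make a voice sound as if it were breaking into fragments.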

That failure cannot be repaired with a nicer waveform animation. It is produced below the visible interface, inside transport protocols, routing decisions, buffering and model execution.

Milliseconds change the product category

OpenAI streams audio continuously so transcription, reasoning, tool use and response generation can begin while the person is still speaking. Without that stream, the system waits for a completed recording, processes it and returns an output.
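A back-of-the-envelope model shows why the difference is so large. The sketch assumes, purely for illustration, that audio is processed in fixed-size chunks and that each chunk costs a fixed amount of processing time; the function name and all numbers are hypothetical.

```python
def gap_after_speech_ms(utterance_ms, chunk_ms, cost_per_chunk_ms, streaming):
    """Silence between the user finishing and the system being ready to respond.

    Toy model: every chunk of audio costs a fixed amount of processing time.
    """
    chunks = -(-utterance_ms // chunk_ms)  # ceiling division
    if not streaming:
        # Batch: nothing starts until the recording is complete.
        return chunks * cost_per_chunk_ms
    # Streaming: each chunk is processed as soon as it has arrived.
    ready_at = 0
    for i in range(chunks):
        arrival = min((i + 1) * chunk_ms, utterance_ms)
        ready_at = max(ready_at, arrival) + cost_per_chunk_ms
    return ready_at - utterance_ms


# A 3-second utterance, 100 ms chunks, 30 ms of processing per chunk.
print(gap_after_speech_ms(3000, 100, 30, streaming=False))  # 900
print(gap_after_speech_ms(3000, 100, 30, streaming=True))   # 30
```

Under these assumptions the same total work produces a 900 ms gap in one design and a 30 ms gap in the other, because streaming hides most of the processing inside the time the person spends talking.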

The same microphone button can therefore describe two different products. At low and stable latency, the system feels like a conversational participant. With long gaps, it feels like push-to-talk radio or a voice-operated form.

Nothing important needs to change on the screen. The product category changes through timing alone.

This is why latency is not only an engineering metric. It is an interaction-design material, comparable to typography, motion, sound and physical resistance. It determines what kind of relationship the interface can sustain.

Architecture becomes perceptible behavior

To keep real-time audio responsive at scale, OpenAI separated packet routing from the stateful work of terminating WebRTC sessions. A thin relay receives traffic close to users. A transceiver owns the session state and the heavier protocol work. Routing information is carried in an existing WebRTC credential so the relay can direct the first packet without adding a database lookup to the critical path.
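The idea of routing from a credential can be sketched in a few lines. The description above does not say which credential field or what encoding OpenAI uses, so everything below is an assumption made for illustration: the hint is placed as a fixed-width prefix of an ICE username fragment, and the transceiver names, addresses and function names are hypothetical.

```python
import secrets

# Hypothetical table the relay already holds in memory.
TRANSCEIVERS = {"t07": "10.0.3.7:4000", "t12": "10.0.3.12:4000"}


def issue_ufrag(transceiver_id):
    """Signalling side: embed the owning transceiver in the username fragment."""
    return f"{transceiver_id}{secrets.token_hex(4)}"


def route_first_packet(ufrag):
    """Relay side: choose a destination from the credential alone."""
    transceiver_id = ufrag[:3]  # fixed-width prefix in this toy encoding
    # An in-memory dictionary read, with no database call on the critical path.
    return TRANSCEIVERS.get(transceiver_id)


ufrag = issue_ufrag("t12")
print(route_first_packet(ufrag))  # 10.0.3.12:4000
```

The point of the design is that the relay can stay thin: it learns where a session lives from data the client was going to send anyway.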

These are infrastructure decisions, but every one of them protects an interaction quality. Geographic proximity shortens waiting. Stable ownership preserves continuity. Fast routing makes interruption responsive. Low jitter keeps a voice from breaking into fragments.

The architecture does not sit behind the experience in any meaningful sense. It produces the experience that the user perceives.

A relay close to the user

The conversation starts before waiting becomes noticeable.

Stable transceiver ownership

The voice does not lose continuity halfway through a session.

Routing from the first packet

An interruption can feel immediate instead of becoming another queued request.

A point-to-point media model

The product is optimized around one person and one artificial conversational partner.

Engineering contains a theory of conversation

OpenAI's architecture is optimized primarily for one person speaking with one model, rather than for the multiparty media routing common in group video calls. That technical choice contains a social assumption: one human, one artificial voice and one shared conversational turn.

A classroom with several students and an AI tutor would need speaker identification, overlapping-speech management and decisions about whom the system should address. A conversation involving several agents would require rules for attention, interruption and authority. A customer-service handoff would need continuity across artificial and human participants.

Lin's work adds another layer. The system must also decide what kind of conversational partner it is, which values remain stable and how its behavior changes across cultures without losing its core character.

These are interaction-design, writing and systems-engineering questions at the same time. Changing the social situation can require changing the media architecture, dialogue system and character model beneath it.

The interface extends below the screen

Interaction design is still often presented as the design of screens, flows and visible feedback. Voice AI makes the limits of that definition obvious.

Its materials include origin stories, beliefs, phonemes, pitch, silence, turn detection, packet loss, buffering, inference speed, session state and geographic distance. Users do not inspect these mechanisms. They experience their consequences.

A pause before a difficult answer can make a system seem thoughtful. The same pause after an interruption can make it seem deaf. A stretched syllable can give a sentence shape. A delayed packet can make the same sentence stutter.

The interface is not only where system behavior is displayed. The interface is the behavior.

Writers, designers and engineers are shaping the same moment

The familiar product-development story says that designers define an experience, writers fill it with words and engineers implement it. Voice interaction offers a more accurate model: all three jointly determine which behaviors are meaningful, perceptually convincing, technically possible and reliable at scale.

Lin's framework gives the assistant an origin, function, beliefs and telos. Apple's speech work gives that character prosody and context-sensitive sound. OpenAI's infrastructure moves encrypted audio quickly enough that unintended delay does not overwrite the intended performance.

One discipline authors the reason for the pause. Another synthesizes its shape. Another delivers it on time.

Together they reveal a broader principle for intelligent products: every layer that changes timing, feedback, identity or behavior is part of the interaction design. The most important interface decision may be inside a character brief, a speech model, an ICE credential or a few hundred milliseconds the user never consciously notices.

When the whole system works, the pause can feel like attention. When one layer fails, the same pause feels like a machine.
