Building Real-Time Voice AI in React Native

Malik Chohra

April 15, 2026 · 5 min read

Conversational latency is the difference between a tool and a toy. Here's how the mobile ecosystem is pushing toward 800ms voice loops.

Real-time voice AI in React Native requires keeping end-to-end conversational latency under 800ms. The architecture that achieves this combines on-device Voice Activity Detection (VAD) to cut silence, binary WebSocket audio streaming to eliminate base64 overhead, and WireAI to translate the transcribed text into native UI components, not just a text response.

Voice is the most natural interface for an AI agent, but push-to-talk with a 3-second wait feels like a broken walkie-talkie. The gap between acceptable and unusable voice AI is roughly 800ms: under that threshold, the interaction feels conversational; over it, users consciously notice the lag and stop trusting the agent.

Why standard audio recording fails for real-time AI

The typical React Native audio implementation follows this path: start recording → user finishes speaking → stop recording → save file to disk → upload file via HTTP POST → wait for response. This approach introduces 4–6 seconds of latency on a good connection. The bottlenecks are the file write (50–200ms), the HTTP overhead (300–800ms), and the transcription wait (500ms–2s for Whisper API). None of these bottlenecks is the LLM; at this point, the model hasn't even seen the audio.

Fixing this requires attacking each bottleneck: eliminate file writes with in-memory streaming, eliminate HTTP overhead with WebSockets, and eliminate transcription wait with parallel processing.

On-device VAD: only stream when the user is speaking

Voice Activity Detection determines when speech starts and ends in real time, without round-tripping to a server. Running VAD on-device means you open the microphone stream only when the user is actively speaking and close it automatically when they stop. The benefits:

  • Lower bandwidth: You only send audio bytes that contain speech, not silence. A 10-second recording might contain 3 seconds of actual speech.
  • Cleaner transcripts: Whisper hallucinates text from silence and background noise (HVAC hum, keyboard clicks). VAD-gated audio cuts these transcription artifacts significantly.
  • Natural barge-in: The agent can detect when the user starts speaking mid-response and interrupt itself, mirroring the turn-taking of a real conversation.

The @picovoice/cobra-react-native library provides on-device VAD with a free tier. For open-source alternatives, silero-vad runs locally via ONNX but requires a development build with native modules.
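
The gating logic itself is small and library-agnostic. Here is a sketch of an end-of-speech gate, assuming a VAD engine that reports a speech probability per short audio frame (as Cobra does); the threshold and hang-over values are illustrative starting points, not tuned constants:

type VadCallbacks = {
  onSpeechStart: () => void; // open the WebSocket stream
  onSpeechEnd: () => void;   // flush the final chunk, wait for the transcript
};

export function createSpeechGate(
  { onSpeechStart, onSpeechEnd }: VadCallbacks,
  threshold = 0.6,    // probability above which a frame counts as speech
  hangoverFrames = 3, // ~100ms of silence at ~32ms frames before end-of-speech
) {
  let speaking = false;
  let silentFrames = 0;

  // Call once per audio frame with the engine's voice probability
  return function onFrame(voiceProbability: number) {
    if (voiceProbability >= threshold) {
      silentFrames = 0;
      if (!speaking) {
        speaking = true;
        onSpeechStart();
      }
    } else if (speaking && ++silentFrames >= hangoverFrames) {
      speaking = false;
      silentFrames = 0;
      onSpeechEnd();
    }
  };
}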

Binary WebSocket streaming: remove the base64 tax

React Native's fetch API and XMLHttpRequest both encode binary data as base64 strings when crossing the bridge. A 16kHz PCM audio chunk that is 32KB of raw bytes becomes ~43KB as base64, a 33% overhead on every audio packet sent to your transcription server. At 10 packets per second, that overhead compounds: a third more bandwidth, plus the CPU cost of encoding and decoding on every single packet.

The solution is a native WebSocket that handles binary frames directly. React Native's built-in WebSocket class supports binaryType = 'arraybuffer' since RN 0.72. This sends raw PCM bytes over the socket without serialization overhead:

const ws = new WebSocket('wss://your-whisper-server/stream');
ws.binaryType = 'arraybuffer';

// Send raw PCM audio chunk directly
function sendAudioChunk(pcmBuffer: ArrayBuffer) {
  if (ws.readyState === WebSocket.OPEN) {
    ws.send(pcmBuffer); // no base64, no bridge overhead
  }
}

// Receive transcription result
ws.onmessage = (event) => {
  const { transcript, isFinal } = JSON.parse(event.data);
  if (isFinal) {
    // Send transcript to WireAI for component selection
    sendMessage(transcript);
  }
};
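
The server side of this contract is equally simple: binary frames carry raw PCM, and the reply is the JSON envelope the client parses above. Here is a minimal sketch using Node's ws package; the end-of-speech signal (a text frame) and the transcribe function are assumptions standing in for your actual Whisper deployment:

import { WebSocketServer, WebSocket, type RawData } from 'ws';

// Placeholder: swap in your real Whisper call (HTTP, gRPC, local process)
async function transcribe(pcm: Buffer): Promise<string> {
  throw new Error('wire up your Whisper backend here');
}

const wss = new WebSocketServer({ port: 8080 });

wss.on('connection', (socket: WebSocket) => {
  const pcmChunks: Buffer[] = [];

  socket.on('message', async (data: RawData, isBinary: boolean) => {
    if (isBinary) {
      // Binary frame: raw PCM straight off the socket, no base64 decode
      pcmChunks.push(data as Buffer);
      return;
    }
    // Text frame: treated here as the end-of-speech signal from the VAD gate
    const audio = Buffer.concat(pcmChunks);
    pcmChunks.length = 0;
    const transcript = await transcribe(audio);
    socket.send(JSON.stringify({ transcript, isFinal: true }));
  });
});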

Connecting voice to WireAI

Once the transcript arrives, pass it to WireAI's sendMessage exactly as you would a typed message. The agent processes the transcribed text, picks a component from your registry, and renders it natively. The user speaks a request and gets back a tappable native card, not a wall of text.

import React from 'react';
import { View } from 'react-native';
// Assumes WireAIMessageList is exported from the same package as the hook
import { useWireAIThread, WireAIMessageList } from 'wireai-rn';
// VoiceInputButton wraps the VAD gate + WebSocket streaming from the sections above
import { VoiceInputButton } from './VoiceInputButton';

export function VoiceAgentScreen() {
  const { sendMessage, messages } = useWireAIThread();
  const [isListening, setIsListening] = React.useState(false);

  const handleTranscript = (transcript: string) => {
    setIsListening(false);
    // Voice input feeds into the same WireAI pipeline as typed input
    sendMessage(transcript);
  };

  return (
    <View style={{ flex: 1 }}>
      <WireAIMessageList messages={messages} />
      <VoiceInputButton
        isListening={isListening}
        onPressIn={() => setIsListening(true)}
        onTranscript={handleTranscript}
      />
    </View>
  );
}

End-to-end latency breakdown

A well-optimized voice AI pipeline on a modern phone over WiFi looks like this:

  • VAD end-of-speech detection: 80–120ms after the user stops speaking.
  • Audio streaming to Whisper server: 50–150ms (binary WebSocket, final chunk flush).
  • Whisper transcription (cloud): 200–400ms for a 3-second audio segment.
  • WireAI LLM component selection: 300–700ms (GPT-4o Mini or local Llama 3).
  • React Native render: 10–30ms for component mount.
  • Total: 640–1,400ms. The low end is comfortably conversational; the high end is over the 800ms threshold, which is where the optimization below comes in.
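
These numbers vary by device, network, and model, so measure your own pipeline rather than trusting the table. A minimal sketch of a stage timer; the stage names are illustrative, and mark() should be called at each boundary (end-of-speech, transcript received, component mounted):

const marks: Record<string, number> = {};

export function mark(stage: string) {
  marks[stage] = performance.now();
}

export function report() {
  // Sort stages chronologically and log the delta between adjacent boundaries
  const stages = Object.entries(marks).sort((a, b) => a[1] - b[1]);
  for (let i = 1; i < stages.length; i++) {
    const [name, t] = stages[i];
    const [prevName, prevT] = stages[i - 1];
    console.log(`${prevName} -> ${name}: ${(t - prevT).toFixed(0)}ms`);
  }
}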

The biggest win comes from running Whisper locally on a server you control (not the OpenAI API), which cuts transcription latency to 100–200ms. Combined with a fast LLM like GPT-4o Mini, sub-500ms end-to-end is achievable on a good connection.


Build voice-first AI apps. Run npm install wireai-rn to start.