The Hermes ReadableStream Problem (and How to Fix LLM Streaming on React Native)

Malik Chohra

May 22, 2026 · 5 min read

Hermes does not implement ReadableStream on fetch, which silently breaks every cloud LLM SDK that calls response.body.getReader(). This is the production XHR + SSE fix, with the exact code WireAI ships, and how to avoid the three traps that ate three days of my life.

If you are trying to stream LLM responses in a React Native app and your tokens never arrive, the cause is almost always this: Hermes does not implement ReadableStream on fetch. Every cloud LLM SDK that calls response.body.getReader() dies silently. The fix is an XMLHttpRequest-based Server-Sent Events reader that delivers tokens as they arrive while the UI holds roughly 60fps on a real device. This post shows the bug, the workaround, and the production code WireAI ships in packages/core/src/streaming.

It is a small fix on paper and three days of "wait, why is this not working" in practice.

What you need before starting

You should be comfortable with React Native (Hermes runtime, not JSC), one cloud LLM SDK that supports streaming (OpenAI, Anthropic, Gemini), and basic Server-Sent Events knowledge (the data: ...\n\n format).

Estimated time to wire this up from scratch: 90 minutes if you have done SSE before, 4 hours if you have not. Or zero if you just install wireai-rn@0.1.3.

Step 1: Reproduce the bug

Drop the standard OpenAI streaming pattern into a fresh RN app:

import OpenAI from 'openai';

const openai = new OpenAI({ apiKey: process.env.OPENAI_KEY });

const stream = await openai.chat.completions.create({
  model: 'gpt-4o-mini',
  stream: true,
  messages: [{ role: 'user', content: 'Hello' }],
});

for await (const chunk of stream) {
  console.log(chunk.choices[0]?.delta?.content || '');
}

On web you would see tokens arrive one by one. On Hermes you will see one of three things, depending on RN version: TypeError: response.body.getReader is not a function, the stream hangs forever waiting for a reader that does not exist, or the library swallows the error and never yields a chunk.

The root cause: fetch on Hermes returns a Response whose .body is not a real ReadableStream. The fetch polyfill RN ships is partial. Anything that does response.body.getReader() or for await (const chunk of response.body) fails. This affects Anthropic's @anthropic-ai/sdk, OpenAI's openai package, Google's @google/generative-ai, and anything wrapping them. The open Facebook RN issue is #27741 if you want to follow the upstream conversation.

Step 2: Confirm it is the runtime, not your code

Check the runtime explicitly:

const isHermes = !!(global as any).HermesInternal;
console.log('Hermes:', isHermes);
console.log('ReadableStream:', typeof ReadableStream);
console.log('fetch body type:', (await fetch('https://example.com')).body);

If HermesInternal is truthy and ReadableStream is undefined (or body is null or undefined), you have found the problem. There is nothing wrong with the LLM SDK. The platform does not implement the stream type the SDK depends on.
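If you would rather have the app make this call at runtime than read logs, a small feature check is enough. The sketch below is a hypothetical helper (canStreamBody and pickTransport are names made up for illustration, not part of any SDK or of WireAI). It inspects the Response you actually received instead of the engine name, on the assumption that a globally polyfilled ReadableStream does not guarantee fetch returns one.

```typescript
// Hypothetical helpers for illustration only.
type StreamTransport = 'fetch-stream' | 'xhr-sse';

// True only when the response body exposes a getReader() method,
// which is what streaming SDKs call under the hood.
function canStreamBody(response: { body?: unknown }): boolean {
  const body = response.body as { getReader?: unknown } | null | undefined;
  return body != null && typeof body.getReader === 'function';
}

// Fall back to the XHR + SSE path whenever the body is not readable.
function pickTransport(response: { body?: unknown }): StreamTransport {
  return canStreamBody(response) ? 'fetch-stream' : 'xhr-sse';
}
```

Call pickTransport once against a cheap request at startup and cache the answer; the result should not change during a session.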

Step 3: The XHR + SSE workaround

The fix is to use XMLHttpRequest instead of fetch for streaming endpoints, set xhr.responseType = 'text', and parse Server-Sent Events from the onprogress callback as bytes arrive. XHR has full streaming support on Hermes. It is older than fetch, less elegant, and works.

Here is the core reader, simplified from what WireAI ships in packages/core/src/streaming/xhr-sse.ts:

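type SSEHandler = (event: { data: string }) => void;

export function streamSSE(
  url: string,
  headers: Record<string, string>,
  body: string,
  onMessage: SSEHandler,
  onDone: () => void,
  onError: (err: Error) => void,
) {
  const xhr = new XMLHttpRequest();
  xhr.open('POST', url, true);
  xhr.responseType = 'text';
  Object.entries(headers).forEach(([k, v]) => xhr.setRequestHeader(k, v));
  xhr.setRequestHeader('Accept', 'text/event-stream');

  let cursor = 0;
  let buffer = '';
  let finished = false;

  // Report completion or failure at most once, whichever comes first.
  // Without this guard, [DONE] and onload would both call onDone.
  const finish = (err?: Error) => {
    if (finished) return;
    finished = true;
    if (err) onError(err);
    else onDone();
  };

  const drain = () => {
    if (finished) return;
    // responseText accumulates; only parse the new tail.
    const tail = xhr.responseText.slice(cursor);
    cursor = xhr.responseText.length;
    // Normalize CRLF so "\r\n\r\n" separators split like "\n\n".
    buffer = (buffer + tail).replace(/\r\n/g, '\n');

    const parts = buffer.split('\n\n');
    buffer = parts.pop() ?? '';

    for (const part of parts) {
      const line = part.split('\n').find((l) => l.startsWith('data:'));
      if (!line) continue;
      // The space after "data:" is optional in SSE.
      const data = line.slice(5).replace(/^ /, '');
      if (data === '[DONE]') {
        finish();
        return;
      }
      onMessage({ data });
    }
  };

  xhr.onprogress = drain;
  xhr.onload = () => {
    // HTTP errors (401, 429, 500) arrive through onload, not onerror.
    if (xhr.status < 200 || xhr.status >= 300) {
      finish(new Error('Stream failed: HTTP ' + xhr.status));
      return;
    }
    drain(); // Pick up complete messages delivered with the final bytes.
    finish();
  };
  xhr.onerror = () => finish(new Error('Stream failed: ' + xhr.statusText));
  xhr.send(body);
}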

Three things worth noticing in that code:

  • responseText accumulates. XHR keeps the full response in memory. The cursor tracks position so we only parse the new tail each progress event. Without that, you reparse everything on every progress tick.
  • SSE buffering across chunks. A data: line can land split across two progress events. The buffer variable holds the partial last message until the next \n\n arrives.
  • The [DONE] sentinel. OpenAI-compatible SSE streams end with data: [DONE]. Other providers use different sentinels (Anthropic uses event: message_stop). Adapter-specific.
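To make the buffering point concrete, here is the same split-and-hold logic pulled out of the XHR callback so it can be exercised without a network. This is a minimal sketch, not WireAI's actual parser; the class name SSEChunkParser is hypothetical. It assumes messages are separated by a blank line and joins multiple data: lines with a newline, as the SSE format describes.

```typescript
// Hypothetical standalone parser for illustration only.
class SSEChunkParser {
  private pending = '';

  // Feed newly received text. Returns the data payload of every message
  // that is now complete; an incomplete trailing message stays buffered.
  push(chunk: string): string[] {
    // Normalize CRLF on the whole buffer so a "\r" and "\n" that arrive
    // in different chunks are still collapsed correctly.
    this.pending = (this.pending + chunk).replace(/\r\n/g, '\n');

    const messages = this.pending.split('\n\n');
    this.pending = messages.pop() ?? '';

    const payloads: string[] = [];
    for (const message of messages) {
      const dataLines = message
        .split('\n')
        .filter((line) => line.startsWith('data:'))
        // Drop the field name and at most one leading space.
        .map((line) => line.slice(5).replace(/^ /, ''));
      if (dataLines.length > 0) payloads.push(dataLines.join('\n'));
    }
    return payloads;
  }
}

const parser = new SSEChunkParser();
const firstBatch = parser.push('data: {"a":1}\n\ndata: {"b"');
const secondBatch = parser.push(':2}\n\n');
```

The first push yields only the complete message; the half-written second message is emitted by the second push, once its terminating blank line arrives.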

Step 4: Wire it to the LLM adapter

For OpenAI-compatible endpoints:

streamSSE(
  'https://api.openai.com/v1/chat/completions',
  {
    Authorization: `Bearer ${apiKey}`,
    'Content-Type': 'application/json',
  },
  JSON.stringify({
    model: 'gpt-4o-mini',
    stream: true,
    messages: [{ role: 'user', content: 'Hello' }],
  }),
  ({ data }) => {
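    // Skip malformed frames instead of throwing inside the XHR callback.
    let parsed: any;
    try {
      parsed = JSON.parse(data);
    } catch {
      return;
    }
    const token = parsed.choices?.[0]?.delta?.content;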
    if (token) appendToken(token);
  },
  () => setStreaming(false),
  (err) => console.error(err),
);

For Anthropic, the event format is richer (event: content_block_delta lines alongside data: lines). The Anthropic adapter in WireAI parses both per message. Otherwise, same XHR + SSE shape. For end-to-end Claude integration on RN see the Claude API guide.
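As a rough illustration of what "parses both per message" can look like, the sketch below reads the event: and data: lines from one SSE message and turns it into a token, a stop signal, or nothing. It is not the WireAI adapter; the function and type names are hypothetical. The payload shape (a delta object with type text_delta and a text field) follows Anthropic's documented streaming events.

```typescript
// Hypothetical names for illustration only.
type AnthropicStep =
  | { kind: 'token'; text: string }
  | { kind: 'stop' }
  | { kind: 'ignore' };

// Split one SSE message (the text between blank lines) into its fields.
function parseSSEMessage(block: string): { event: string; data: string } {
  let event = 'message'; // SSE default when no event: line is present
  const dataLines: string[] = [];
  for (const line of block.split('\n')) {
    if (line.startsWith('event:')) event = line.slice(6).trim();
    else if (line.startsWith('data:')) dataLines.push(line.slice(5).replace(/^ /, ''));
  }
  return { event, data: dataLines.join('\n') };
}

function interpretAnthropic(block: string): AnthropicStep {
  const { event, data } = parseSSEMessage(block);
  if (event === 'message_stop') return { kind: 'stop' };
  if (event !== 'content_block_delta') return { kind: 'ignore' };
  try {
    const delta = JSON.parse(data)?.delta;
    if (delta?.type === 'text_delta' && typeof delta.text === 'string') {
      return { kind: 'token', text: delta.text };
    }
  } catch {
    // Malformed JSON: treat as ignorable rather than crash the stream.
  }
  return { kind: 'ignore' };
}
```

Events such as ping or message_start fall through to ignore, so the UI only reacts to text deltas and the final stop.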

Step 5: Measure the performance on a real device

I logged frame times in the WireAI ChatScreen demo on a real iPhone 13, streaming roughly 200 tokens from gpt-4o-mini. Tokens arrived at roughly 25 to 40 per second (model-bound, not RN-bound). Frame time stayed under 16ms (60fps) during streaming. JS thread CPU stayed under 30% during stream parsing. These are dogfood logs, not a controlled benchmark suite; the order of magnitude is reliable, the exact numbers are not.

The XHR approach is not slower than fetch streaming in any measurable way. Just less elegant. The phone does not care.
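If you want to collect a tokens-per-second figure for your own app, something as small as the sketch below will do. To be clear, this is not the instrumentation behind the numbers above; TokenRateMeter is a hypothetical helper, and it only measures token arrival, not frame time or CPU.

```typescript
// Hypothetical helper for illustration only.
class TokenRateMeter {
  private firstMs: number | null = null;
  private lastMs = 0;
  private count = 0;
  private maxGapMs = 0;

  // Call once per received token, passing a timestamp in milliseconds.
  record(nowMs: number): void {
    if (this.firstMs === null) {
      this.firstMs = nowMs;
    } else {
      this.maxGapMs = Math.max(this.maxGapMs, nowMs - this.lastMs);
    }
    this.lastMs = nowMs;
    this.count += 1;
  }

  // N tokens span N - 1 intervals, so divide by the elapsed time between
  // the first and last token.
  tokensPerSecond(): number {
    if (this.firstMs === null || this.count < 2) return 0;
    const elapsedMs = this.lastMs - this.firstMs;
    return elapsedMs > 0 ? ((this.count - 1) * 1000) / elapsedMs : 0;
  }

  // Longest pause between two consecutive tokens; useful for spotting
  // proxy buffering, which shows up as long gaps followed by bursts.
  longestGapMs(): number {
    return this.maxGapMs;
  }
}
```

In practice you would call meter.record(Date.now()) inside the onMessage callback and log both values when the stream finishes.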

Step 6: Avoid the common traps

Three traps that ate hours of my life:

  • responseText size blow-up on long conversations. For very long responses, xhr.responseText grows large. Use the cursor pattern above. Do not JSON.parse the whole responseText on every progress event.
  • Missing Accept header. setRequestHeader('Accept', 'text/event-stream') matters. Some providers fall back to non-streaming if it is absent. Anthropic in particular.
  • HTTP/2 buffering on proxies. If you are behind a corporate proxy or a CDN, SSE messages can buffer up before reaching the client. There is no client-side fix. Test on a direct mobile network, not corporate Wi-Fi.

A fourth one if you are rolling your own adapter: do not forget the [DONE] sentinel for OpenAI-compatible endpoints. If you do not handle it, your UI will hang waiting for one more token that never arrives.
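For a hand-rolled OpenAI-compatible adapter, one way to keep the sentinel from being forgotten is to make it a first-class result of the payload handler. The sketch below is an assumption about how you might structure that, with hypothetical names; it is not the WireAI adapter.

```typescript
// Hypothetical names for illustration only.
type OpenAIStep =
  | { kind: 'token'; text: string }
  | { kind: 'done' }
  | { kind: 'skip' };

// Interpret the payload of one data: line from an OpenAI-compatible stream.
function interpretOpenAIData(data: string): OpenAIStep {
  // The sentinel is plain text, not JSON, so check it before parsing.
  if (data.trim() === '[DONE]') return { kind: 'done' };
  try {
    const text = JSON.parse(data)?.choices?.[0]?.delta?.content;
    if (typeof text === 'string' && text.length > 0) {
      return { kind: 'token', text };
    }
  } catch {
    // Malformed or partial JSON: skip this frame.
  }
  // Role-only deltas and empty content carry nothing to render.
  return { kind: 'skip' };
}
```

Because the return type is a discriminated union, a switch on kind in the caller makes it hard to handle token while silently ignoring done.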

Step 7: When to use Wire RN vs roll your own

Use the XHR + SSE pattern above directly if you are streaming text into a single chat bubble (no component switching), you have exactly one LLM adapter, you do not need agent protocol support, and you do not mind reimplementing this for the next project.

Use Wire RN (wireai-rn@0.1.3) if you want streaming plus structured component output (LLM emits JSON, native component renders), you need multiple adapters (OpenAI, Anthropic, Gemini, Ollama, A2A), you want Zod-validated props, and you do not want to write this twice. The WireAI streaming layer is the production version of the above: same XHR + SSE pattern, adapter-specific parsing, error handling, retry logic. For a broader streaming overview see How to Stream LLM Responses in React Native.

What still does not work

  • HTTP/2 server push. Not handled. SSE only.
  • Reconnect on drop. If the network drops mid-stream, WireAI surfaces the error. No automatic resume yet (some providers do not support resume tokens anyway).
  • Background streaming. If the app is backgrounded, the stream pauses. Standard RN behavior, not a WireAI limitation.
  • Hermes on old RN. If you are on RN < 0.71, even XHR has quirks. Upgrade RN.

Where to start

npm install wireai-rn@0.1.3 zod

Streaming docs at getwireai.com/docs/quick-start. Glossary entry at getwireai.com/glossary/streaming-ui. Repo at github.com/chohra-med/wireai-rn (MIT). Wire RN is the AI tier of the AI Mobile Launcher boilerplate at aimobilelauncher.com. Weekly mobile-AI issue at codemeetai.substack.com.

Built by Malik Chohra. 7+ years React Native. Shipped DocMorris (9M users, regulated digital health, NFC for electronic health cards), Mindshine (4.3 to 4.9 App Store rating), and ScorePlay (AI spec workflows as App Lead).

FAQ

Why does Hermes not support ReadableStream on fetch?

Hermes is only the JavaScript engine; fetch itself comes from React Native, which ships a minimal polyfill layered on its XMLHttpRequest implementation. That polyfill does not include the WHATWG ReadableStream interface. The platform prioritized binary size and bridge compatibility over web-platform parity. The gap is tracked upstream (see RN issue #27741); until it is filled, libraries that depend on response.body.getReader() will not work natively on RN.

Can I use a ReadableStream polyfill instead of XHR?

You can try, but in my testing the available polyfills (web-streams-polyfill, others) do not play well with Hermes' fetch implementation. You end up with a polyfill on top of a polyfill, and edge cases multiply. XHR is older, simpler, and already works. That is why WireAI's streaming layer uses it.

Does this apply to Expo?

Yes. Expo Go and bare Expo both use Hermes by default in recent SDKs (51+). The same ReadableStream gap applies. The XHR workaround works identically in Expo. If you are on wireai-rn@0.1.3, the shim is built in.

How fast is XHR-based SSE streaming on mobile?

In our dogfood logs on iPhone 13 with gpt-4o-mini: tokens arrive at roughly 25 to 40 per second (model-bound), frame time stays under 16ms (60fps), JS thread CPU stays under 30%. The bottleneck is the model and the network, not the XHR parser.

Does WireAI handle Anthropic's streaming format?

Yes. Anthropic's SSE format uses event: lines alongside data: lines (content_block_delta, message_stop). The WireAI Anthropic adapter parses both. From a developer's perspective you configure the adapter and the parsing is invisible.

What about WebSocket-based streaming?

WebSocket on Hermes works (RN ships a usable WebSocket implementation). Some providers offer WebSocket-based streaming as an alternative to SSE. If your provider supports it, that is a valid path. SSE remains the most widely supported across OpenAI, Anthropic, Gemini, and self-hosted Ollama-compatible endpoints, so WireAI defaults to SSE.