September 17, 2025
I'm building Second Brain, a voice AI assistant composed of multiple agents: the conversational agent you talk to, a backend "brain", and separate document and email tools. The flow is simple in prose: the user speaks, the backend "brain" processes the request and calls tools and databases. Responses can arrive as several messages: intermediate status updates, a final answer, and occasionally unsolicited notifications triggered by backend events ("That email just arrived").
Vercel serverless functions feel like the natural place to put that brain: deploy it and call it. But as you try this in the real world, three problems emerge immediately:
- Functions are request→response, so the backend can't push an unsolicited message when no request is in flight.
- There's no persistent connection over which to stream intermediate status updates.
- Execution time limits mean a long, multi-step brain run can outlive a single invocation.
In this post I'll walk through the requirements, the tradeoffs, and a few practical architectures. I focus on architecture decisions: what should run where (Vercel, a realtime layer, queues, state stores) and how messages should be passed between the frontend, the backend brain, and realtime components.
Before we dive into solutions, it's helpful to be explicit about what the "brain" needs:
- Accept a user request and return a final answer.
- Send intermediate status updates while it works.
- Send unsolicited messages triggered by backend events, with no request in flight.
- Run multi-step LLM/tool loops that can take longer than one function invocation.
If you sketch that out, you quickly see the mismatch with pure serverless HTTP functions: they are request→response. No persistent connection, no background processes.
SSE (Server-Sent Events) is a uni-directional, server→client streaming mechanism over HTTP. The client opens an EventSource, the server responds with Content-Type: text/event-stream and keeps the HTTP response open, sending framed events as lines like:
event: progress
id: 1
data: checking email
The browser's EventSource API handles automatic reconnection and delivers those text events to your client. SSE is lightweight, great for text-based server push, and compatible with normal HTTP stacks (proxies, CDNs) that allow long-lived responses.
Important specifics for this post:
- SSE is one-way; client→server traffic still goes over ordinary HTTP requests.
- EventSource reconnects automatically and can resume from the last received event id (the Last-Event-ID header).
- The server must keep the HTTP response open for the lifetime of the stream.
Pros:
- EventSource is baked into browsers: no client library, and reconnection comes for free.
Cons:
- It's one-way; client→server messages still need separate HTTP requests.
- It requires a long-lived HTTP response, which standard serverless functions can't hold open.
- Every connected user holds a connection open for as long as the stream lives.
If you have a handful of users and intermittent messages, SSE is perfectly fine. At scale, it's expensive and brittle unless you put the SSE connection onto a stateful service (containers, managed realtime providers).
// client.js
const es = new EventSource('/api/assistant/stream?session=abc');
// Named events (like the 'progress' frame above) need addEventListener;
// onmessage only fires for events sent without an event name.
es.addEventListener('progress', (e) => console.log('progress', e.data));
es.onmessage = (e) => console.log('msg', e.data);
es.onerror = (err) => console.error('sse error', err);
Server-side you must format lines according to the text/event-stream spec and keep the HTTP response open. Vercel's serverless functions aren't designed for long-lived connections, so for meaningful SSE you'll typically need either a streaming-capable edge runtime or to host the SSE endpoint on a service designed for long-lived connections.
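For reference, a minimal formatter for that framing (the event, id, and data field names come from the spec; the helper itself is just illustrative):

// Build one text/event-stream frame: optional "event:" and "id:" lines,
// one "data:" line per payload line, terminated by a blank line.
function sseFrame({ event, id, data }) {
  let frame = '';
  if (event) frame += `event: ${event}\n`;
  if (id !== undefined) frame += `id: ${id}\n`;
  for (const line of String(data).split('\n')) frame += `data: ${line}\n`;
  return frame + '\n';
}

// sseFrame({ event: 'progress', id: 1, data: 'checking email' })
// => 'event: progress\nid: 1\ndata: checking email\n\n'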
Run a persistent WebSocket endpoint (e.g. AWS API Gateway WebSocket, a fleet of containers behind an ALB/NLB, or a managed provider). The frontend holds a socket, backend pushes to those sockets. For the purposes of this option, treat it as a pure websocket solution: the frontend connects to a long-running websocket endpoint that the backend uses for all server→client realtime messages.
Pros:
- Truly bidirectional: the server can push at any time, so intermediate updates, final answers, and unsolicited notifications all share one channel.
- Connection state lives in a tier designed for it, so delivery is predictable.
Cons:
- You now operate (or rent) stateful infrastructure outside Vercel.
- You own the connection lifecycle: auth, reconnection, presence, and horizontal scaling.
This option keeps all realtime delivery on the websocket layer; Vercel can still host short-lived business logic, but any server-originated push goes through the websocket tier.
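To make the backend-push half concrete, here's a sketch assuming the AWS API Gateway WebSocket flavor of this option; lookupConnectionId (mapping a user to the connection id captured at $connect) is a hypothetical helper:

import {
  ApiGatewayManagementApiClient,
  PostToConnectionCommand,
} from '@aws-sdk/client-apigatewaymanagementapi';

// The management API lets any backend process push to an open socket by id.
const client = new ApiGatewayManagementApiClient({
  endpoint: process.env.WS_MANAGEMENT_ENDPOINT, // https://{api-id}.execute-api.{region}.amazonaws.com/{stage}
});

async function pushToUser(userId, message) {
  const connectionId = await lookupConnectionId(userId); // hypothetical store
  await client.send(new PostToConnectionCommand({
    ConnectionId: connectionId,
    Data: JSON.stringify(message),
  }));
}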
Modern platforms (including Vercel Edge Functions and many serverless runtimes) can stream responses via ReadableStream or chunked Transfer-Encoding. That solves intermediate messages for a single request: you can flush multiple partial responses over one HTTP call.
Pros:
- Intermediate messages over a single request with no extra infrastructure; everything stays on Vercel.
Cons:
- The stream lives only as long as the request and the function's execution limit.
- It can't carry unsolicited messages: if no request is in flight, there's nothing to stream on.
// api/assistant/stream.js - Vercel Edge
export default async function handler(req) {
  const { readable, writable } = new TransformStream();
  const writer = writable.getWriter();
  const encoder = new TextEncoder();
  const send = (event, data) =>
    writer.write(encoder.encode(`event: ${event}\ndata: ${data}\n\n`));

  // Do the work without awaiting it here, so the Response below is
  // returned immediately and events flush to the client as they happen.
  (async () => {
    await send('init', 'checking email');
    await processInBackground((progress) => send('progress', progress));
    await send('done', 'all good');
    await writer.close();
  })();

  return new Response(readable, {
    headers: { 'Content-Type': 'text/event-stream' },
  });
}
That’s nice for one-off interactions but not a silver bullet.
The implementation keeps almost all brain logic in Vercel serverless functions while using a websocket as the realtime delivery channel for messages that cannot be handled synchronously.
Here's the overall setup:
Front-to-back requests from the voice agent to the brain are standard HTTP requests. The frontend calls a secured route such as /api/voiceToBrain. That route validates and authorizes the request and then invokes the brain entrypoint brainGo().
The /api/voiceToBrain route validates the request body (zod or similar) and calls brainGo(). brainGo() returns either a websocket message object or null. If it returns a websocket message object and the request is in a mode that allows an immediate HTTP response, /api/voiceToBrain returns that message in the HTTP response. If brainGo() returns null, /api/voiceToBrain returns 200 and the voice agent expects the reply via the websocket.
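A sketch of that route, assuming a Next.js-style API handler and a zod schema (the isAuthorized helper and the exact body shape are assumptions):

import { z } from 'zod';

const Body = z.object({
  userId: z.string(),
  recentContext: z.array(z.any()).optional(),
});

export default async function handler(req, res) {
  if (!(await isAuthorized(req))) return res.status(401).end(); // hypothetical auth
  const { userId, recentContext } = Body.parse(req.body);

  // brainGo resolves to a websocket message object, or null when the
  // reply will arrive over the websocket instead.
  const message = await brainGo(userId, recentContext);
  return res.status(200).json({ message });
}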
The frontend tool (askBrain) handles both outcomes: if it receives an immediate message in the HTTP response, it treats that as the result; if it receives null, it returns a short explanatory message like "the brain is working, the response will arrive later" as the result of the askBrain call, so the voice agent knows that the call was successful but the result will arrive later. Note that this message is not spoken to the user; it is only given to the voice agent.
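On the client, roughly (the fetch shape and field names are assumptions):

async function askBrain(userId, recentContext) {
  const res = await fetch('/api/voiceToBrain', {
    method: 'POST',
    headers: { 'content-type': 'application/json' },
    body: JSON.stringify({ userId, recentContext }),
  });
  const { message } = await res.json();
  if (message) return message; // immediate answer over HTTP

  // Returned to the voice agent only; never spoken to the user.
  return 'the brain is working, the response will arrive later';
}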
brainGo is the central server-side function that implements the brain logic. It accepts userId and recentContext and returns a Promise<Message | null>. Behavior:
- If a brain is already running for that user, it returns null immediately. The user's message is still recorded in the database, though, so it will be processed when the current brain invocation is done.
- Otherwise it runs doOneBrainLoop(recentContext). doOneBrainLoop returns { needsAnotherLoop: boolean, websocketMessage?: Message }.
- If needsAnotherLoop is false and there is a websocketMessage, brainGo returns that message directly (this becomes the HTTP response when invoked via /api/voiceToBrain).
- If needsAnotherLoop is true and there is a message, brainGo sends the message over the websocket (sendWebsocketMessage) and then continues into the next loop.
- If needsAnotherLoop is false and there is no message, brainGo returns null.
- If needsAnotherLoop is true and there is no message, brainGo proceeds to the next loop.
- needsAnotherLoop is primarily determined by whether the LLM calls the sayToUserAndContinue tool or the sayToUserAndQuit tool.
- If brainGo detects it will run out of execution time and must continue later, it persists state, triggers the reinvocation path (brainToBrain), and then returns null.

doOneBrainLoop accepts recentContext as an optional argument so the first loop can include the most recent client-side message (which may not yet be persisted). It fetches persisted messages and tool results as needed and composes the full context for LLM/tool calls. It returns whether another loop is required and an optional websocketMessage to deliver now.
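Here's a minimal sketch of that loop. doOneBrainLoop, sendWebsocketMessage, and reinvokeOnTimeout are described in this post; the lock and timing helpers are hypothetical stand-ins:

async function brainGo(userId, recentContext) {
  // Hypothetical per-user lock: only one brain runs per user at a time.
  if (!(await acquireBrainLock(userId))) return null;
  try {
    let context = recentContext;
    while (true) {
      const { needsAnotherLoop, websocketMessage } = await doOneBrainLoop(context);
      context = undefined; // only the first loop carries the unpersisted message

      if (!needsAnotherLoop) return websocketMessage ?? null; // HTTP-eligible path
      if (websocketMessage) await sendWebsocketMessage(userId, websocketMessage);

      if (remainingExecutionMs() < LOOP_BUDGET_MS) { // hypothetical timing check
        await reinvokeOnTimeout(userId); // persist state, continue via brainToBrain
        return null;
      }
    }
  } finally {
    await releaseBrainLock(userId);
  }
}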
Back-to-front messages have two delivery paths in the current architecture:
- When brainGo is invoked directly by /api/voiceToBrain and brainGo determines it is finished and wants to send a final message, it returns that message via the HTTP response. This is the fastest path for immediate answers.
- When the brain is running via brainToBrain, messages are sent via the websocket. Breadcrumbs and backend event notifications always go down the websocket.

The voice interface has handlers for both delivery methods: it can accept an immediate HTTP response from askBrain or listen to the websocket for system messages and breadcrumbs. The UI tracks websocket connection status and presence so it can surface delays or reconnections.
Tools that send messages to the user include a doneAfterThis flag. At low levels, tools indicate whether the message should be terminal (suitable for the HTTP path) or not (must be delivered via the websocket). This flag propagates up through doOneBrainLoop to brainGo so delivery decisions are deterministic.
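In code, the mapping might look like this; the tool names come from above, the rest is a sketch:

// Translate the LLM's tool call into the loop decision. sayToUserAndQuit
// marks its message terminal (doneAfterThis), so it is eligible for the
// HTTP response; sayToUserAndContinue forces websocket delivery plus
// another loop.
function toLoopResult(toolCall) {
  switch (toolCall.name) {
    case 'sayToUserAndQuit':
      return { needsAnotherLoop: false, websocketMessage: toolCall.message };
    case 'sayToUserAndContinue':
      return { needsAnotherLoop: true, websocketMessage: toolCall.message };
    default:
      return { needsAnotherLoop: true }; // ordinary tool call: keep looping
  }
}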
/api/brainToBrain is the reinvocation path: when the brain would exceed the current invocation's time limit, it persists state and enqueues or reinvokes a Vercel function to continue processing (this is the reinvokeOnTimeout() function). In this reinvocation mode, returning messages via HTTP is not allowed: all messages must use the websocket.
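A sketch of that handoff (persistBrainState, the base URL, and the internal bearer token are all assumptions):

// Persist the in-flight loop state, then call /api/brainToBrain so a fresh
// invocation picks up where this one left off.
async function reinvokeOnTimeout(userId) {
  await persistBrainState(userId); // hypothetical: save loop state to the DB
  await fetch(`${process.env.BASE_URL}/api/brainToBrain`, {
    method: 'POST',
    headers: {
      'content-type': 'application/json',
      authorization: `Bearer ${process.env.INTERNAL_SECRET}`, // assumed internal auth
    },
    body: JSON.stringify({ userId }),
  });
}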
There is a websocket→brain HTTP route (kept for compatibility) which validates schema and auth but is effectively a no-op for client→brain messages in normal operation because front-to-back traffic flows over HTTP. This allows other messaging setups in the future, like sending user messages to the brain via the websocket.
If the product matures and the realtime load justifies it, I’d move to a globally distributed connection tier (containers + edge routing) and a small control-plane service that assigns brains to connections, persists state, and deals with reconnections. I’d keep the Vercel brain as a migration path and only move latency-critical parts out of it.
Or - hope that Vercel builds a websocket service and sprinkles their magic dust on it. I'd be user #1.
The serverless model is amazing for a huge class of problems, but it’s intentionally constrained. When your app needs durable connections and background-initiated messages, accept that you’ll need a small amount of stateful infrastructure.
You can buy performance at the cost of complexity by moving more functionality into the stateful compute.