The hot path is the visitor-message → first-token pipeline. It has a hard 1-second p95 contract: beyond that, the perceived "is this thing alive?" tension breaks. Everything that doesn't have to happen on the hot path is pushed off it.

Latency budget

Target breakdown for first-token at p95:

| Phase | p95 target |
|---|---|
| HTTP receive + auth | 30 ms |
| Curated short-circuit check | 5 ms |
| Embed query | 120 ms |
| Vector search (ANN) | 80 ms |
| Rerank | 120 ms |
| Prompt assembly | 10 ms |
| LLM time-to-first-token | 500 ms |
| Total | ~865 ms |

That leaves roughly 135 ms of headroom against the 1-second contract for everything not listed. Anything beyond first-token streams out incrementally, so full-response time is bounded by token rate, not by this budget.
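The arithmetic behind the budget can be checked in a few lines. This is only a sketch: the phase keys are hypothetical names chosen for illustration, while the millisecond values and the 1-second contract come from the table above.

```python
# Per-phase p95 targets in milliseconds, copied from the budget table.
# The dictionary keys are illustrative names, not identifiers from the codebase.
BUDGET_MS = {
    "http_receive_auth": 30,
    "curated_short_circuit": 5,
    "embed_query": 120,
    "vector_search": 80,
    "rerank": 120,
    "prompt_assembly": 10,
    "llm_ttft": 500,
}

CONTRACT_MS = 1000  # the hard 1-second p95 contract


def headroom_ms(budget: dict[str, int], contract_ms: int = CONTRACT_MS) -> int:
    """Return the slack left once every phase has spent its p95 target."""
    return contract_ms - sum(budget.values())


if __name__ == "__main__":
    print("total:", sum(BUDGET_MS.values()), "ms")
    print("headroom:", headroom_ms(BUDGET_MS), "ms")
```

A check like this could sit in a test so that raising one phase's target forces a conscious decision about where the time comes from.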

Hard rules

These are enforced by code review and by tests:

  1. No DB writes on the hot path. Persistence is async after the stream ends.
  2. No synchronous webhooks. Outgoing webhooks are dispatched as queue jobs.
  3. No server-side retries. If a provider fails mid-stream, the user sees a graceful error and the widget auto-retries client-side. The server doesn't loop.
  4. No N+1 queries. All reads are batched. Recent history comes from Redis (conv:{id}:history), not Postgres.
  5. One LLM call per turn. No multi-step agent reasoning that fans out into multiple model calls.
  6. Short-circuit DB writes ride below emit('done'). The human-takeover, human-pending, and human-intent short-circuits used to persist the visitor's Message + stamp conversation.attribution before the SSE close event reached the client. Audit 2026-05-30 moved these below emit('done') via persistShortCircuitVisitorTurn() so the stream closes immediately and the writes run after. The escalation_offered_at stamp emitted by the main LLM path is deferred to the same post-emit slot — next turn's tool loop re-reads the row from DB so the lag doesn't change tool gating semantics.
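Rules 1 and 6 come down to an ordering guarantee: the close event goes out first and persistence runs afterwards. A minimal sketch of that ordering follows, assuming an async handler. `run_turn`, `emit`, and `persist` are hypothetical stand-ins and not the real `persistShortCircuitVisitorTurn()` implementation.

```python
import asyncio
from typing import Awaitable, Callable, Iterable

# Hypothetical signature: emit(event_name, data) writes one SSE event.
Emit = Callable[[str, str], Awaitable[None]]


async def run_turn(
    tokens: Iterable[str],
    emit: Emit,
    persist: Callable[[], Awaitable[None]],
) -> None:
    """Stream tokens, close the stream, and only then persist the turn."""
    for token in tokens:
        await emit("token", token)
    await emit("done", "")  # the client sees the stream close here
    await persist()         # DB writes run strictly after 'done'


def demo() -> list[str]:
    """Record the order of side effects for one simulated turn."""
    log: list[str] = []

    async def emit(event: str, data: str) -> None:
        log.append(f"emit:{event}")

    async def persist() -> None:
        log.append("persist")

    asyncio.run(run_turn(["Hel", "lo"], emit, persist))
    return log


if __name__ == "__main__":
    print(demo())
```

An ordering assertion of this kind is one way the "enforced by tests" claim could be exercised: the test fails as soon as a write sneaks in ahead of the close event.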

What's off the hot path

Everything below is dispatched after the stream completes. None of it blocks the visitor:

Caching

Two caches keep the hot path tight:

Streaming mechanics

SSE is dead simple: keep-alive HTTP, write data: {...}\n\n per token, flush. The widget reads via EventSource, or via fetch + a stream reader when the message has to go out as a POST, which EventSource cannot send (it only issues GET requests).
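The framing described above fits in a few lines. The sketch below assumes each token is wrapped in a small JSON object. The `{"token": ...}` payload shape is an illustration, since the source only shows `data: {...}`.

```python
import json


def sse_frame(payload: dict) -> str:
    """Serialize one payload as a single SSE data frame."""
    # json.dumps escapes newlines inside strings, so the body is always one
    # line and a single "data:" field is enough.
    body = json.dumps(payload, separators=(",", ":"))
    # The blank line (second "\n") terminates the event.
    return f"data: {body}\n\n"


def frames_for(tokens: list[str]) -> list[str]:
    """Build one frame per token, in order."""
    return [sse_frame({"token": token}) for token in tokens]


if __name__ == "__main__":
    for frame in frames_for(["Hel", "lo"]):
        print(repr(frame))
```

Keeping the payload as JSON means a token containing a newline cannot accidentally end the event early.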

Critically, the SSE response is constructed before any RAG work runs. We start writing headers immediately on request receipt so any proxy in front of us (Cloudflare, load balancer) commits to streaming early. By the time tokens arrive, the connection is already open.
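To illustrate the ordering, here is a sketch in which the response generator yields its first bytes before retrieval starts. Sending an SSE comment line (a line starting with `:`), which clients ignore, is an assumption about how the early write could be done. Whether headers are actually flushed at that point depends on the server framework and any proxies. `respond`, `retrieve`, and `generate` are hypothetical names.

```python
import asyncio
import json
from typing import AsyncIterator, Awaitable, Callable


async def respond(
    retrieve: Callable[[], Awaitable[list[str]]],
    generate: Callable[[list[str]], AsyncIterator[str]],
) -> AsyncIterator[str]:
    """Yield SSE chunks, opening the stream before any RAG work runs."""
    # SSE comment line: ignored by clients, but gives the server something
    # to write immediately so the connection is committed to streaming.
    yield ": open\n\n"
    context = await retrieve()  # embed + search + rerank would happen here
    async for token in generate(context):
        yield f"data: {json.dumps({'token': token})}\n\n"


def demo() -> list[str]:
    """Record when the first chunk is seen relative to retrieval."""
    order: list[str] = []

    async def retrieve() -> list[str]:
        order.append("retrieve")
        return ["doc"]

    async def generate(context: list[str]) -> AsyncIterator[str]:
        for token in ("Hi", "!"):
            yield token

    async def main() -> None:
        async for chunk in respond(retrieve, generate):
            order.append(chunk.strip())

    asyncio.run(main())
    return order


if __name__ == "__main__":
    print(demo())
```

In the demo, the opening chunk is observed before `retrieve` is ever called, which is the property the paragraph above relies on.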

Where the spans live

OpenTelemetry spans wrap each phase:

Honeycomb / Grafana shows the p95 of each. When the budget breaks, the span heatmap usually points right at the offender.
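The same question the heatmap answers ("which phase blew its target?") can be sketched offline from raw span durations. This is a stdlib stand-in, not the OpenTelemetry API: it assumes durations have already been exported per phase, uses a nearest-rank p95, and the phase names are hypothetical.

```python
def p95(samples: list[float]) -> float:
    """Nearest-rank 95th percentile of a non-empty sample list."""
    ordered = sorted(samples)
    # Integer ceiling of 0.95 * n, avoiding floating-point rounding.
    rank = (95 * len(ordered) + 99) // 100  # 1-based rank
    return ordered[rank - 1]


def over_budget(
    durations_ms: dict[str, list[float]],
    budget_ms: dict[str, float],
) -> list[str]:
    """Return phases whose observed p95 exceeds its target, worst first."""
    overruns: dict[str, float] = {}
    for phase, samples in durations_ms.items():
        excess = p95(samples) - budget_ms[phase]
        if excess > 0:
            overruns[phase] = excess
    return sorted(overruns, key=lambda phase: overruns[phase], reverse=True)


if __name__ == "__main__":
    observed = {
        "embed_query": [100.0] * 18 + [300.0] * 2,
        "rerank": [90.0] * 20,
    }
    targets = {"embed_query": 120.0, "rerank": 120.0}
    print(over_budget(observed, targets))
```

In practice the observability backend computes these percentiles. The sketch only shows the comparison being made against the budget table.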

Failure modes

| Failure | Behavior |
|---|---|
| LLM provider 5xx mid-stream | Stream emits an error event. Widget auto-retries up to 3 times. |
| Vector store unreachable | Pipeline returns the question with no grounding. Confidence is 0.3 → low_confidence flag → "I don't know" answer. |
| Embed call times out | Same: proceed with no grounding, flag low_confidence. |
| Quota exceeded | Caught at /init, never reaches messages. 429 returned. |
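The client-side retry in the first row can be sketched as a bounded loop. The real widget runs in the browser, so this only illustrates the logic. It assumes "up to 3 times" means three retries after the initial attempt, and it omits any backoff. `StreamError` and `stream_with_retries` are hypothetical names.

```python
from typing import Callable

MAX_RETRIES = 3  # from the failure table; read here as retries after the first attempt


class StreamError(Exception):
    """Hypothetical error raised when the stream emits an error event."""


def stream_with_retries(
    open_stream: Callable[[], str],
    max_retries: int = MAX_RETRIES,
) -> str:
    """Call open_stream until it succeeds or the retry allowance is spent."""
    failures = 0
    while True:
        try:
            return open_stream()
        except StreamError:
            failures += 1
            if failures > max_retries:
                raise  # give up and surface the graceful error to the visitor
```

Because the loop lives in the client, the server never holds a request open while retrying, which keeps the no-retries rule intact on the hot path.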

The principle: the visitor always gets a response, even if it's "I'm not sure". The agent is allowed to be ignorant; it isn't allowed to silently break.
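A minimal sketch of the degrade-instead-of-fail behavior for the retrieval rows follows. The 0.3 confidence comes from the table. The 0.5 threshold for the low_confidence flag is an assumption, as are the `Retrieval` type and the `retrieve_or_degrade` helper.

```python
from dataclasses import dataclass, field
from typing import Callable

NO_GROUNDING_CONFIDENCE = 0.3   # value from the failure table
LOW_CONFIDENCE_THRESHOLD = 0.5  # assumption: the real cutoff is not stated


@dataclass
class Retrieval:
    """Hypothetical retrieval result; defaults describe the ungrounded case."""
    passages: list[str] = field(default_factory=list)
    confidence: float = NO_GROUNDING_CONFIDENCE

    @property
    def low_confidence(self) -> bool:
        return self.confidence < LOW_CONFIDENCE_THRESHOLD


def retrieve_or_degrade(
    search: Callable[[str], Retrieval],
    question: str,
) -> Retrieval:
    """Run retrieval, falling back to an ungrounded result on failure."""
    try:
        return search(question)
    except (ConnectionError, TimeoutError):
        # Vector store unreachable or embed call timed out: continue with
        # no grounding so the visitor still gets an answer.
        return Retrieval()
```

The caller can then branch on `low_confidence` to produce the "I don't know" answer rather than raising.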