The hot path is the visitor-message → first-token pipeline. It has a hard 1 second p95 contract — beyond that, the perceived "is this thing alive?" tension breaks. Everything that doesn't have to happen on the hot path is pushed off it.
Target breakdown for first-token at p95:
| Phase | p95 target |
|---|---|
| HTTP receive + auth | 30 ms |
| Curated short-circuit check | 5 ms |
| Embed query | 120 ms |
| Vector search (ANN) | 80 ms |
| Rerank | 120 ms |
| Prompt assembly | 10 ms |
| LLM time-to-first-token | 500 ms |
| Total | ~865 ms |
The phases sum to 865 ms, which leaves roughly 135 ms of headroom under the 1-second contract. Anything beyond first-token streams out incrementally, so full-response time is bounded by token rate, not by this budget.
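As a sanity check on the arithmetic, the budget table can be written down as data and summed. This is a minimal sketch: `BUDGET_MS`, `CONTRACT_MS`, and `phasesOverBudget` are illustrative names, not identifiers from the codebase.

```typescript
// Hypothetical encoding of the p95 budget table; values are copied from the table above.
interface PhaseBudget {
  phase: string;
  p95TargetMs: number;
}

const BUDGET_MS: PhaseBudget[] = [
  { phase: "HTTP receive + auth", p95TargetMs: 30 },
  { phase: "Curated short-circuit check", p95TargetMs: 5 },
  { phase: "Embed query", p95TargetMs: 120 },
  { phase: "Vector search (ANN)", p95TargetMs: 80 },
  { phase: "Rerank", p95TargetMs: 120 },
  { phase: "Prompt assembly", p95TargetMs: 10 },
  { phase: "LLM time-to-first-token", p95TargetMs: 500 },
];

// The 1-second p95 contract for first-token.
const CONTRACT_MS = 1000;

function totalBudgetMs(budget: PhaseBudget[]): number {
  return budget.reduce((sum, b) => sum + b.p95TargetMs, 0);
}

function headroomMs(budget: PhaseBudget[]): number {
  return CONTRACT_MS - totalBudgetMs(budget);
}

// Given measured p95 values per phase (ms), list the phases that exceed their target.
// Phases missing from `measured` are treated as within budget.
function phasesOverBudget(
  budget: PhaseBudget[],
  measured: { [phase: string]: number },
): string[] {
  return budget
    .filter((b) => (measured[b.phase] ?? 0) > b.p95TargetMs)
    .map((b) => b.phase);
}
```

A check like `phasesOverBudget(BUDGET_MS, measuredP95)` could run in a test or dashboard job to flag a regressing phase before the total contract is at risk.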
These hot-path rules are enforced by code review and by tests:

- Conversation history is read from the cache (`conv:{id}:history`), not Postgres.
- No persistence writes before `emit('done')`. The human-takeover, human-pending, and human-intent short-circuits used to persist the visitor's Message and stamp `conversation.attribution` before the SSE close event reached the client. Audit 2026-05-30 moved these writes below `emit('done')` via `persistShortCircuitVisitorTurn()`, so the stream closes immediately and the writes run afterwards. The `escalation_offered_at` stamp emitted by the main LLM path is deferred to the same post-emit slot; the next turn's tool loop re-reads the row from the DB, so the lag doesn't change tool-gating semantics.

Everything below is dispatched after the stream completes. None of it blocks the visitor:
- […] `SignedDispatcher` when a visitor submits the lead form).
- […] `/init` time that queues a `CrawlPageJob` for the visited URL when auto-indexing is enabled. Not a hot-path job, but worth knowing where auto-index runs.

Two caches keep the hot path tight:
- `rag:retrieve:{agentId}:{hash(query|currentPageUrl)}`, 30-minute TTL. The same question on the same page hits the cache. Invalidated when sources change.
- `conv:{convId}:history`, 2-hour TTL, capped at 12 messages (6 turns). Every turn reads history from this key instead of Postgres.
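The two key shapes, TTLs, and the 12-message cap can be sketched with an in-memory stand-in. Everything here is an assumption for illustration: the real cache backend and hash function are not named in this document, so `TtlCache` is a plain `Map` wrapper and `fnv1a` is a placeholder hash. Refreshing the TTL on every append is also an assumption.

```typescript
// Placeholder hash (FNV-1a, 32-bit, over UTF-16 code units). The real hash is not specified.
function fnv1a(input: string): string {
  let h = 0x811c9dc5;
  for (let i = 0; i < input.length; i++) {
    h ^= input.charCodeAt(i);
    h = Math.imul(h, 0x01000193);
  }
  return (h >>> 0).toString(16);
}

function retrievalKey(agentId: string, query: string, currentPageUrl: string): string {
  return "rag:retrieve:" + agentId + ":" + fnv1a(query + "|" + currentPageUrl);
}

function historyKey(convId: string): string {
  return "conv:" + convId + ":history";
}

const RETRIEVAL_TTL_MS = 30 * 60 * 1000; // 30 minutes
const HISTORY_TTL_MS = 2 * 60 * 60 * 1000; // 2 hours
const HISTORY_CAP = 12; // 12 messages = 6 turns

// In-memory stand-in for the cache; entries expire lazily on read.
class TtlCache<V> {
  private store = new Map<string, { value: V; expiresAt: number }>();
  private now: () => number;

  constructor(now: () => number = Date.now) {
    this.now = now;
  }

  get(key: string): V | undefined {
    const hit = this.store.get(key);
    if (!hit) return undefined;
    if (hit.expiresAt <= this.now()) {
      this.store.delete(key);
      return undefined;
    }
    return hit.value;
  }

  set(key: string, value: V, ttlMs: number): void {
    this.store.set(key, { value, expiresAt: this.now() + ttlMs });
  }

  // Called when an agent's sources change (retrieval cache invalidation).
  delete(key: string): void {
    this.store.delete(key);
  }
}

// Append a message, keep only the newest HISTORY_CAP entries, and refresh the TTL.
function appendHistory(cache: TtlCache<string[]>, convId: string, message: string): string[] {
  const key = historyKey(convId);
  const next = (cache.get(key) ?? []).concat(message).slice(-HISTORY_CAP);
  cache.set(key, next, HISTORY_TTL_MS);
  return next;
}
```

Because the retrieval key hashes `query|currentPageUrl`, the same question asked on a different page produces a different key and misses the cache, which matches the "same question on the same page" behavior described above.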
SSE is dead simple: a keep-alive HTTP response, write `data: {...}\n\n` per token, then flush. The widget reads via `EventSource` (or `fetch` plus a stream reader when the message is sent by POST, since `EventSource` only issues GET requests).
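The wire format described above is small enough to sketch on both ends. This is an illustrative sketch, not the widget's actual code: `encodeSseData` and `SseParser` are hypothetical names, the `{ token }` payload shape is assumed, and the parser is simplified to `\n\n` frame separators and `data:` lines only.

```typescript
// Server side: one frame per token, in the `data: {...}\n\n` shape.
function encodeSseData(payload: unknown): string {
  return "data: " + JSON.stringify(payload) + "\n\n";
}

// Client side: incremental parser for the fetch + reader path.
// Network chunks can split a frame anywhere, so unfinished input is buffered.
class SseParser {
  private buffer = "";

  // Feed one decoded text chunk; returns the data payloads of every completed frame.
  push(chunk: string): string[] {
    this.buffer += chunk;
    const events: string[] = [];
    let sep = this.buffer.indexOf("\n\n");
    while (sep !== -1) {
      const frame = this.buffer.slice(0, sep);
      this.buffer = this.buffer.slice(sep + 2);
      const dataLines = frame
        .split("\n")
        .filter((line) => line.indexOf("data:") === 0)
        .map((line) => line.slice(5).replace(/^ /, "")); // strip one optional leading space
      if (dataLines.length > 0) events.push(dataLines.join("\n"));
      sep = this.buffer.indexOf("\n\n");
    }
    return events;
  }
}
```

On the `fetch` path, each chunk read from the response body would be decoded to text and passed to `push`, and each returned payload parsed with `JSON.parse`.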
Critically, the SSE response is constructed before any RAG work runs. We start writing headers immediately on request receipt so any proxy in front of us (Cloudflare, load balancer) commits to streaming early. By the time tokens arrive, the connection is already open.
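The ordering described here, together with the earlier rule that writes run only after `emit('done')`, can be sketched as a single handler. This is a sketch under stated assumptions: `StreamResponse` is a stand-in interface whose method names mirror Node's `http.ServerResponse`, `runPipeline` and `persistTurn` are hypothetical callbacks, and the `event: done` / `event: error` frame payloads are illustrative.

```typescript
// Minimal surface of a streaming HTTP response (stand-in for the sketch).
interface StreamResponse {
  writeHead(status: number, headers: { [name: string]: string }): void;
  flushHeaders(): void;
  write(chunk: string): void;
  end(): void;
}

type TokenSink = (token: string) => void;

function handleVisitorMessage(
  res: StreamResponse,
  runPipeline: (onToken: TokenSink) => Promise<void>, // hypothetical: embed, search, rerank, LLM
  persistTurn: () => Promise<void>, // hypothetical: post-stream writes
): Promise<void> {
  // 1. Commit to streaming before any retrieval work starts,
  //    so proxies in front of the service see a streaming response early.
  res.writeHead(200, {
    "Content-Type": "text/event-stream",
    "Cache-Control": "no-cache",
  });
  res.flushHeaders();

  // 2. Run the pipeline, forwarding each token as one SSE frame.
  return runPipeline((token) => {
    res.write("data: " + JSON.stringify({ token }) + "\n\n");
  })
    .then(
      () => res.write("event: done\ndata: {}\n\n"),
      () => res.write("event: error\ndata: {}\n\n"),
    )
    .then(() => {
      // 3. Close the stream first...
      res.end();
      // 4. ...then persist. A failure here is logged and never reaches the visitor.
      return persistTurn().catch((err) => {
        console.error("post-stream persist failed", err);
      });
    });
}
```

The returned promise exists mainly so tests can wait for the post-stream work; a production handler would not make the visitor wait on it.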
OpenTelemetry spans wrap each phase:

- `widget.message.receive`
- `rag.curated.match`
- `rag.embed`
- `rag.vector.search`
- `rag.rerank`
- `rag.prompt.assemble`
- `rag.llm.first_token`
- `rag.llm.stream`
- `rag.persist.async`

Honeycomb / Grafana shows the p95 of each. When the budget breaks, the span heatmap usually points right at the offender.
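To make "the p95 of each span" concrete, here is a small in-memory stand-in. It is not the OpenTelemetry API: `PhaseTimer` is a hypothetical helper that times synchronous phases by span name and computes a nearest-rank p95, which is one common percentile definition and may differ from what Honeycomb or Grafana compute.

```typescript
class PhaseTimer {
  private samples = new Map<string, number[]>();
  private now: () => number;

  // The clock is injectable so tests can control time.
  constructor(now: () => number = Date.now) {
    this.now = now;
  }

  // Time a synchronous phase under a span name such as "rag.embed".
  span<T>(name: string, fn: () => T): T {
    const start = this.now();
    try {
      return fn();
    } finally {
      // Recorded even if the phase throws, so failures still show up in timings.
      this.record(name, this.now() - start);
    }
  }

  record(name: string, durationMs: number): void {
    const xs = this.samples.get(name) ?? [];
    xs.push(durationMs);
    this.samples.set(name, xs);
  }

  // Nearest-rank p95: the value at 1-based rank ceil(0.95 * n) in sorted order.
  p95(name: string): number | undefined {
    const xs = (this.samples.get(name) ?? []).slice().sort((a, b) => a - b);
    if (xs.length === 0) return undefined;
    const rank = Math.ceil((95 * xs.length) / 100);
    return xs[rank - 1];
  }
}
```

Comparing `timer.p95("rag.rerank")` against the 120 ms target from the budget table is the same check the span heatmap makes visually.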
| Failure | Behavior |
|---|---|
| LLM provider 5xx mid-stream | Stream emits an error event. Widget auto-retries up to 3 times. |
| Vector store unreachable | Pipeline proceeds with the question but no grounding. Confidence is 0.3 → `low_confidence` flag → "I don't know" answer. |
| Embed call times out | Same — proceed with no grounding, flag low_confidence. |
| Quota exceeded | Caught at /init, never reaches messages. 429 returned. |
The principle: the visitor always gets a response, even if it's "I'm not sure". The agent is allowed to be ignorant; it isn't allowed to silently break.
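The "degrade, don't break" rows for the vector store and embed call can be sketched as a wrapper around retrieval. This is a minimal sketch with assumed names: `retrieveOrDegrade`, `withTimeout`, and the `Grounding` shape are hypothetical. The 0.3 confidence and the low-confidence flag come from the table above; the timeout value is left to the caller.

```typescript
interface Grounding {
  chunks: string[];
  // Set only on the degraded path; on success, confidence is scored elsewhere in the pipeline.
  confidence?: number;
  lowConfidence: boolean;
}

// Reject if `work` has not settled within `ms` milliseconds.
function withTimeout<T>(work: Promise<T>, ms: number): Promise<T> {
  return new Promise<T>((resolve, reject) => {
    const timer = setTimeout(() => reject(new Error("timed out after " + ms + " ms")), ms);
    work.then(
      (value) => { clearTimeout(timer); resolve(value); },
      (err) => { clearTimeout(timer); reject(err); },
    );
  });
}

// Run retrieval (embed + vector search); on any failure or timeout,
// continue with no grounding instead of failing the visitor's request.
async function retrieveOrDegrade(
  retrieve: () => Promise<string[]>, // hypothetical retrieval callback
  timeoutMs: number,
): Promise<Grounding> {
  try {
    const chunks = await withTimeout(retrieve(), timeoutMs);
    return { chunks, lowConfidence: false };
  } catch {
    // Per the failure table: no grounding, confidence 0.3, low_confidence flagged.
    return { chunks: [], confidence: 0.3, lowConfidence: true };
  }
}
```

The caller never sees an exception from retrieval, so the only remaining decision is how to answer when `lowConfidence` is set, which the table resolves as an "I don't know" reply.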