Chat Flow

End-to-end walkthrough of POST /character/chat. Per-session lock, query rewrite, hybrid retrieval across Qdrant and TypeSense, rerank, stream, persist.

The chat endpoint streams over SSE but validates the session before the stream starts. Session-not-found and user-mismatch return a plain HTTP 404, not an SSE event. Once the SSE stream begins, errors are emitted as events carrying a status code: 429 for lock contention, 502 for LLM failures.

Sequence

Step 1: Session Validation

prepare_chat loads three contexts in sequence, each cached in Redis with a 3600s TTL:

Character context from chat:char:{character_id} - the full character row plus system prompt.
Session context from chat:sess:{session_id} - the session row plus last 10 messages.
Book context from chat:book:{book_id} - title, author, description, category.

On any cache miss the data is fetched from Supabase and written back to the cache. If the session does not exist or its user_id does not match the user_id in the request body, the endpoint returns HTTP 404 (not an SSE event). The backend trusts the frontend for session lifecycle: it does not create, list, or delete chat sessions.
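The cache-aside pattern described above can be sketched as follows. This is a minimal illustration, not the backend's actual code: InMemoryCache stands in for Redis so the example runs without a server, and load_cached, session_is_valid, and the fetch callback (a placeholder for the Supabase lookup) are hypothetical names.

```python
import json
from typing import Callable, Optional

CACHE_TTL_SECONDS = 3600  # TTL stated in the text


class InMemoryCache:
    """Stand-in for Redis so the sketch runs without a server."""

    def __init__(self) -> None:
        self._data: dict = {}
        self.ttls: dict = {}

    def get(self, key: str) -> Optional[str]:
        return self._data.get(key)

    def set(self, key: str, value: str, ex: int) -> None:
        self._data[key] = value
        self.ttls[key] = ex  # recorded only; expiry is not simulated


def load_cached(cache, key: str, fetch: Callable[[], Optional[dict]]) -> Optional[dict]:
    """Return the cached value for key, or fetch it and write it back."""
    raw = cache.get(key)
    if raw is not None:
        return json.loads(raw)
    row = fetch()  # hypothetical database lookup
    if row is not None:
        cache.set(key, json.dumps(row), ex=CACHE_TTL_SECONDS)
    return row


def session_is_valid(session: Optional[dict], request_user_id: str) -> bool:
    """A False result corresponds to the HTTP 404 described in the text."""
    return session is not None and session.get("user_id") == request_user_id
```

Because the validity check runs before any streaming starts, a failed check can be returned as an ordinary HTTP response.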

Step 2: Per-Session Lock

The endpoint tries to acquire a per-session write lock: SET chat:lock:{session_id} {hex_token} NX EX 30. If the lock is already held, the backend emits a 429 SSE event followed by a done event, and closes the stream. The hex token is a random value that the backend stores locally so it can release only its own lock.

The lock prevents two concurrent chat requests for the same session from interleaving their writes to the messages table. A 30-second TTL guarantees that a crashed worker cannot deadlock a session forever; the lock will expire on its own.
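A minimal sketch of the acquire step, assuming a redis-py style client whose set method accepts nx and ex keyword arguments. FakeRedis is a stand-in used only so the example runs without a server, and try_acquire_lock is a hypothetical name.

```python
import secrets
from typing import Optional

LOCK_TTL_SECONDS = 30  # TTL stated in the text


class FakeRedis:
    """Minimal stand-in that mimics SET ... NX; expiry is not simulated."""

    def __init__(self) -> None:
        self.store: dict = {}

    def set(self, key, value, nx=False, ex=None):
        if nx and key in self.store:
            return None  # redis-py returns None when NX blocks the write
        self.store[key] = value
        return True


def try_acquire_lock(client, session_id: str) -> Optional[str]:
    """Return the lock token on success, or None if the lock is already held."""
    token = secrets.token_hex(16)  # random hex token kept by this worker
    acquired = client.set(
        f"chat:lock:{session_id}", token, nx=True, ex=LOCK_TTL_SECONDS
    )
    return token if acquired else None
```

A None result corresponds to the 429 case: the caller emits the error event and the done event, then closes the stream.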

Step 3: Query Rewrite

A separate LLM call (temperature 0, max 200 tokens) takes the user's message plus the character's psychology and produces two outputs:

a narrative query (resolved for pronouns, expanded to a full descriptive sentence) used for vector search
a keyword query (key nouns, names, and distinctive phrases) used for BM25

The result is cached in Redis at chat:qr:{character_id}:{hash} where the hash is the SHA-256 of the user message concatenated with the last four conversation messages. On a cache hit, that is, the same message arriving with the same recent history, the rewriter LLM call is skipped entirely.

If the rewriter fails, the backend degrades gracefully: the raw user message is used as both the narrative and the keyword query.
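The cache key and the fallback can be sketched like this. The function names, the RewrittenQuery type, and the newline separator used when concatenating messages are all assumptions; the text only says that the message and the last four conversation messages are concatenated and hashed with SHA-256.

```python
import hashlib
from dataclasses import dataclass
from typing import Callable, Sequence


@dataclass
class RewrittenQuery:
    narrative: str  # used for vector search
    keyword: str    # used for BM25


def rewrite_cache_key(character_id: str, user_message: str,
                      history: Sequence[str]) -> str:
    """Build chat:qr:{character_id}:{hash} from the message and recent history."""
    recent = list(history[-4:])  # last four conversation messages
    payload = "\n".join([user_message, *recent])  # separator is an assumption
    digest = hashlib.sha256(payload.encode("utf-8")).hexdigest()
    return f"chat:qr:{character_id}:{digest}"


def rewrite_or_fallback(user_message: str,
                        rewriter: Callable[[str], RewrittenQuery]) -> RewrittenQuery:
    """Call the rewriter; on any failure use the raw message for both queries."""
    try:
        return rewriter(user_message)
    except Exception:
        return RewrittenQuery(narrative=user_message, keyword=user_message)
```

Only the last four history entries feed the hash, so older turns do not invalidate a cached rewrite.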

Step 4: Hybrid Retrieval

Two searches run in parallel:

Qdrant receives the narrative query, embeds it, and returns the top 10 chunks by cosine similarity.
TypeSense receives the keyword query and returns the top 10 chunks by BM25 score.

The two result sets are deduplicated by chunk_id with first-occurrence-wins, so a chunk returned by both stores is kept once, at the position of its first occurrence.
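A sketch of the parallel search and the merge, under stated assumptions: the two search callables are hypothetical async wrappers around Qdrant and TypeSense, chunks are plain dicts with a chunk_id key, and the Qdrant results are scanned first when merging. The text does not specify the merge order.

```python
import asyncio
from typing import Awaitable, Callable, List

Chunk = dict  # expected to carry at least a "chunk_id" key
SearchFn = Callable[[str], Awaitable[List[Chunk]]]


def dedupe_first_wins(*result_sets: List[Chunk]) -> List[Chunk]:
    """Merge result sets in order, keeping the first chunk seen per chunk_id."""
    seen = set()
    merged = []
    for results in result_sets:
        for chunk in results:
            if chunk["chunk_id"] not in seen:
                seen.add(chunk["chunk_id"])
                merged.append(chunk)
    return merged


async def hybrid_retrieve(vector_search: SearchFn, keyword_search: SearchFn,
                          narrative: str, keyword: str) -> List[Chunk]:
    """Run both searches concurrently, then deduplicate by chunk_id."""
    vector_hits, keyword_hits = await asyncio.gather(
        vector_search(narrative),
        keyword_search(keyword),
    )
    # Assumption: vector hits are scanned first, so they win ties.
    return dedupe_first_wins(vector_hits, keyword_hits)
```

With up to 10 hits from each store, the merged list holds between 10 and 20 chunks, depending on overlap.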

Step 5: Rerank

The deduplicated chunks are sent to the Jina Reranker v3 over HTTP. If no Jina API key is configured, the backend falls back to sorting by the chunks' raw scores from Qdrant and TypeSense.

The reranker returns a final ordering. The top chunks are formatted as numbered passages ([1] chunk text, [2] chunk text, ...) and become the "Relevant Passages" section of the system prompt.
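The fallback ordering and the passage formatting can be illustrated with a short sketch. The chunk fields (text, score) and the top_n default are assumptions; the text does not say how many chunks are kept.

```python
from typing import List


def fallback_order(chunks: List[dict]) -> List[dict]:
    """Order chunks by raw store score, highest first (used without a reranker)."""
    return sorted(chunks, key=lambda c: c["score"], reverse=True)


def format_passages(chunks: List[dict], top_n: int = 5) -> str:
    """Render the top chunks as numbered passages: [1] text, [2] text, ..."""
    lines = [f"[{i}] {c['text']}" for i, c in enumerate(chunks[:top_n], start=1)]
    return "\n".join(lines)
```

An empty chunk list yields an empty string, which matches the empty-context behavior when nothing is retrieved.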

If retrieval or reranking throws, the catch is silent: the backend substitutes an empty context string and continues the chat without grounding. The user sees a token stream either way.

Step 6: System Prompt Assembly

The full system prompt has three parts, in this order:

The character's persona prompt (written during ingestion).
The relevant passages section (empty if retrieval failed).
The session summary (regenerated every 10 messages; read from cache otherwise).
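One way the three parts might be joined, as a rough sketch. The function name, the "Session Summary" label, the blank-line separator, and the choice to omit an empty section entirely are assumptions; only the ordering and the "Relevant Passages" label come from the text.

```python
def build_system_prompt(persona: str, passages: str, summary: str) -> str:
    """Assemble persona, relevant passages, and session summary, in that order."""
    parts = [persona]
    if passages:  # empty when retrieval failed
        parts.append("Relevant Passages\n" + passages)
    if summary:  # assumption: a new session may have no summary yet
        parts.append("Session Summary\n" + summary)
    return "\n\n".join(parts)
```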

Step 7: Streaming

The LLM is called with streaming enabled. Each token emitted by the model is forwarded to the client as data: {"token":"..."}. The frontend appends each token to the in-progress assistant message.
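The token framing can be sketched as below. sse_event and token_events are hypothetical helper names; the payload shape follows the data: {"token":"..."} form shown above, and the trailing blank line is the standard SSE event terminator.

```python
import json
from typing import Iterable, Iterator


def sse_event(payload: dict) -> str:
    """Serialize a payload as one SSE data event (terminated by a blank line)."""
    body = json.dumps(payload, separators=(",", ":"))  # compact, no spaces
    return f"data: {body}\n\n"


def token_events(tokens: Iterable[str]) -> Iterator[str]:
    """Wrap each model token in its own SSE event, preserving order."""
    for token in tokens:
        yield sse_event({"token": token})
```

Using json.dumps for the payload means quotes and newlines inside a token are escaped, so a token cannot break the event framing.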

Step 8: Persistence and Lock Release

When the LLM stream completes:

The full user and assistant messages are inserted into the messages table.
The session's message_count and preview fields are updated in the chat_sessions table.
The Redis session cache is refreshed with the new last-10-messages and the updated summary.
If the message count crossed a multiple of 10, a background task regenerates the session summary.
The final SSE event data: {"done":true,"full_response":"..."} is emitted.
The per-session lock is released via a Lua CAS script that compares the stored token to the current value of the lock key, so a stale or wrong worker cannot release someone else's lock.
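A compare-and-delete release is commonly written as a short Lua script run through EVAL. The script below is a typical form of that pattern, not necessarily the one the backend uses; release_lock is a hypothetical name, and FakeRedis emulates only this one script so the example runs without a server. The eval(script, numkeys, *keys_and_args) call shape matches redis-py.

```python
# Delete the lock key only if it still holds this worker's token.
RELEASE_SCRIPT = """
if redis.call("GET", KEYS[1]) == ARGV[1] then
    return redis.call("DEL", KEYS[1])
end
return 0
"""


class FakeRedis:
    """Stand-in that emulates only the compare-and-delete script above."""

    def __init__(self) -> None:
        self.store: dict = {}

    def eval(self, script, numkeys, *keys_and_args):
        key, token = keys_and_args[0], keys_and_args[1]
        if self.store.get(key) == token:
            del self.store[key]
            return 1
        return 0


def release_lock(client, session_id: str, token: str) -> bool:
    """Return True only if this worker's token matched and the key was deleted."""
    result = client.eval(RELEASE_SCRIPT, 1, f"chat:lock:{session_id}", token)
    return result == 1
```

Running the comparison and the delete inside one script makes them atomic, which closes the window where the lock could expire and be re-acquired by another worker between a separate GET and DEL.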

If the LLM call fails mid-stream, a data: {"error":"...","code":502} event is emitted followed by done. Messages are not persisted in that case.
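The success and failure paths can be contrasted in one generator. This is a sketch under assumptions: stream_chat and the persist callback are hypothetical, and the shape of the done event after an error is assumed to be {"done":true}, since the text only shows the success form.

```python
import json
from typing import Callable, Iterable, Iterator


def _event(payload: dict) -> str:
    """Serialize a payload as one SSE data event."""
    return "data: " + json.dumps(payload, separators=(",", ":")) + "\n\n"


def stream_chat(tokens: Iterable[str],
                persist: Callable[[str], None]) -> Iterator[str]:
    """Forward tokens; persist only if the whole stream completes."""
    parts = []
    try:
        for token in tokens:
            parts.append(token)
            yield _event({"token": token})
    except Exception as exc:
        # Mid-stream failure: report it, signal done, and skip persistence.
        yield _event({"error": str(exc), "code": 502})
        yield _event({"done": True})  # assumed shape for the error case
        return
    full_response = "".join(parts)
    persist(full_response)  # hypothetical hook for the database writes
    yield _event({"done": True, "full_response": full_response})
```

Tokens already sent before the failure remain visible to the client, but nothing is written to the messages table.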
