Failure Isolation
What happens when a character, a chunk, an embedding, retrieval, query rewrite, or a lock acquisition fails. The backend is designed so one bad component never aborts an entire book or chat.
The backend is built to keep working when individual pieces fail. The principle is: catch the failure, log it, degrade, and continue. A bad character should not abort the rest of the book. A bad chunk should not block the rest of the chat session.
Ingestion
- Per-character failures. A failed system-prompt generation or Supabase insert for one character is caught, logged, and skipped. The other characters in the book continue. The book's `processing_status` is `completed` if at least the indexing step finished for the rest of the book.
- Per-chunk failures. A failed embed, Qdrant upsert, or TypeSense index for one chunk is counted. The chunk is logged and skipped. The book's final `processing_status` is `failed` if any chunk failed, but the chunks that did succeed remain in the stores.
- Embedding retries. Embedding calls retry up to three times on `BadRequestError` with delays of 30, 120, and 200 seconds. Other errors (connection failures, timeouts) skip retries: the chunk is logged and skipped, and the pipeline continues.
- Temp files. On success, the temp directory `/tmp/{book_id}/` is cleaned up. On failure it is preserved for debugging.
- Job manager crash. A crashed `JobManager` is restarted by the FastAPI lifespan. The job it was processing may be partially done: characters and chunks that were already persisted remain; the next run of the same book will recreate collections and re-upsert from the start.
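The per-chunk isolation and embedding retry rules can be sketched roughly as follows. This is a minimal illustration, not the backend's actual code: `embed_with_retry`, `index_chunks`, `final_status`, and the local `BadRequestError` class are hypothetical names, and the sketch assumes the first attempt does not count as one of the three retries.

```python
import logging
import time

logger = logging.getLogger("ingestion")

# Delays from the text: up to three retries, after 30, 120, and 200 seconds.
RETRY_DELAYS = (30, 120, 200)


class BadRequestError(Exception):
    """Hypothetical stand-in for the embedding client's BadRequestError."""


def embed_with_retry(embed, text, sleep=time.sleep):
    """Call embed(text), retrying only on BadRequestError.

    Any other exception (connection failure, timeout) propagates
    immediately, so the caller skips the chunk without retrying.
    """
    for delay in (*RETRY_DELAYS, None):
        try:
            return embed(text)
        except BadRequestError:
            if delay is None:
                raise  # all three retries used up
            logger.warning("embed rejected, retrying in %ss", delay)
            sleep(delay)


def index_chunks(chunks, embed, store, sleep=time.sleep):
    """Index every chunk and return (succeeded, failed) counts.

    A failure in one chunk is logged and counted; the loop continues.
    """
    ok = failed = 0
    for i, chunk in enumerate(chunks):
        try:
            vector = embed_with_retry(embed, chunk, sleep)
            store(chunk, vector)  # e.g. vector upsert plus keyword index
            ok += 1
        except Exception:
            logger.exception("chunk %d failed, skipping", i)
            failed += 1
    return ok, failed


def final_status(failed):
    """Any failed chunk marks the book failed; stored chunks are kept."""
    return "failed" if failed else "completed"
```

Passing `sleep` as a parameter is only a convenience for testing the retry schedule without waiting; a real pipeline would likely use an async sleep instead.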
Chat
- RAG retrieval failure. If Qdrant, TypeSense, or the reranker throws, the exception is caught silently. An empty context string is substituted and the chat continues. The user sees a token stream either way; the response may be less grounded.
- Query rewrite failure. If the rewriter LLM call fails, the raw user message is used as both the narrative and keyword queries. Retrieval still runs, just with a less specific query.
- Per-session lock held. If the lock is held by another worker, the backend emits a `429` SSE event followed by `done`, then closes. The frontend should retry the message.
- LLM call failure mid-stream. A `data: {"error":"...","code":502}` event is emitted followed by `done`. Messages are not persisted in that case. The frontend should surface the error to the user.
- Supabase write failure after a successful stream. The user has already seen the full response. The backend logs the error. The frontend will see the messages missing on reload; it should re-fetch and may need to retry.
- Session not found or user mismatch. HTTP 404 is returned before the SSE stream starts. The frontend should not retry.
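The two retrieval-side fallbacks and the error event format can be sketched as below. The function names (`rewrite_or_raw`, `retrieve_or_empty`, `error_event`) are hypothetical. The sketch also assumes the `429` event carries the same JSON shape as the `502` event, which the text does not state, and it leaves the wire format of `done` out because the text does not specify it.

```python
import asyncio
import json
import logging

logger = logging.getLogger("chat")


async def rewrite_or_raw(rewrite, message):
    """Return (narrative_query, keyword_query).

    If the rewriter fails, the raw user message serves as both queries.
    """
    try:
        return await rewrite(message)
    except Exception:
        logger.warning("query rewrite failed, using raw message", exc_info=True)
        return message, message


async def retrieve_or_empty(retrieve, narrative, keywords):
    """Return the retrieved context, or an empty string on any failure.

    The failure is swallowed silently, matching the behavior described
    for Qdrant, TypeSense, and reranker errors.
    """
    try:
        return await retrieve(narrative, keywords)
    except Exception:
        return ""


def error_event(message, code):
    """Format one SSE data line such as data: {"error":"...","code":502}."""
    payload = json.dumps({"error": message, "code": code}, separators=(",", ":"))
    return f"data: {payload}\n\n"
```

With these helpers, a handler would yield `error_event("session busy", 429)` or `error_event(str(exc), 502)` and then its `done` event before closing the stream.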
Lock Edge Cases
- Stale lock. The 30-second TTL bounds the lifetime of a stuck lock. If a worker crashes, the next request takes over after at most 30 seconds.
- Wrong-token release. The release path is a Lua CAS script. A worker can only release a lock that still holds its own hex token. A slow worker that has already given up cannot release a newer worker's lock.
- Redis pool exhaustion. If the 20-connection pool is exhausted, the 21st coroutine waits. There is no hard failure path; long waits just delay the response.
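The stale-lock and wrong-token cases can be illustrated with the usual Redis compare-and-delete pattern. `RELEASE_SCRIPT` below is the common form of that script and is only assumed to resemble the backend's own; `InMemoryLockStore` is a hypothetical stand-in for Redis, included so the semantics can be exercised without a server.

```python
import secrets
import time

# Common compare-and-delete script: delete the key only if it still holds
# the caller's token. Assumed to resemble the backend's release script.
RELEASE_SCRIPT = """
if redis.call("get", KEYS[1]) == ARGV[1] then
    return redis.call("del", KEYS[1])
else
    return 0
end
"""

LOCK_TTL_SECONDS = 30  # TTL stated in the text


class InMemoryLockStore:
    """Stand-in for Redis mimicking SET NX EX plus the script above."""

    def __init__(self, clock=time.monotonic):
        self._clock = clock
        self._locks = {}  # key -> (token, expires_at)

    def _live_token(self, key):
        """Return the current token, dropping the entry if it has expired."""
        entry = self._locks.get(key)
        if entry is None or entry[1] <= self._clock():
            self._locks.pop(key, None)
            return None
        return entry[0]

    def acquire(self, key, ttl=LOCK_TTL_SECONDS):
        """Return a fresh hex token, or None if the lock is currently held."""
        if self._live_token(key) is not None:
            return None
        token = secrets.token_hex(16)
        self._locks[key] = (token, self._clock() + ttl)
        return token

    def release(self, key, token):
        """Compare-and-delete: only the current holder can release."""
        if self._live_token(key) != token:
            return False
        del self._locks[key]
        return True
```

Against a real server with redis-py, acquisition would typically be `set(key, token, nx=True, ex=30)` and release would go through `register_script(RELEASE_SCRIPT)`, so the comparison and delete happen atomically on the Redis side.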
Why This Matters
The failure isolation choices are tuned for the property that matters most: a book with a few bad characters or a chunk that the embedder rejects should not require a full re-ingestion. A user typing into a chat box should always get either a stream or a clean error event, never a hung connection.