Concurrency Model

BookNLP singleton, JobManager capacity gate, per-session chat locks, Redis connection pool, and per-request async clients. How the backend keeps many concurrent requests from corrupting shared state.

The backend handles two kinds of concurrency: long-running heavy work (BookNLP, LLM calls) and many concurrent chat streams sharing Redis-backed state. Each kind has its own primitives.

BookNLP Singleton

BookNLP is a module-level singleton. The pipelines are loaded once during the FastAPI lifespan and reused across jobs. Two threading.Lock instances guard the singleton: _nlp_lock protects the model cache during initialization, and _process_lock serializes calls to nlp.process(). Together they ensure two worker threads cannot run BookNLP concurrently (its underlying libraries are not thread-safe).

Models are loaded eagerly in the lifespan handler. The first job does not pay the model-load cost.

JobManager Capacity Gate

The JobManager runs as a long-lived coroutine started in the FastAPI lifespan. It polls Redis with LPOP every 10 seconds. Before dispatching a job, it acquires an asyncio.Lock to atomically:

Check the active worker count.
If under MAX_PROCESSING_WORKERS, increment the count and reserve a slot.
Otherwise, leave the job in the queue and try again on the next poll.

The lock is released as soon as the worker is dispatched. The actual work runs in asyncio.to_thread (because BookNLP is synchronous), and a fresh asyncio event loop is spun up inside the thread for the index pipeline.
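The check-and-reserve step and the thread hand-off can be sketched like this. It is a simplified model under stated assumptions: CapacityGate, process_job, index_pipeline, and dispatch are hypothetical names, the Redis polling loop is omitted, and this dispatch awaits the job where the real manager presumably lets it run in the background.

```python
import asyncio


class CapacityGate:
    """Counts active workers; the asyncio.Lock makes check-and-reserve atomic."""

    def __init__(self, limit: int):
        self._lock = asyncio.Lock()
        self._limit = limit        # plays the role of MAX_PROCESSING_WORKERS
        self.active = 0

    async def try_reserve(self) -> bool:
        async with self._lock:
            if self.active >= self._limit:
                return False       # at capacity: caller retries on the next poll
            self.active += 1
            return True

    async def release(self) -> None:
        async with self._lock:
            self.active -= 1


async def index_pipeline(job_id: str) -> str:
    """Hypothetical async indexing step."""
    await asyncio.sleep(0)
    return f"indexed:{job_id}"


def process_job(job_id: str) -> str:
    """Synchronous worker body, meant to run inside asyncio.to_thread."""
    # The synchronous BookNLP work would run here.
    return asyncio.run(index_pipeline(job_id))   # fresh event loop in this thread


async def dispatch(gate: CapacityGate, job_id: str):
    """Run the job if a slot is free; return None when at capacity."""
    if not await gate.try_reserve():
        return None
    try:
        # The gate lock is already released here; only the slot stays reserved.
        return await asyncio.to_thread(process_job, job_id)
    finally:
        await gate.release()
```

The lock is held only for the counter update, so a long-running job never blocks the capacity check for the next one.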

This gives at most MAX_PROCESSING_WORKERS concurrent BookNLP jobs, even if hundreds of books are queued. The queue is FIFO, so jobs start in submission order, subject to the capacity gate.

Per-Session Chat Locks

A per-session write lock lives in Redis at chat:lock:{session_id}. The endpoint acquires it with SET {key} {hex_token} NX EX 30. The 30-second TTL bounds the lifetime of a stuck lock: if a worker crashes, the next request can take over after at most 30 seconds.

The lock is released by a Lua compare-and-delete script: it compares the token supplied by the caller to the value stored at the lock key and deletes the key only if they match. Only the worker that wrote the token can release the lock. This prevents a slow worker whose lock has already expired from accidentally releasing a newer worker's lock.
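A minimal sketch of the acquire and release steps, assuming client is an async Redis client with redis-py style set(..., nx=True, ex=...) and eval(script, numkeys, *keys_and_args) methods. The function names and the exact Lua text are illustrative assumptions, not the project's code.

```python
import secrets

LOCK_TTL_SECONDS = 30

# Delete the key only if it still holds the caller's token.
RELEASE_SCRIPT = """
if redis.call('GET', KEYS[1]) == ARGV[1] then
    return redis.call('DEL', KEYS[1])
end
return 0
"""


async def acquire_chat_lock(client, session_id: str):
    """Return a token if the lock was taken, or None if another worker holds it."""
    token = secrets.token_hex(16)
    # Equivalent to: SET chat:lock:{session_id} {token} NX EX 30
    ok = await client.set(
        f"chat:lock:{session_id}", token, nx=True, ex=LOCK_TTL_SECONDS
    )
    return token if ok else None


async def release_chat_lock(client, session_id: str, token: str) -> bool:
    """Release the lock only if this caller's token is still stored in it."""
    deleted = await client.eval(
        RELEASE_SCRIPT, 1, f"chat:lock:{session_id}", token
    )
    return deleted == 1
```

Because the comparison and the delete happen inside one script, no other client can take the lock between the check and the removal.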

Redis Connection Pool

Redis is accessed through a process-wide connection pool with a maximum of 20 connections. The pool is shared by the chat endpoint, the JobManager, and the various caches. If more than 20 coroutines try to use Redis at the same time, the 21st waits for a connection.
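The waiting behavior can be modeled with a toy pool built on asyncio.Semaphore. BoundedPool is a hypothetical illustration of the queueing effect only and does not talk to Redis. If the client library is redis-py (an assumption), waiting at the limit corresponds to its BlockingConnectionPool; its plain ConnectionPool raises an error at the limit instead.

```python
import asyncio


class BoundedPool:
    """Toy model of a pool that makes callers wait when all slots are in use."""

    def __init__(self, max_connections: int = 20):
        self._slots = asyncio.Semaphore(max_connections)
        self.in_use = 0
        self.peak = 0              # highest number of simultaneous users seen

    async def run(self, operation):
        """Run an async operation while holding one pool slot."""
        async with self._slots:    # the (max + 1)th caller waits here
            self.in_use += 1
            self.peak = max(self.peak, self.in_use)
            try:
                return await operation()
            finally:
                self.in_use -= 1
```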

Per-Request Async Clients

Qdrant, TypeSense, and Supabase are not pooled at the process level. Clients are created per-call: QdrantWrapper and TypeSenseWrapper are instantiated inside each RAG retrieval or pipeline function and closed after use. Supabase reads and writes use a fresh httpx.AsyncClient per operation. This avoids stale-connection issues and keeps client lifecycles simple.
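The create-use-close lifecycle can be sketched with an async context manager. StubSearchClient, search_client, and retrieve are hypothetical names standing in for a wrapper such as QdrantWrapper; the real wrappers' methods are not shown in this document.

```python
import asyncio
from contextlib import asynccontextmanager


class StubSearchClient:
    """Hypothetical stand-in for a per-call client wrapper."""

    def __init__(self):
        self.closed = False

    async def search(self, query: str) -> list:
        return [f"hit for {query}"]

    async def close(self) -> None:
        self.closed = True


@asynccontextmanager
async def search_client():
    """Create a client for one call and always close it afterwards."""
    client = StubSearchClient()
    try:
        yield client
    finally:
        await client.close()


async def retrieve(query: str) -> list:
    # A fresh client per retrieval: nothing is shared between requests.
    async with search_client() as client:
        return await client.search(query)
```

The trade-off is connection setup on every call in exchange for having no shared client state to guard.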

What This Means in Practice

Many chat streams can run at the same time. They are bounded only by the SSE fan-out, not by the backend's process state. Each session is serialized by its Redis lock, but different sessions are independent.
Ingestion is a background loop with a fixed concurrency cap. The frontend does not block on ingestion; it polls the books.processing_status field.
A single BookNLP run holds the BookNLP process lock (_process_lock). If another ingestion job is dispatched while BookNLP is busy, it begins its BookNLP phase only after the current job finishes its own (the index pipeline runs in parallel within the same job).
Per-chunk and per-character failures never block other chunks or characters in the same job.
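One common way to get this kind of failure isolation is asyncio.gather with return_exceptions=True. The sketch below illustrates that technique and is not necessarily how the backend implements it; process_chunk and process_all are hypothetical names.

```python
import asyncio


async def process_chunk(chunk: str) -> str:
    """Hypothetical per-chunk step that may fail."""
    if not chunk:
        raise ValueError("empty chunk")
    return chunk.upper()


async def process_all(chunks: list):
    """Process every chunk; collect failures instead of aborting the job."""
    results = await asyncio.gather(
        *(process_chunk(c) for c in chunks), return_exceptions=True
    )
    succeeded = [r for r in results if not isinstance(r, Exception)]
    failed = [(c, r) for c, r in zip(chunks, results) if isinstance(r, Exception)]
    return succeeded, failed
```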