Tech Stack

The runtime and infrastructure choices that shape the backend: Python 3.11, FastAPI, Redis, Qdrant, Typesense, BookNLP, multi-provider LLMs and embeddings, and a Jina reranker.

The backend is a Python 3.11 FastAPI service. Every other choice (queue, vector store, search engine, LLM provider, reranker) is pluggable at the boundary, but the runtime itself is fixed.

Runtime

Language: Python 3.11
Web framework: FastAPI on Uvicorn
Container: python:3.11-slim, exposed on port 8001, joined to an external Traefik network
Streaming: Server-Sent Events via StreamingResponse
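
The streaming line above can be illustrated with a minimal sketch of SSE framing. The helper names (sse_event, stream_tokens) and the payload shape are assumptions made for illustration, not the backend's actual code; only the wire format and the use of StreamingResponse come from the stack described here.

```python
import asyncio
import json
from typing import AsyncIterator, Iterable, Optional


def sse_event(data: dict, event: Optional[str] = None) -> str:
    """Format one SSE frame: optional event name, one data line, blank-line terminator."""
    lines = []
    if event is not None:
        lines.append(f"event: {event}")
    # json.dumps emits a single line by default, so one "data:" field is enough.
    lines.append(f"data: {json.dumps(data)}")
    return "\n".join(lines) + "\n\n"


async def stream_tokens(tokens: Iterable[str]) -> AsyncIterator[str]:
    """Yield one frame per token, then a final 'done' event (hypothetical event name)."""
    for token in tokens:
        yield sse_event({"token": token})
        await asyncio.sleep(0)  # hand control back to the event loop between frames
    yield sse_event({}, event="done")


# In a FastAPI handler the generator would be returned as:
#   StreamingResponse(stream_tokens(tokens), media_type="text/event-stream")
```

Keeping the framing in a plain async generator means it can be exercised without starting a server.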

Storage and Search

Job queue: Redis FIFO list, driven by RPUSH (enqueue) and LPOP (dequeue). Also used for caches and per-session locks.
Vector search: Qdrant. One collection per book_id, async client. Cosine distance with a configurable EMBEDDING_DIMENSION.
Full-text search: Typesense. One collection per book_id, BM25 scoring.
Relational store: Supabase via the REST API, accessed with the service-role key.
System prompt backup: local filesystem under SYSTEM_PROMPT_STORAGE_PATH/{book_id}/{safe_name}.md.
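
As a rough sketch of the job queue entry above: RPUSH appends to the tail of a list and LPOP removes from the head, which gives first-in, first-out order. The key name, the JSON payload, and the InMemoryList stand-in are assumptions made so the example runs without a Redis server; a redis-py client exposes rpush and lpop methods with the same call shape.

```python
import json
from collections import deque
from typing import Optional

QUEUE_KEY = "jobs:pending"  # hypothetical key name


class InMemoryList:
    """Stand-in exposing only the two list commands used below."""

    def __init__(self) -> None:
        self._lists = {}

    def rpush(self, key, *values) -> int:
        queue = self._lists.setdefault(key, deque())
        queue.extend(values)  # append at the tail
        return len(queue)

    def lpop(self, key):
        queue = self._lists.get(key)
        return queue.popleft() if queue else None  # remove from the head


def enqueue(client, job: dict) -> int:
    """Serialize the job and push it to the tail; returns the new queue length."""
    return client.rpush(QUEUE_KEY, json.dumps(job))


def dequeue(client) -> Optional[dict]:
    """Pop the oldest job from the head, or None when the queue is empty."""
    raw = client.lpop(QUEUE_KEY)
    return None if raw is None else json.loads(raw)
```

Because both functions take the client as an argument, the same code can be pointed at a real Redis connection or at the stand-in in tests.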

ML and LLM

Character extraction: BookNLP. Produces entity, quote, coreference, supersense, and event output as JSON. It is synchronous, so it runs in a worker thread.
LLM providers: OpenAI, Anthropic, and Ollama (local or cloud). Selected at runtime by LLM_PROVIDER. A unified async dispatcher in utils/chat/chat_llm.py abstracts them.
Embedding providers: OpenAI, Cohere, and Jina. Selected at runtime by EMBEDDING_PROVIDER. Same async-dispatcher pattern.
Reranker: Jina Reranker v3 over HTTP. Falls back to score-sort if no API key is configured.
Token counting: tiktoken with o200k_base. Used for chunking.
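
The reranker fallback described above might look something like the following. This is a sketch under stated assumptions: each hit is taken to be a dict with "text" and "score" fields, and call_reranker is a hypothetical callable that returns one relevance score per document in input order; the real HTTP call to the Jina API is not shown.

```python
from typing import Callable, List, Optional, Sequence


def rerank(
    query: str,
    hits: Sequence[dict],
    api_key: Optional[str],
    call_reranker: Optional[Callable[[str, List[str]], List[float]]] = None,
    top_k: int = 5,
) -> List[dict]:
    """Reorder hits with the reranker when available, else by retrieval score."""
    if not api_key or call_reranker is None:
        # Fallback path: no key configured, so sort by the original score.
        return sorted(hits, key=lambda h: h["score"], reverse=True)[:top_k]

    # Reranker path: one relevance score per document, same order as the input.
    scores = call_reranker(query, [h["text"] for h in hits])
    ranked = sorted(zip(hits, scores), key=lambda pair: pair[1], reverse=True)
    return [hit for hit, _ in ranked[:top_k]]
```

Both paths return the same shape, so callers do not need to know which one ran.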

Concurrency Primitives

Event loop: asyncio for the FastAPI request handlers and the job manager.
Threading: asyncio.to_thread for blocking BookNLP work; threading.Lock for the BookNLP singleton.
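In-process locks: asyncio.Lock inside the JobManager, so that only one coroutine at a time checks capacity and pops the queue.
Redis locks: SET NX with a hex token, released by a Lua compare-and-delete script so that only the token's holder can unlock. Used for per-session write locks.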
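
To make the threading entry concrete, here is a minimal sketch of a lazily built singleton guarded by threading.Lock, with the blocking call moved off the event loop by asyncio.to_thread. SlowModel and the function names are hypothetical stand-ins; the real BookNLP pipeline and its loading code are not shown.

```python
import asyncio
import threading

_model = None
_model_lock = threading.Lock()
stats = {"loads": 0}  # counts constructions, only to show the singleton holds


class SlowModel:
    """Hypothetical stand-in for a heavyweight, blocking NLP pipeline."""

    def process(self, text: str) -> int:
        return len(text.split())


def get_model() -> SlowModel:
    """Build the model once; the lock stops two threads racing to construct it."""
    global _model
    if _model is None:
        with _model_lock:
            if _model is None:  # re-check after acquiring the lock
                stats["loads"] += 1
                _model = SlowModel()
    return _model


def extract_blocking(text: str) -> int:
    """Synchronous work that would stall the event loop if called directly."""
    return get_model().process(text)


async def extract(text: str) -> int:
    """Run the blocking call in a worker thread and await its result."""
    return await asyncio.to_thread(extract_blocking, text)


async def main() -> list:
    return await asyncio.gather(extract("a b c"), extract("d e"))
```

Calling asyncio.run(main()) runs both extractions in worker threads while the model is constructed only once.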

HTTP and Utilities

HTTP client: httpx (sync and async). Used to download book files and call the Jina reranker.
Metrics: prometheus-client with a custom MetricsMiddleware that auto-instruments every request.

Why These Choices

FastAPI + asyncio. Streaming chat responses map naturally to SSE and StreamingResponse. The job manager and request handlers share the same event loop.
Redis as a queue and a cache. This avoids a second piece of infrastructure for queueing. Caches and locks live in the same place as the queue.
Qdrant and Typesense as separate stores. They are complements, not substitutes. Vector search handles narrative, semantic queries; BM25 handles exact keywords (names, places, distinctive phrases). Hybrid retrieval combines both.
Provider-agnostic LLM and embedding. The runtime picks a provider from environment variables, so the same code runs against OpenAI in production and Ollama in development.
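
One way to picture the provider-agnostic dispatch is a registry of async handlers keyed by the value of LLM_PROVIDER. This is a sketch, not the contents of utils/chat/chat_llm.py: the handler functions are stubs that return labelled strings instead of calling any API, and the "openai" default is an assumption.

```python
import asyncio
import os
from typing import Awaitable, Callable, Dict, Optional


# Stub handlers: a real implementation would call each provider's client here.
async def _openai_chat(prompt: str) -> str:
    return f"[openai] {prompt}"


async def _anthropic_chat(prompt: str) -> str:
    return f"[anthropic] {prompt}"


async def _ollama_chat(prompt: str) -> str:
    return f"[ollama] {prompt}"


PROVIDERS: Dict[str, Callable[[str], Awaitable[str]]] = {
    "openai": _openai_chat,
    "anthropic": _anthropic_chat,
    "ollama": _ollama_chat,
}


async def chat(prompt: str, provider: Optional[str] = None) -> str:
    """Dispatch to the handler named by the argument or by LLM_PROVIDER."""
    name = (provider or os.environ.get("LLM_PROVIDER", "openai")).lower()
    try:
        handler = PROVIDERS[name]
    except KeyError:
        raise ValueError(f"unknown LLM_PROVIDER: {name!r}") from None
    return await handler(prompt)
```

Callers only ever see chat(), so switching from OpenAI to Ollama is a matter of changing one environment variable.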