Tech Stack
The runtime and infrastructure choices that shape the backend: Python 3.11, FastAPI, Redis, Qdrant, TypeSense, BookNLP, multi-provider LLM and embedding, and the Jina reranker.
The backend is a Python 3.11 FastAPI service. Every other choice (queue, vector store, search engine, LLM provider, reranker) is pluggable at the boundary, but the runtime itself is fixed.
Runtime
- Language: Python 3.11
- Web framework: FastAPI on Uvicorn
- Container: `python:3.11-slim`, exposed on port 8001, joined to an external Traefik network
- Streaming: Server-Sent Events via `StreamingResponse`
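As a rough illustration of the streaming item above, the sketch below formats Server-Sent Events frames from an async generator. The helper names (`sse_frame`, `token_stream`) and the `done` event are hypothetical and not taken from the backend. In a FastAPI handler, a generator like this would typically be returned as `StreamingResponse(token_stream(tokens), media_type="text/event-stream")`.

```python
import asyncio
import json
from typing import AsyncIterator, Iterable, List, Optional


def sse_frame(data: dict, event: Optional[str] = None) -> str:
    """Format one SSE frame: optional event name, a data line, then a blank line."""
    lines = []
    if event is not None:
        lines.append(f"event: {event}")
    lines.append(f"data: {json.dumps(data)}")
    return "\n".join(lines) + "\n\n"


async def token_stream(tokens: Iterable[str]) -> AsyncIterator[str]:
    """Yield one frame per token, then a final 'done' event (hypothetical convention)."""
    for token in tokens:
        yield sse_frame({"token": token})
        await asyncio.sleep(0)  # hand control back to the event loop between frames
    yield sse_frame({}, event="done")


async def collect(tokens: Iterable[str]) -> List[str]:
    """Drain the stream into a list so the frames can be inspected."""
    return [frame async for frame in token_stream(tokens)]


frames = asyncio.run(collect(["Hello", "world"]))
```

Each frame ends with a blank line, which is what tells an SSE client that the event is complete.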
Storage and Search
- Job queue: Redis FIFO list, driven by `RPUSH` (enqueue) and `LPOP` (dequeue). Also used for caches and per-session locks.
- Vector search: Qdrant. One collection per `book_id`, async client. Cosine distance with a configurable `EMBEDDING_DIMENSION`.
- Full-text search: TypeSense. One collection per `book_id`, BM25 scoring.
- Relational store: Supabase via the REST API, accessed with the service-role key.
- System prompt backup: local filesystem under `SYSTEM_PROMPT_STORAGE_PATH/{book_id}/{safe_name}.md`.
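The FIFO behaviour of the job queue follows from pushing at the tail and popping from the head. The sketch below mimics those two operations with a hypothetical in-memory stand-in (`InMemoryList`) so it runs without a Redis server; it is not the backend's queue code. With a real client such as redis-py, the corresponding calls would be `rpush(name, value)` and `lpop(name)`.

```python
from collections import deque
from typing import Optional


class InMemoryList:
    """Hypothetical stand-in for one Redis list; mirrors RPUSH/LPOP semantics only."""

    def __init__(self) -> None:
        self._items = deque()

    def rpush(self, value: str) -> int:
        # RPUSH appends at the tail and returns the new list length.
        self._items.append(value)
        return len(self._items)

    def lpop(self) -> Optional[str]:
        # LPOP removes from the head; an empty list yields nil (None here).
        return self._items.popleft() if self._items else None


queue = InMemoryList()
for job_id in ["job-1", "job-2", "job-3"]:
    queue.rpush(job_id)

# Jobs come out in the order they went in.
drained = [queue.lpop(), queue.lpop(), queue.lpop()]
empty = queue.lpop()
```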
ML and LLM
- Character extraction: BookNLP2. Produces entity, quote, coref, supersense, and event JSON. Synchronous, runs in a worker thread.
- LLM providers: OpenAI, Anthropic, and Ollama (local or cloud). Selected at runtime by `LLM_PROVIDER`. A unified async dispatcher in `utils/chat/chat_llm.py` abstracts them.
- Embedding providers: OpenAI, Cohere, and Jina. Selected at runtime by `EMBEDDING_PROVIDER`. Same async-dispatcher pattern.
- Reranker: Jina Reranker v3 over HTTP. Falls back to score-sort if no API key is configured.
- Token counting: `tiktoken` with `o200k_base`. Used for chunking.
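The provider selection described in the list above can be pictured as a lookup table keyed by the `LLM_PROVIDER` environment variable. The following is a minimal sketch under stated assumptions: the stub functions, the `PROVIDERS` table, the `chat` signature, and the `openai` default are all hypothetical, and the real dispatcher in `utils/chat/chat_llm.py` may be organised differently.

```python
import asyncio
import os
from typing import Awaitable, Callable, Dict


# Hypothetical provider stubs; real ones would call each vendor's SDK or HTTP API.
async def _openai_chat(prompt: str) -> str:
    return f"[openai] {prompt}"


async def _anthropic_chat(prompt: str) -> str:
    return f"[anthropic] {prompt}"


async def _ollama_chat(prompt: str) -> str:
    return f"[ollama] {prompt}"


ChatFn = Callable[[str], Awaitable[str]]

PROVIDERS: Dict[str, ChatFn] = {
    "openai": _openai_chat,
    "anthropic": _anthropic_chat,
    "ollama": _ollama_chat,
}


async def chat(prompt: str) -> str:
    """Route the prompt to whichever provider LLM_PROVIDER names."""
    # The "openai" default is an assumption made for this sketch.
    name = os.environ.get("LLM_PROVIDER", "openai").lower()
    handler = PROVIDERS.get(name)
    if handler is None:
        raise ValueError(f"Unsupported LLM_PROVIDER: {name!r}")
    return await handler(prompt)


os.environ["LLM_PROVIDER"] = "ollama"
reply = asyncio.run(chat("Who is Ishmael?"))
```

Reading the variable on each call, rather than once at import time, keeps the sketch easy to test; a production dispatcher might resolve the provider once at startup instead.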
Concurrency Primitives
- Event loop: `asyncio` for the FastAPI request handlers and the job manager.
- Threading: `asyncio.to_thread` for blocking BookNLP work; `threading.Lock` for the BookNLP singleton.
- Process locks: `asyncio.Lock` inside the JobManager for capacity-gated queue popping.
- Redis locks: `SET NX` with a hex token, released by a Lua CAS script, for per-session write locks.
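To make the threading item above concrete, here is a small sketch of the two pieces working together: a lock-guarded singleton for an expensive model and `asyncio.to_thread` to keep blocking work off the event loop. The model is a stub and every name (`get_model`, `extract_characters`, `load_count`) is hypothetical; nothing here calls BookNLP itself.

```python
import asyncio
import threading
import time

_model = None
_model_lock = threading.Lock()
load_count = 0  # counts how many times the expensive load actually ran


def get_model() -> dict:
    """Return the shared model, loading it at most once across threads."""
    global _model, load_count
    if _model is None:              # fast path once the model is loaded
        with _model_lock:
            if _model is None:      # re-check: another thread may have loaded it
                time.sleep(0.05)    # stands in for a slow model load
                load_count += 1
                _model = {"name": "stub-model"}
    return _model


def extract_characters(text: str) -> list:
    """Blocking work that should not run on the event loop."""
    get_model()
    # Toy heuristic for the sketch: treat capitalised words as character names.
    return sorted({word for word in text.split() if word.istitle()})


async def main() -> list:
    # Each call runs in a worker thread, so the event loop stays responsive.
    return await asyncio.gather(
        asyncio.to_thread(extract_characters, "Ishmael met Ahab and Queequeg"),
        asyncio.to_thread(extract_characters, "Ahab chased the whale"),
    )


results = asyncio.run(main())
```

The second check inside the lock is what prevents two worker threads that both saw `_model is None` from loading the model twice.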
HTTP and Utilities
- HTTP client: `httpx` (sync and async). Used to download book files and call the Jina reranker.
- Metrics: `prometheus-client` with a custom `MetricsMiddleware` that auto-instruments every request.
Why These Choices
- FastAPI + asyncio. Streaming chat responses map naturally to SSE and `StreamingResponse`. The job manager and request handlers share the same event loop.
- Redis as a queue and a cache. This avoids a second piece of infrastructure for queueing. Caches and locks live in the same place as the queue.
- Qdrant and TypeSense as separate stores. They are not substitutes. Vector search is good for narrative, semantic queries; BM25 is good for exact keywords (names, places, distinctive phrases). Hybrid retrieval combines both.
- Provider-agnostic LLM and embedding. The runtime picks a provider from env, so the same code runs against OpenAI in production and Ollama in development.
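The text above says hybrid retrieval combines vector and keyword results but does not say how. One common way to merge two ranked lists is reciprocal rank fusion (RRF), sketched below purely as an illustration; it is a swapped-in technique, not necessarily what the backend does. The function name, the chunk IDs, and the constant `k = 60` are assumptions.

```python
from collections import defaultdict
from typing import Dict, List, Sequence


def reciprocal_rank_fusion(rankings: Sequence[Sequence[str]], k: int = 60) -> List[str]:
    """Merge ranked ID lists: each list contributes 1 / (k + rank) per document."""
    scores: Dict[str, float] = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    # Highest fused score first; ties are broken by ID for a stable order.
    return sorted(scores, key=lambda doc_id: (-scores[doc_id], doc_id))


vector_hits = ["c3", "c1", "c7"]   # hypothetical chunk IDs from semantic search
keyword_hits = ["c1", "c9", "c3"]  # hypothetical chunk IDs from BM25 search
fused = reciprocal_rank_fusion([vector_hits, keyword_hits])
```

Chunks that appear in both lists (`c1`, `c3`) rise above chunks that appear in only one, which is the behaviour a hybrid retriever is after.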