Tech Stack

The runtime and infrastructure choices that shape the backend: Python 3.11, FastAPI, Redis, Qdrant, TypeSense, BookNLP, multi-provider LLM and embedding, and the Jina reranker.

The backend is a Python 3.11 FastAPI service. Every other choice (queue, vector store, search engine, LLM provider, reranker) is pluggable at the boundary, but the runtime itself is fixed.

Runtime

  • Language: Python 3.11
  • Web framework: FastAPI on Uvicorn
  • Container: python:3.11-slim, exposed on port 8001, joined to an external Traefik network
  • Streaming: Server-Sent Events via StreamingResponse
  • Job queue: Redis FIFO list, driven by RPUSH (enqueue) and LPOP (dequeue). Also used for caches and per-session locks.
  • Vector search: Qdrant. One collection per book_id, async client. Cosine distance with a configurable EMBEDDING_DIMENSION.
  • Full-text search: TypeSense. One collection per book_id, BM25 scoring.
  • Relational store: Supabase via the REST API, accessed with the service-role key.
  • System prompt backup: local filesystem under SYSTEM_PROMPT_STORAGE_PATH/{book_id}/{safe_name}.md.

ML and LLM

  • Character extraction: BookNLP. Produces entity, quote, coref, supersense, and event JSON. Synchronous, runs in a worker thread.
  • LLM providers: OpenAI, Anthropic, and Ollama (local or cloud). Selected at runtime by LLM_PROVIDER. A unified async dispatcher in utils/chat/chat_llm.py abstracts them.
  • Embedding providers: OpenAI, Cohere, and Jina. Selected at runtime by EMBEDDING_PROVIDER. Same async-dispatcher pattern.
  • Reranker: Jina Reranker v3 over HTTP. If no API key is configured, it falls back to sorting candidates by their existing retrieval score.
  • Token counting: tiktoken with o200k_base. Used for chunking.
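The reranker fallback described above can be sketched as follows. This is an assumption-laden illustration, not the backend's implementation: the `Hit` type, the `rerank` signature, and the injected `remote_rerank` callable (standing in for the HTTP call to Jina) are all hypothetical.

```python
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass
class Hit:
    text: str
    score: float  # retrieval score from vector or full-text search


def rerank(
    query: str,
    hits: list[Hit],
    top_k: int,
    api_key: Optional[str] = None,
    remote_rerank: Optional[Callable[[str, list[Hit]], list[Hit]]] = None,
) -> list[Hit]:
    """Use the remote reranker when a key is configured, else sort by score."""
    if api_key and remote_rerank is not None:
        # Hypothetical hook: the real call would POST to the reranker over HTTP.
        return remote_rerank(query, hits)[:top_k]
    # Fallback: keep the highest-scoring hits in descending score order.
    return sorted(hits, key=lambda h: h.score, reverse=True)[:top_k]


if __name__ == "__main__":
    candidates = [Hit("a", 0.2), Hit("b", 0.9), Hit("c", 0.5)]
    print([h.text for h in rerank("who is the narrator?", candidates, top_k=2)])
```

Because the fallback returns the same shape as the remote path, callers do not need to know which one ran.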

Concurrency Primitives

  • Event loop: asyncio for the FastAPI request handlers and the job manager.
  • Threading: asyncio.to_thread for blocking BookNLP work; threading.Lock for the BookNLP singleton.
  • In-process locks: asyncio.Lock inside the JobManager for capacity-gated queue popping.
  • Redis locks: SET NX with a hex token, released by a Lua CAS script, for per-session write locks.

HTTP and Utilities

  • HTTP client: httpx (sync and async). Used to download book files and call the Jina reranker.
  • Metrics: prometheus-client with a custom MetricsMiddleware that auto-instruments every request.

Why These Choices

  • FastAPI + asyncio. Streaming chat responses map naturally to SSE and StreamingResponse. The job manager and request handlers share the same event loop.
  • Redis as a queue and a cache. This avoids a second piece of infrastructure for queueing. Caches and locks live in the same place as the queue.
  • Qdrant and TypeSense as separate stores. They are not substitutes. Vector search is good for narrative, semantic queries; BM25 is good for exact keywords (names, places, distinctive phrases). Hybrid retrieval combines both.
  • Provider-agnostic LLM and embedding. The runtime picks a provider from env, so the same code runs against OpenAI in production and Ollama in development.
