Why real multi-tenancy in RAG is hard
Most LMS platforms with an "AI tutor" rely on problematic approaches:
- Shared global RAG: all tenants see the same knowledge base. It works, but it breaks compliance requirements and hurts pedagogical quality.
- "Per-tenant" via metadata filter without enforcement: chunks have a tenant_id field, but the filter is optional in the query. One bug in one endpoint = data leak.
- Separate vector DB per tenant: brutal operational overhead. A thousand tenants = a thousand vector DBs.
Studeia solved this with 3 architectural invariants.
Invariant 1: mandatory tenantId+courseId filter
Every pgvector query in Studeia MUST go through packages/core/src/ai/rag.ts:
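import { Prisma } from '@prisma/client';

export async function retrieve(params: RetrieveParams) {
  if (!params.tenantId && !params.allowGlobal) {
    throw new Error('tenantId required unless allowGlobal=true');
  }
  // vectorStr is the query embedding serialized as a pgvector literal,
  // e.g. '[0.12,0.03,...]'; it is computed before this point (not shown).

  // Collect conditions in a list so the query stays valid SQL on the
  // allowGlobal path, where no tenant condition is added.
  const conditions = [Prisma.sql`1 - (ce.embedding <=> ${vectorStr}::vector) > 0.5`];
  if (params.tenantId) {
    conditions.push(Prisma.sql`ce.tenant_id = ${params.tenantId}`);
    if (params.courseId) {
      conditions.push(Prisma.sql`ce.course_id = ${params.courseId}`);
    }
  }
  return prisma.$queryRaw`
    SELECT ce.*, 1 - (ce.embedding <=> ${vectorStr}::vector) AS similarity
    FROM content_embeddings ce
    WHERE ${Prisma.join(conditions, ' AND ')}
    ORDER BY similarity DESC LIMIT 10
  `;
}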
allowGlobal is only true on explicit administrative routes (a global admin testing RAG coverage). Every other caller that omits tenantId gets the error above.
A critical project rule (rule 6 in CLAUDE.md): "Tenant isolation: all B2B queries filter by tenantId." Automated auditing via Vitest tests verifies that every call to retrieve() in application code passes tenantId.
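The article does not show the audit itself, so the following is only a rough sketch of how such a check could work. findUnscopedRetrieveCalls is a hypothetical helper, not Studeia's actual test: it scans source text for retrieve({ ... }) calls and reports the line numbers of any whose argument mentions neither tenantId nor allowGlobal.

```typescript
// Hypothetical audit helper (an assumption, not the project's real test).
// Returns 1-based line numbers of retrieve({...}) calls that mention
// neither tenantId nor allowGlobal inside their argument object.
function findUnscopedRetrieveCalls(source: string): number[] {
  const offenders: number[] = [];
  const callPattern = /\bretrieve\s*\(\s*\{/g;
  let match: RegExpExecArray | null;

  while ((match = callPattern.exec(source)) !== null) {
    // The match ends on the opening brace of the argument object.
    const open = match.index + match[0].length - 1;

    // Walk forward to the matching closing brace, tracking nesting depth.
    let depth = 0;
    let end = open;
    for (; end < source.length; end++) {
      if (source[end] === '{') depth++;
      else if (source[end] === '}' && --depth === 0) break;
    }

    const args = source.slice(open, end + 1);
    if (!/\b(tenantId|allowGlobal)\b/.test(args)) {
      offenders.push(source.slice(0, match.index).split('\n').length);
    }
  }
  return offenders;
}
```

A Vitest test could read each application file and assert that the returned array is empty. The scan is deliberately crude: it ignores braces inside strings or comments and cannot see calls that pass a prebuilt variable, so a walk over the TypeScript AST would be the more robust choice.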
Invariant 2: tenantOnlyMode in RetrievalAgent
Even with the correct filter in place, there are cases where a fallback is desired (e.g., B2C without a tenant). To guarantee B2B NEVER leaks:
const chunks = await retrieve({
query,
filters: { tenantId, courseId },
tenantOnlyMode: true, // <-- CRITICAL
});
tenantOnlyMode: true means: if no chunks exist for the tenant, return empty — do not search the global index. The tutor responds "I don't have material on that in your course" instead of hallucinating.
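The behavior described above can be sketched as a small wrapper. Everything here is illustrative: retrieveScoped, SearchFn, and the Chunk shape are assumed names, and the search function is injected so the logic can be shown without a database.

```typescript
// Illustrative types; the real chunk shape is not shown in the article.
interface Chunk { id: string; text: string; tenantId: string | null }
type SearchFn = (
  query: string,
  scope: { tenantId?: string; courseId?: string },
) => Promise<Chunk[]>;

interface ScopedRetrieveOptions {
  query: string;
  tenantId: string;
  courseId?: string;
  tenantOnlyMode: boolean;
}

async function retrieveScoped(
  search: SearchFn,
  opts: ScopedRetrieveOptions,
): Promise<Chunk[]> {
  const tenantChunks = await search(opts.query, {
    tenantId: opts.tenantId,
    courseId: opts.courseId,
  });

  // In tenant-only mode an empty result is returned as-is:
  // the global index is never consulted.
  if (tenantChunks.length > 0 || opts.tenantOnlyMode) {
    return tenantChunks;
  }

  // Fallback path (e.g. B2C without a tenant): search with no tenant scope.
  return search(opts.query, {});
}
```

With tenantOnlyMode set, the caller gets an empty array and can answer "I don't have material on that in your course" itself, rather than letting the fallback decide.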
Invariant 3: PostgreSQL RLS as a safety net
Supabase RLS policies add a defense-in-depth layer:
CREATE POLICY tenant_isolation_content_embeddings
ON content_embeddings
FOR SELECT
USING (tenant_id = current_setting('app.current_tenant_id')::uuid);
If a bug in application code forgets the filter, RLS still restricts the rows the query can see to the current tenant.
There is a production cost: every Postgres query evaluates the policy. But the added latency is ~2–5ms, which is acceptable.
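The policy reads app.current_tenant_id, so something has to set it for each request. The article does not show that step. A minimal sketch, assuming a generic transaction interface (Db, Tx, and withTenant are hypothetical names, not Prisma or Supabase APIs), uses PostgreSQL's set_config with is_local = true so the value lasts only for the current transaction:

```typescript
// Hypothetical minimal database interfaces, for illustration only.
interface Tx {
  query(sql: string, params: unknown[]): Promise<unknown[]>;
}
interface Db {
  transaction<T>(fn: (tx: Tx) => Promise<T>): Promise<T>;
}

async function withTenant<T>(
  db: Db,
  tenantId: string,
  fn: (tx: Tx) => Promise<T>,
): Promise<T> {
  return db.transaction(async (tx) => {
    // set_config(name, value, is_local): with is_local = true the setting is
    // dropped when the transaction ends, so a pooled connection cannot carry
    // one tenant's id into another request.
    await tx.query("SELECT set_config('app.current_tenant_id', $1, true)", [tenantId]);
    return fn(tx);
  });
}
```

Two PostgreSQL details matter here. The policy only applies once row level security is enabled on the table (ALTER TABLE content_embeddings ENABLE ROW LEVEL SECURITY), and roles that bypass RLS, such as the table owner by default or Supabase's service_role, are not constrained by it.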
Ingestion pipeline
POST /api/institution/courses/[id]/rag-ingest { mode: "full" | "incremental" }
  ↓
1. List published lessons for the course
2. For each lesson, extract text by type:
   - rich_text → strip HTML via DOMPurify
   - slides → join text elements + speaker notes
   - quiz → join question + explanation per item
   - pdf → document-extractor (PyPDF + Adobe extract fallback if native extraction fails)
   - video → LiveClassTranscription.transcriptionText (Whisper → Google STT fallback)
   - assignment → instructions
3. Chunking: 800 tokens, 200 overlap, preserving semantic structure
   (does not break paragraphs mid-sentence, does not break code in the middle of a function)
4. Embeddings via Voyage AI (1024 dims, fallback to OpenAI text-embedding-3-large)
5. Creates ContentBlock + ContentEmbedding with metadata:
   { source: "course_lesson", courseId, lessonId, lessonTitle, moduleTitle, ingestionId }
6. Final status in CourseRagIngestion (pending → running → completed | failed)
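Step 2 is essentially a dispatch on lesson type. The sketch below covers only the three types that need no external service (slides, quiz, assignment). The Lesson shape and extractLessonText are assumptions made for illustration; the real lesson schema is not shown in this article.

```typescript
// Assumed lesson shapes; the actual schema may differ.
type Lesson =
  | { type: 'slides'; slides: { texts: string[]; speakerNotes?: string }[] }
  | { type: 'quiz'; items: { question: string; explanation?: string }[] }
  | { type: 'assignment'; instructions: string };

function extractLessonText(lesson: Lesson): string {
  switch (lesson.type) {
    case 'slides':
      // Join each slide's text elements and speaker notes, one slide per paragraph.
      return lesson.slides
        .map((s) => [...s.texts, s.speakerNotes ?? ''].filter(Boolean).join('\n'))
        .join('\n\n');
    case 'quiz':
      // Keep each question next to its explanation so they land in the same chunk.
      return lesson.items
        .map((i) => [i.question, i.explanation ?? ''].filter(Boolean).join('\n'))
        .join('\n\n');
    case 'assignment':
      return lesson.instructions;
  }
}
```

Separating items with a blank line gives a paragraph-aware chunker natural boundaries: one slide or one quiz item per paragraph.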
Semantic chunking — why it matters
Naive chunking (every N chars) breaks context. For example: a lesson contains a Python code snippet that gets split across different chunks — the individual embedding of each half fails to capture the meaning.
Studeia uses a recursive splitter with a hierarchy of separators:
- Tries to split at paragraph (\n\n)
- If not, splits at sentence (. )
- If not, splits at word
- If not (rare), truncates
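To make the fallback order concrete, here is a simplified sketch of a recursive splitter. It is not Studeia's splitWithHierarchy: splitRecursive is a hypothetical name, it measures length in characters as a stand-in for tokens, and it leaves out both overlap and protected ranges.

```typescript
// Simplified sketch: character length stands in for token count,
// and overlap / protected ranges are intentionally omitted.
function splitRecursive(
  text: string,
  maxLen: number,
  separators: string[] = ['\n\n', '. ', ' ', ''],
): string[] {
  if (text.length <= maxLen) return text.length > 0 ? [text] : [];

  const [sep, ...rest] = separators;
  if (sep === undefined || sep === '') {
    // Last resort: cut into fixed-size pieces.
    const pieces: string[] = [];
    for (let i = 0; i < text.length; i += maxLen) pieces.push(text.slice(i, i + maxLen));
    return pieces;
  }

  const parts = text.split(sep);
  // Separator absent: try the next, finer-grained one.
  if (parts.length === 1) return splitRecursive(text, maxLen, rest);

  const chunks: string[] = [];
  let current = '';
  for (const part of parts) {
    const candidate = current ? current + sep + part : part;
    if (candidate.length <= maxLen) {
      current = candidate; // still fits: keep packing parts into this chunk
      continue;
    }
    if (current) chunks.push(current);
    if (part.length > maxLen) {
      // A single part is too large: recurse with finer separators.
      chunks.push(...splitRecursive(part, maxLen, rest));
      current = '';
    } else {
      current = part;
    }
  }
  if (current) chunks.push(current);
  return chunks;
}
```

For example, splitRecursive('aaa bbb ccc', 7) finds no paragraph or sentence breaks, falls through to word level, and returns ['aaa bbb', 'ccc']. One side effect of this simple version is that the separator at a chunk boundary is dropped, so a chunk that ends at a sentence break loses its trailing period.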
It also preserves code blocks INTACT (between triple backticks):
function recursiveChunk(text: string, maxTokens = 800, overlap = 200) {
  // Identify protected ranges (code blocks, markdown tables)
  const protectedRanges = findProtectedRanges(text);
  // Split respecting hierarchy + protection
  return splitWithHierarchy(text, {
    separators: ['\n\n', '. ', ' ', ''],
    maxTokens,
    overlap,
    protectedRanges,
  });
}
Result: chunks of ~600–800 tokens with 200 overlap, semantically coherent.
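The body of findProtectedRanges is not shown above. One plausible shape, offered as an assumption rather than the actual implementation, locates fenced code blocks and returns their offsets so the splitter can refuse to cut inside them. Markdown tables are left out of this sketch, and isSafeSplitPoint is a hypothetical companion helper.

```typescript
interface ProtectedRange { start: number; end: number } // end is exclusive

// Built with repeat() so this example can itself live inside a fenced block.
const FENCE = '`'.repeat(3);

function findProtectedRanges(text: string): ProtectedRange[] {
  const ranges: ProtectedRange[] = [];
  // Non-greedy match from an opening fence to the next closing fence.
  const fenced = new RegExp(FENCE + '[\\s\\S]*?' + FENCE, 'g');
  let m: RegExpExecArray | null;
  while ((m = fenced.exec(text)) !== null) {
    ranges.push({ start: m.index, end: m.index + m[0].length });
  }
  return ranges;
}

// A split offset is safe when it does not fall strictly inside a protected range.
function isSafeSplitPoint(offset: number, ranges: ProtectedRange[]): boolean {
  return !ranges.some((r) => offset > r.start && offset < r.end);
}
```

A splitter can call isSafeSplitPoint before cutting at a candidate separator and, when the answer is false, move the cut to the nearest range boundary instead.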
Voyage AI vs OpenAI — why a different primary
We started with OpenAI text-embedding-3-large. We migrated to Voyage AI as primary in 2026 H1. Reasons:
| Aspect | OpenAI text-emb-3-large | Voyage AI voyage-3 |
|---|---|---|
| Cost / 1K tokens | $0.00013 | $0.00005 |
| Native dimensions | 3072 (reducible via dimensions param) | 1024 native |
| MTEB benchmark (English) | 64.6 | 67.2 |
| MIRACL benchmark (multilingual) | average | better |
| Free-tier rate limits | 3K RPM | 3M tokens/min |
Voyage is ~2.6x cheaper per token, scores higher on the benchmarks above, and has robust multilingual support (important for Studeia's es-ES and fr-FR locales).
Automatic fallback to OpenAI when Voyage has an outage:
async function embedText(texts: string[]) {
try {
return await voyageEmbed(texts);
} catch (err) {
console.warn('[embed] Voyage failed, falling back to OpenAI', err);
return await openaiEmbed(texts, { dimensions: 1024 }); // reduced to 1024 for compatibility
}
}
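Important: both produce 1024-dim vectors, so pgvector accepts them without a schema change. Matching dimensions only satisfies the column type, though: vectors from different models live in different embedding spaces, so cosine similarity between a Voyage query vector and an OpenAI chunk vector is not meaningful. Chunks embedded during a fallback need to be identifiable so they can be re-embedded with the primary model afterwards.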
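One way to keep fallback vectors identifiable is to return the model name alongside the vectors and store it with each chunk. This is a sketch under assumptions: embedWithProvenance and the Provider shape are hypothetical, and the providers are injected rather than calling the real Voyage or OpenAI SDKs.

```typescript
type EmbedFn = (texts: string[]) => Promise<number[][]>;

interface Provider { model: string; embed: EmbedFn }
interface EmbedResult { model: string; vectors: number[][] }

async function embedWithProvenance(
  texts: string[],
  primary: Provider,
  fallback: Provider,
): Promise<EmbedResult> {
  try {
    return { model: primary.model, vectors: await primary.embed(texts) };
  } catch {
    // The caller persists `model` with each chunk, so rows written during an
    // outage can be found and re-embedded once the primary provider recovers.
    return { model: fallback.model, vectors: await fallback.embed(texts) };
  }
}
```

With the model name stored in chunk metadata, a cleanup job can select every row whose model is not the primary one and re-embed only those.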
pgvector tuning in production
Default pgvector is great for <100K vectors. Above that, without tuning, latency degrades.
Studeia config (tested with 500K+ chunks):
-- IVFFlat index
CREATE INDEX content_embeddings_embedding_idx
ON content_embeddings
USING ivfflat (embedding vector_cosine_ops)
WITH (lists = 700); -- sqrt(500000) ≈ 700
-- Query uses probes
SET ivfflat.probes = 15; -- more probes = better recall, more latency
Trade-offs:
- lists too low: slow queries (full scan)
- lists too high: large index, slow inserts
- probes too low: OK latency, poor recall (relevant chunks missed)
- probes too high: high recall, latency degrades
For Studeia in production: lists=700, probes=15. p95 latency = 47ms for top-10 retrieval across 500K chunks.
For 5M+ scale: evaluate HNSW (available since pgvector 0.5.0) or partitioning by tenantId.
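For choosing initial values at a different table size, the pgvector README suggests starting with lists = rows / 1000 for up to 1M rows and sqrt(rows) above that, with probes around sqrt(lists). The helper below encodes only that starting point. suggestIvfflatParams is a hypothetical name, and its output is a first guess to benchmark, not a replacement for the measured values above.

```typescript
// Starting-point heuristic from the pgvector README; tune against real
// recall and latency measurements before relying on the result.
function suggestIvfflatParams(rowCount: number): { lists: number; probes: number } {
  const ONE_MILLION = 1000000;
  const rawLists = rowCount <= ONE_MILLION ? rowCount / 1000 : Math.sqrt(rowCount);
  const lists = Math.max(1, Math.round(rawLists));
  const probes = Math.max(1, Math.round(Math.sqrt(lists)));
  return { lists, probes };
}
```

For 500K rows this yields lists = 500 and probes = 22, in the same range as the production settings quoted above. When queries go through a connection pool, SET LOCAL ivfflat.probes inside the query's transaction keeps the setting from leaking to other sessions that reuse the connection.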
autoSyncRag — incremental rebuild
A course is a living organism. A teacher edits lesson 17. Adds a video. Updates a quiz. The system needs to re-embed only the delta, not the entire course.
Course.autoSyncRag: Boolean @default(false)
When true, every lesson edit via API:
// PATCH /api/institution/courses/[id]/modules/[mid]/lessons/[lid]
await prisma.courseLesson.update({ data: ... });
// Background — does not block the request
after(async () => {
if (course.autoSyncRag) {
await courseRagIngestionService.reingest({
courseId,
mode: "incremental",
onlyLessonId: lessonId,
});
}
});
Incremental re-ingestion:
- Delete old chunks for the lesson (WHERE lesson_id = X)
- Re-extract text from the updated lesson
- Re-chunk
- Re-embed
- Insert new chunks
Time: ~3–8s per average lesson (depends on size), so retrieval reflects an edit within seconds of saving.
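One detail the steps above leave open is what a student query sees between the delete and the insert. A possible arrangement, sketched here with hypothetical interfaces (ChunkStore, ChunkTx, and reingestLesson are not the real courseRagIngestionService API), does the slow chunking and embedding first and then swaps old rows for new ones inside a single transaction:

```typescript
interface NewChunk { lessonId: string; text: string; embedding: number[] }
interface ChunkTx {
  deleteByLesson(lessonId: string): Promise<void>;
  insertMany(chunks: NewChunk[]): Promise<void>;
}
interface ChunkStore {
  transaction<T>(fn: (tx: ChunkTx) => Promise<T>): Promise<T>;
}

async function reingestLesson(
  store: ChunkStore,
  lessonId: string,
  text: string,
  chunk: (text: string) => string[],
  embed: (texts: string[]) => Promise<number[][]>,
): Promise<number> {
  // Slow work happens before any row is touched: if embedding fails,
  // the old chunks remain in place and retrieval keeps working.
  const pieces = chunk(text);
  const vectors = await embed(pieces);
  if (vectors.length !== pieces.length) {
    throw new Error('embedding count does not match chunk count');
  }
  const rows: NewChunk[] = pieces.map((piece, i) => ({
    lessonId,
    text: piece,
    embedding: vectors[i] ?? [],
  }));

  // Delete and insert in one transaction so readers see either the old
  // set or the new set, never an empty lesson.
  await store.transaction(async (tx) => {
    await tx.deleteByLesson(lessonId);
    await tx.insertMany(rows);
  });
  return rows.length;
}
```

The trade-off is that the previous version of the lesson stays retrievable for the few seconds embedding takes, which is usually preferable to a window with no chunks at all.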
Production numbers
| Metric | Value |
|---|---|
| Total chunks in production | ~500K |
| Active tenants | 50+ |
| Courses with RAG ingested | 280+ |
| Largest tenant (chunks) | 47K |
| p50 retrieval latency | 28ms |
| p95 retrieval latency | 47ms |
| p99 retrieval latency | 124ms |
| Embedding cost previous month | $34 (proportional to edit volume) |
| Cross-tenant leakage incidents | 0 (6 months) |
Honest trade-offs
What did NOT work:
- We tried hierarchical retrieval (search the summary first, then full chunks). Complex implementation, marginal quality gain on simple queries. We removed it.
- We tried query reformulation via LLM (passing the student's query through an LLM before embedding to normalize it). Cost doubled (one more LLM call), latency +400ms, quality only marginally better for very vague queries. We only do reformulation in RetrievalAgent when the query is ambiguous (simple heuristic).
- We tried re-ranking via Cohere rerank-3. Expensive ($0.001 per re-rank), latency +200ms. For 90% of queries, pgvector cosine + boost for weak areas is sufficient. We keep re-ranking available but off by default.
What wasn't possible 2 years ago
pgvector reached production readiness in 2022. Voyage AI launched voyage-3 in 2024 H2. Before that, the alternatives (Pinecone, Weaviate, Qdrant) meant either a paid managed service or a separate system to operate, which made multi-tenant setups complex.
Today, with mature pgvector + cheap embeddings + Supabase RLS, production-grade per-tenant RAG is accessible. We recommend it for any serious B2B LMS.