Why per-tenant RAG instead of shared RAG?

Three non-negotiable reasons: (1) LGPD/GDPR compliance — material from one school CANNOT appear in responses for students of another school. (2) Pedagogical quality — institutions have their own material, their own approach, contextualized examples; mixing Stanford content with a local prep course pollutes the response. (3) Commercial confidentiality — premium prep course material is that course's IP; they don't want it exposed to competitors.

What is the cost of embeddings at scale?

Voyage AI, Studeia's primary embedding provider, charges $0.00005 per 1K tokens. Average course: 30 lessons, ~50K words = ~70K tokens. Initial embedding: ~$0.0035 per course. Incremental re-ingestion: ~$0.0001 per edited lesson. For a tenant with 100 courses: ~$0.35 initial setup + ~$5–10/month in deltas. Negligible cost compared to the value delivered.
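The arithmetic above can be reproduced with a small helper. This is only a sketch: `embeddingCostUSD` is a hypothetical name, and the price is the per-1K-token figure quoted in this answer, not a value fetched from Voyage AI.

```typescript
// Hypothetical helper: cost in USD for embedding `tokens` tokens
// at a given price per 1K tokens.
function embeddingCostUSD(tokens: number, pricePer1KTokens: number): number {
  return (tokens / 1000) * pricePer1KTokens;
}

const VOYAGE_PRICE_PER_1K = 0.00005; // figure quoted in the text; check current pricing
const TOKENS_PER_COURSE = 70_000;    // ~30 lessons, ~50K words

const costPerCourse = embeddingCostUSD(TOKENS_PER_COURSE, VOYAGE_PRICE_PER_1K);
const costPerTenant = costPerCourse * 100; // tenant with 100 courses

console.log(costPerCourse.toFixed(4)); // "0.0035"
console.log(costPerTenant.toFixed(2)); // "0.35"
```

The same function covers the incremental case: pass the token count of a single edited lesson instead of the whole course.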

How many vectors can pgvector handle before performance degrades?

pgvector with IVFFlat handles millions of vectors with <100ms latency if the index is well-tuned (lists = sqrt(N), probes = 10–20). HNSW (available since pgvector 0.5.0) is better for scale: 10M+ vectors with <50ms. Studeia runs ~500K chunks in production with a p95 retrieval latency of 47ms. Above 5M, consider partitioning by tenantId or pgvecto.rs.

How do you update embeddings when a lesson changes?

Course.autoSyncRag=true enables automatic incremental re-ingestion via Next.js after(). Every edit via API triggers: delete old chunks for the lesson + chunk new content + embed + insert. No downtime, no full rebuild. For bulk edits, call /api/institution/courses/[id]/rag-ingest with mode='full' to rebuild from scratch.

RAG per-tenant at scale: architecture for B2B LMS

Why real multi-tenancy in RAG is hard

Most LMS platforms with an "AI tutor" rely on problematic approaches:

Shared global RAG — all tenants see the same knowledge base. Functional, but violates compliance and pedagogical quality.
"Per-tenant" via metadata filter without enforcement — chunks have a tenant_id field, but the filter is optional in the query. One bug in one endpoint = data leak.
Separate vector DB per tenant — brutal operational overhead. A thousand tenants = a thousand vector DBs.

Studeia solved this with 3 architectural invariants.

Invariant 1: mandatory tenantId+courseId filter

Every pgvector query in Studeia MUST go through packages/core/src/ai/rag.ts:

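import { Prisma } from '@prisma/client';

export async function retrieve(params: RetrieveParams) {
  if (!params.tenantId && !params.allowGlobal) {
    throw new Error('tenantId required unless allowGlobal=true');
  }

  // vectorStr: the query embedding as a pgvector literal, e.g. '[0.1,0.2,...]'
  // (how it is computed is not shown in this snippet).

  // Collect conditions first so the SQL stays valid when allowGlobal skips the tenant filter.
  const conditions = [Prisma.sql`1 - (ce.embedding <=> ${vectorStr}::vector) > 0.5`];
  if (params.tenantId) {
    conditions.push(Prisma.sql`ce.tenant_id = ${params.tenantId}`);
    if (params.courseId) conditions.push(Prisma.sql`ce.course_id = ${params.courseId}`);
  }

  // Ordering by the distance operator (ascending) lets pgvector use the ANN index;
  // it yields the same order as similarity DESC.
  return prisma.$queryRaw`
    SELECT ce.*, 1 - (ce.embedding <=> ${vectorStr}::vector) AS similarity
    FROM content_embeddings ce
    WHERE ${Prisma.join(conditions, ' AND ')}
    ORDER BY ce.embedding <=> ${vectorStr}::vector
    LIMIT 10
  `;
}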

allowGlobal is only true on explicit administrative routes (global admin testing RAG coverage). Every other call path throws when tenantId is missing.

A critical project rule (rule 6 in CLAUDE.md): "Tenant isolation: all B2B queries filter by tenantId." Automated auditing via Vitest tests verifies that every call to retrieve() in application code passes tenantId.
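One way such an audit could work is a plain text scan over source files. The sketch below is an assumption about the approach, not Studeia's actual test: `findUnscopedRetrieveCalls` is a hypothetical helper, it uses naive parenthesis matching rather than a real parser, and it would flag the administrative `allowGlobal` routes, which would need an explicit allow-list.

```typescript
// Hypothetical audit helper: returns the 1-based line numbers of retrieve(...)
// calls whose argument text never mentions `tenantId`.
// Naive by design: no AST, and parentheses inside string literals are not handled.
function findUnscopedRetrieveCalls(source: string): number[] {
  const offending: number[] = [];
  const callRe = /\bretrieve\s*\(/g;
  let match: RegExpExecArray | null;

  while ((match = callRe.exec(source)) !== null) {
    // Skip the declaration `function retrieve(` itself.
    const before = source.slice(Math.max(0, match.index - 16), match.index);
    if (/function\s+$/.test(before)) continue;

    // Walk forward to the parenthesis that closes this call.
    let depth = 0;
    let end = match.index + match[0].length - 1; // index of the opening '('
    for (let i = end; i < source.length; i++) {
      if (source[i] === '(') depth++;
      else if (source[i] === ')' && --depth === 0) {
        end = i;
        break;
      }
    }

    const callText = source.slice(match.index, end + 1);
    if (!/\btenantId\b/.test(callText)) {
      offending.push(source.slice(0, match.index).split('\n').length);
    }
  }
  return offending;
}
```

In a Vitest suite, a test could read each application file, run this function, and expect an empty array. An AST-based check would be more robust than this text scan.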

Invariant 2: tenantOnlyMode in RetrievalAgent

Even with the correct filter in place, there are cases where a fallback is desired (e.g., B2C without a tenant). To guarantee B2B NEVER leaks:

const chunks = await retrieve({
  query,
  filters: { tenantId, courseId },
  tenantOnlyMode: true,  // <-- CRITICAL
});

tenantOnlyMode: true means: if no chunks exist for the tenant, return empty — do not search the global index. The tutor responds "I don't have material on that in your course" instead of hallucinating.
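The decision described above fits in a few lines. This is a minimal sketch with in-memory stand-ins for the vector search: `Chunk`, `Search`, and `retrieveForTenant` are hypothetical names, and the real RetrievalAgent presumably queries pgvector rather than calling injected functions.

```typescript
interface Chunk {
  id: string;
  text: string;
}

// Hypothetical stand-in for a vector search; a real one would hit the database.
type Search = (query: string) => Chunk[];

function retrieveForTenant(
  query: string,
  searchTenant: Search,
  searchGlobal: Search,
  tenantOnlyMode: boolean,
): Chunk[] {
  const tenantChunks = searchTenant(query);
  // In tenant-only mode an empty result is returned as-is: no global fallback.
  if (tenantChunks.length > 0 || tenantOnlyMode) return tenantChunks;
  return searchGlobal(query);
}
```

With `tenantOnlyMode` set to true, `searchGlobal` is never invoked, so an empty tenant index yields an empty result and the tutor can answer that it has no material on the topic.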

Invariant 3: PostgreSQL RLS as a safety net

Supabase RLS policies add a defense-in-depth layer:

CREATE POLICY tenant_isolation_content_embeddings
ON content_embeddings
FOR SELECT
USING (tenant_id = current_setting('app.current_tenant_id')::uuid);

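If a bug in application code forgets the filter, RLS still restricts the returned rows to the current tenant. This holds only when row-level security is enabled on the table and the application connects with a role that is subject to it: table owners, superusers, and roles with BYPASSRLS skip policies.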

There is a production cost: the policy is evaluated on every query against content_embeddings. The added latency is ~2–5ms, which is acceptable.
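The policy reads `current_setting('app.current_tenant_id')`, so something has to set that value before each query. A common pattern, assumed here rather than confirmed for Studeia, is to call PostgreSQL's `set_config` with `is_local = true` as the first statement of a transaction. The helper name `tenantContextStatement` is hypothetical.

```typescript
// Parameterised statement to run first inside a transaction.
interface SqlStatement {
  text: string;
  values: string[];
}

const UUID_RE = /^[0-9a-f]{8}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{12}$/i;

// Hypothetical helper: builds the statement that scopes the transaction to one tenant.
function tenantContextStatement(tenantId: string): SqlStatement {
  // Reject malformed ids early; the policy casts the setting to uuid anyway.
  if (!UUID_RE.test(tenantId)) {
    throw new Error(`invalid tenant id: ${tenantId}`);
  }
  return {
    // Third argument `true` makes the setting local to the current transaction,
    // so it does not persist on a pooled connection.
    text: "SELECT set_config('app.current_tenant_id', $1, true)",
    values: [tenantId],
  };
}
```

Because the setting is transaction-local, it has to be issued inside the same transaction as the tenant's queries. If the setting is missing, `current_setting` raises an error, so the query fails closed instead of returning rows.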

Ingestion pipeline

POST /api/institution/courses/[id]/rag-ingest { mode: "full" | "incremental" }
  ↓
1. List published lessons for the course
2. For each lesson, extract text by type:
   - rich_text → strip HTML via DOMPurify
   - slides → join text elements + speaker notes
   - quiz → join question + explanation per item
   - pdf → document-extractor (PyPDF + Adobe extract fallback if native extraction fails)
   - video → LiveClassTranscription.transcriptionText (Whisper → Google STT fallback)
   - assignment → instructions
3. Chunking: 800 tokens, 200 overlap, preserving semantic structure
   (does not break paragraphs mid-sentence, does not break code in the middle of a function)
4. Embeddings via Voyage AI (1024 dims, fallback to OpenAI text-embedding-3-large)
5. Creates ContentBlock + ContentEmbedding with metadata:
   { source: "course_lesson", courseId, lessonId, lessonTitle, moduleTitle, ingestionId }
6. Final status in CourseRagIngestion (pending → running → completed | failed)
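Step 2 is essentially a dispatch on lesson type. The sketch below covers three of the listed types. The `Lesson` shapes and field names are assumptions for illustration, and the HTML, PDF, and video paths are left out because they depend on external tools.

```typescript
// Hypothetical lesson shapes; the real schema is not shown in the article.
type Lesson =
  | { type: 'slides'; slides: { texts: string[]; speakerNotes?: string }[] }
  | { type: 'quiz'; items: { question: string; explanation: string }[] }
  | { type: 'assignment'; instructions: string };

function extractText(lesson: Lesson): string {
  switch (lesson.type) {
    case 'slides':
      // Join each slide's text elements and speaker notes; blank line between slides.
      return lesson.slides
        .map((s) =>
          [...s.texts, s.speakerNotes ?? '']
            .filter((t) => t.trim() !== '')
            .join('\n'),
        )
        .join('\n\n');
    case 'quiz':
      // One block per item: the question followed by its explanation.
      return lesson.items
        .map((item) => `${item.question}\n${item.explanation}`)
        .join('\n\n');
    case 'assignment':
      return lesson.instructions;
  }
}
```

Separating items with blank lines gives a paragraph-aware chunker natural split points between slides and quiz items.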

Semantic chunking — why it matters

Naive chunking (every N chars) breaks context. For example: a lesson contains a Python code snippet that gets split across different chunks — the individual embedding of each half fails to capture the meaning.

Studeia uses a recursive splitter with a hierarchy of separators:

First tries to split at paragraph boundaries (\n\n)
If a piece is still too large, splits at sentence boundaries (. )
If that is still too large, splits at word boundaries
As a last resort (rare), truncates

It also preserves code blocks INTACT (between triple backticks):

// findProtectedRanges and splitWithHierarchy are internal helpers, not shown here.
function recursiveChunk(text: string, maxTokens = 800, overlap = 200) {
  // Identify protected ranges (code blocks, markdown tables)
  const protectedRanges = findProtectedRanges(text);

  // Split respecting hierarchy + protection
  return splitWithHierarchy(text, {
    separators: ['\n\n', '. ', ' ', ''],
    maxTokens,
    overlap,
    protectedRanges,
  });
}

Result: chunks of ~600–800 tokens with 200 overlap, semantically coherent.
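To make the separator hierarchy concrete, here is a simplified recursive splitter. It is a sketch under stated assumptions, not the `splitWithHierarchy` helper referenced above: tokens are approximated by whitespace-separated words, and both overlap and protected ranges are omitted.

```typescript
// Rough token estimate (assumption): one token per whitespace-separated word.
function countTokens(text: string): number {
  const trimmed = text.trim();
  return trimmed === '' ? 0 : trimmed.split(/\s+/).length;
}

function recursiveSplit(
  text: string,
  maxTokens: number,
  separators: string[] = ['\n\n', '. ', ' '],
): string[] {
  if (countTokens(text) <= maxTokens) {
    return text.trim() === '' ? [] : [text.trim()];
  }
  const [sep, ...rest] = separators;
  if (sep === undefined) {
    // No separator left: hard-cut into windows of maxTokens words.
    const words = text.trim().split(/\s+/);
    const cut: string[] = [];
    for (let i = 0; i < words.length; i += maxTokens) {
      cut.push(words.slice(i, i + maxTokens).join(' '));
    }
    return cut;
  }

  const chunks: string[] = [];
  let current = '';
  for (const piece of text.split(sep)) {
    // Greedily merge neighbouring pieces while they still fit.
    const candidate = current === '' ? piece : current + sep + piece;
    if (countTokens(candidate) <= maxTokens) {
      current = candidate;
      continue;
    }
    if (current.trim() !== '') chunks.push(current.trim());
    current = '';
    if (countTokens(piece) <= maxTokens) {
      current = piece;
    } else {
      // Piece alone is too large: retry with the next, finer separator.
      chunks.push(...recursiveSplit(piece, maxTokens, rest));
    }
  }
  if (current.trim() !== '') chunks.push(current.trim());
  return chunks;
}

console.log(recursiveSplit('one two three\n\nfour five six\n\nseven eight', 4));
```

A separator that falls on a chunk boundary is dropped in this simplified version (for example the ". " at the end of a sentence). A production splitter would keep it and would also treat protected ranges as indivisible pieces.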

Voyage AI vs OpenAI — why a different primary

We started with OpenAI text-embedding-3-large. We migrated to Voyage AI as primary in 2026 H1. Reasons:

Aspect	OpenAI text-emb-3-large	Voyage AI voyage-3
Cost / 1K tokens	$0.00013	$0.00005
Native dimensions	3072 (reducible via dimensions param)	1024 native
MTEB benchmark (English)	64.6	67.2
MIRACL benchmark (multilingual)	average	better
Free-tier rate limits	3K RPM	3M tokens/min

Voyage is ~2.6x cheaper, scores higher on the benchmarks in the table, and offers robust multilingual support (important for Studeia's es-ES and fr-FR locales).

Automatic fallback to OpenAI when Voyage has an outage:

async function embedText(texts: string[]) {
  try {
    return await voyageEmbed(texts);
  } catch (err) {
    console.warn('[embed] Voyage failed, falling back to OpenAI', err);
    return await openaiEmbed(texts, { dimensions: 1024 });  // reduced to 1024 for compatibility
  }
}

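Important: both produce 1024-dim vectors, so pgvector accepts them without a schema change. Matching dimensions only satisfies the column type, though: vectors from different models live in different embedding spaces, so similarity scores between a Voyage query vector and an OpenAI chunk vector are not meaningful. Chunks embedded through the fallback should be tagged with the model that produced them and re-embedded with the primary model once it recovers.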
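Since the column has a fixed dimension, it can help to validate each embedding before it reaches SQL. The sketch below checks the length and serialises the array into pgvector's text format (`[x1,x2,...]`), the kind of value a `::vector` cast expects. `toVectorLiteral` is a hypothetical helper name, not part of any library.

```typescript
// Hypothetical helper: validate an embedding and format it as a pgvector literal.
function toVectorLiteral(embedding: number[], expectedDims = 1024): string {
  if (embedding.length !== expectedDims) {
    throw new Error(`expected ${expectedDims} dimensions, got ${embedding.length}`);
  }
  if (!embedding.every((x) => Number.isFinite(x))) {
    throw new Error('embedding contains NaN or Infinity');
  }
  return `[${embedding.join(',')}]`;
}

console.log(toVectorLiteral([0.1, 0.2, 0.3], 3)); // "[0.1,0.2,0.3]"
```

Failing here gives a clearer error than a rejected insert, for example when a provider is called without the reduced `dimensions` setting.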

pgvector tuning in production

Default pgvector is great for <100K vectors. Above that, without tuning, latency degrades.

Studeia config (tested with 500K+ chunks):

-- IVFFlat index
CREATE INDEX content_embeddings_embedding_idx
ON content_embeddings
USING ivfflat (embedding vector_cosine_ops)
WITH (lists = 700);  -- sqrt(500000) ≈ 700

-- Query uses probes
SET ivfflat.probes = 15;  -- more probes = better recall, more latency

Trade-offs:

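lists too low: slow queries (each list holds many vectors, so a probe approaches a full scan)
lists too high: large index, slow inserts, and lower recall unless probes is raised as well
probes too low: OK latency, poor recall (relevant chunks missed)
probes too high: high recall, latency degrades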

For Studeia in production: lists=700, probes=15. p95 latency = 47ms for top-10 retrieval across 500K chunks.
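The sizing rule can be written as a small function for recomputing starting values as the table grows. This is a sketch: `lists` follows the article's sqrt(N) rule, while the `probes` value uses the sqrt(lists) starting point suggested in the pgvector README, which is an assumption about where to begin tuning rather than Studeia's method. `suggestIvfflatParams` is a hypothetical name.

```typescript
interface IvfflatParams {
  lists: number;
  probes: number;
}

// Hypothetical helper: starting values for IVFFlat tuning, to be refined by measurement.
function suggestIvfflatParams(rowCount: number): IvfflatParams {
  if (!Number.isInteger(rowCount) || rowCount <= 0) {
    throw new Error('rowCount must be a positive integer');
  }
  const lists = Math.max(1, Math.round(Math.sqrt(rowCount))); // sqrt(N), as in the text
  const probes = Math.max(1, Math.round(Math.sqrt(lists)));   // sqrt(lists) starting point
  return { lists, probes };
}

console.log(suggestIvfflatParams(500_000)); // { lists: 707, probes: 27 }
```

The production value of probes=15 quoted above is lower than this starting point, which would trade some recall for latency. The right setting depends on recall measured against real queries.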

For 5M+ scale: evaluate HNSW (available since pgvector 0.5.0) or partitioning by tenantId.

autoSyncRag — incremental rebuild

A course is a living organism. A teacher edits lesson 17. Adds a video. Updates a quiz. The system needs to re-embed only the delta, not the entire course.

Course.autoSyncRag: Boolean @default(false)

When true, every lesson edit via API:

import { after } from 'next/server';

// PATCH /api/institution/courses/[id]/modules/[mid]/lessons/[lid]
await prisma.courseLesson.update({ data: ... });

// Background — does not block the request
after(async () => {
  if (course.autoSyncRag) {
    await courseRagIngestionService.reingest({
      courseId,
      mode: "incremental",
      onlyLessonId: lessonId,
    });
  }
});

Incremental re-ingestion:

Delete old chunks for the lesson (WHERE lesson_id = X)
Re-extract text from the updated lesson
Re-chunk
Re-embed
Insert new chunks

Time: ~3–8s per average lesson (depends on size). The window in which students can see stale RAG content is limited to those few seconds after an edit.
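The delete-then-insert steps amount to replacing one lesson's chunks while leaving every other lesson untouched. The sketch below shows that swap on an in-memory array. `StoredChunk` and `replaceLessonChunks` are hypothetical names, and the real service works against Postgres rather than an array.

```typescript
interface StoredChunk {
  lessonId: string;
  index: number;
  text: string;
}

// Returns a new array instead of mutating, so a reader holding the old array
// never observes a lesson whose chunks are only partly replaced.
function replaceLessonChunks(
  store: StoredChunk[],
  lessonId: string,
  newTexts: string[],
): StoredChunk[] {
  const kept = store.filter((chunk) => chunk.lessonId !== lessonId);
  const fresh = newTexts.map((text, index) => ({ lessonId, index, text }));
  return [...kept, ...fresh];
}
```

In a database, a comparable guarantee would come from running the delete and the inserts in a single transaction, so queries never see the lesson with zero chunks. Whether Studeia's service does this is not stated in the article.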

Production numbers

Metric	Value
Total chunks in production	~500K
Active tenants	50+
Courses with RAG ingested	280+
Largest tenant (chunks)	47K
p50 retrieval latency	28ms
p95 retrieval latency	47ms
p99 retrieval latency	124ms
Embedding cost previous month	$34 (proportional to edit volume)
Cross-tenant leakage incidents	0 (6 months)

Honest trade-offs

What did NOT work:

We tried hierarchical retrieval (search the summary first, then full chunks). Complex implementation, marginal quality gain on simple queries. We removed it.
We tried query reformulation via LLM (passing the student's query through an LLM before embedding to normalize it). Cost doubled (one more LLM call), latency +400ms, quality only marginally better for very vague queries. We only do reformulation in RetrievalAgent when the query is ambiguous (simple heuristic).
We tried re-ranking via Cohere rerank-3. Expensive ($0.001 per re-rank), latency +200ms. For 90% of queries, pgvector cosine + boost for weak areas is sufficient. We keep re-ranking available but off by default.

What wasn't possible 2 years ago

pgvector reached production readiness in 2022. Voyage AI launched voyage-3 in 2024 H2. Before that, alternatives (Pinecone, Weaviate, Qdrant) were paid + operationally complex for multi-tenant setups.

Today, with mature pgvector + cheap embeddings + Supabase RLS, production-grade per-tenant RAG is accessible. We recommend it for any serious B2B LMS.