Why a Multi-Agent Pipeline
When we started building Studeia's AI tutor, the temptation was obvious: call the Claude API with a long system prompt and the student's messages. It works in demos. It fails in production.
Four structural problems:
- No persistent student memory: each turn treats the student as a blank slate. The student just got concept X wrong 3 times in quizzes? The LLM doesn't know. The tutor repeats a basic explanation it already gave last week.
- No grounding in course material: the institution has handouts, slides, and video-lecture transcripts. A plain LLM can't access them. It makes up facts. It cites wrong concepts.
- No real moderation: a system prompt saying "be safe" is weak. A student in mental distress shows up. A student attempts a jailbreak. A student uses inappropriate language. The tutor needs to react appropriately without shutting down functionality for legitimate students.
- No feedback loop: the system doesn't learn. The same misconceptions repeat. The instructor has no visibility into the class's collective weak spots.
Solution: separate responsibilities into specialized agents.
The Architecture
Student message
    ↓
PRE-LLM (synchronous, zero LLM cost)
    1. StudentModelService.getSnapshot()
    2. RetrievalAgent.retrieve()
    3. PedagogicalAgent.select()
    4. buildEnrichedPrompt()
    ↓
MAIN LLM (streaming, SSE to client)
    5. router.stream() with automatic fallback
       Claude Sonnet → GPT-4o → Grok-3 → Gemini Pro
    ↓
POST-LLM (background via after(), fire-and-forget)
    6. EvaluationAgent (Haiku)
    7. ContentAgent (Haiku)
    8. SupervisorAgent (Haiku)
Let's walk through each one.
1. StudentModelService: the "Cognitive Profile"
Before any LLM call, we load the student's snapshot:
const snapshot = await StudentModelService.getSnapshot({
  userId,
  courseId,
});
// Returns (shape shown as pseudo-types):
{
  conceptMastery: Map<conceptId, { probability, confidenceInterval }>,
  misconceptions: Misconception[], // active + resolving
  episodicMemory: Episode[],       // what worked before
  quizContext: {
    totalAttempts,
    avgScore,
    passRate,
    weakAreas: string[]            // concepts with mastery < 0.4
  },
  recentHistory: Message[]         // sliding window of the last 10 messages
}
ConceptMastery uses a Bayesian Beta distribution: each concept tracks alpha (successes) and beta (failures). The mastery probability is the posterior mean, alpha / (alpha + beta), and the confidence interval spans the 5th to 95th percentiles of that distribution.
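To make the Beta bookkeeping concrete, here is a minimal sketch. The names (`BetaMastery`, `updateMastery`, `betaQuantile`) are hypothetical, and so is the one-count-per-observation update. Only alpha, beta, the mean formula, and the 5th/95th percentiles come from the description above. The quantile is approximated on a grid, which is adequate for small counts but is not a production-grade inverse CDF.

```typescript
// Hypothetical shape: alpha and beta include whatever prior the system starts from.
interface BetaMastery {
  alpha: number; // successes
  beta: number; // failures
}

// Assumed update rule: each observation adds one count to alpha or beta.
function updateMastery(m: BetaMastery, correct: boolean): BetaMastery {
  return correct
    ? { alpha: m.alpha + 1, beta: m.beta }
    : { alpha: m.alpha, beta: m.beta + 1 };
}

// Posterior mean: alpha / (alpha + beta).
function masteryProbability(m: BetaMastery): number {
  return m.alpha / (m.alpha + m.beta);
}

// Approximate quantile: accumulate the unnormalized Beta density over
// midpoints of equal-width bins until the requested mass q is reached.
function betaQuantile(m: BetaMastery, q: number, bins = 2000): number {
  const density = (x: number): number =>
    Math.pow(x, m.alpha - 1) * Math.pow(1 - x, m.beta - 1);
  let total = 0;
  for (let i = 0; i < bins; i++) {
    total += density((i + 0.5) / bins);
  }
  let cumulative = 0;
  for (let i = 0; i < bins; i++) {
    const x = (i + 0.5) / bins;
    cumulative += density(x) / total;
    if (cumulative >= q) return x;
  }
  return 1 - 0.5 / bins;
}
```

Under these assumptions, the confidence interval for a concept would be `[betaQuantile(m, 0.05), betaQuantile(m, 0.95)]`, and it narrows as more observations accumulate.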
EpisodicMemory records pedagogical insights: "the pizza analogy worked for explaining fractions," "the water-pipe metaphor failed for electricity." The system learns what works with each student.
Zero LLM cost. Everything is Prisma queries + deterministic computation.
2. RetrievalAgent: Tenant-Scoped RAG
Instead of having the LLM try to recall facts about math, biology, or history, we let it cite the institution's own material.
const chunks = await retrieve({
  query: reformulatedQuery,                          // 1. reformulates query with context
  filters: { tenantId, courseId },                   // 2. absolute isolation
  k: 10,
  tenantOnlyMode: true,                              // 3. never cites another institution's content
  boostByWeakAreas: snapshot.quizContext.weakAreas,  // 4. prioritizes chunks from weak areas
});
Per-tenant RAG is critical. Test-prep school XYZ has its own material for college entrance exams. University ABC has its own Calculus material. The tutor cites the institution's CORRECT material, not a generic aggregate.
Each chunk has metadata: { source: "course_lesson", courseId, lessonId, lessonTitle, moduleTitle }. When the tutor responds, it cites: "As explained in the lesson 'Analytic Geometry' in module 3…"
Voyage AI generates the embeddings (1024 dimensions, with OpenAI as fallback) and pgvector stores them. Setting tenantOnlyMode: true ensures every query includes WHERE tenantId = X. This is a critical project rule: zero cross-tenant leakage.
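One way to enforce that rule is to make the tenant filter part of the query builder itself, so that an unscoped query cannot be constructed. The sketch below is an illustration, not Studeia's actual code: the table and column names (`"Chunk"`, `embedding`, `"tenantId"`, `"courseId"`) and the function name are assumptions. `<=>` is pgvector's cosine-distance operator.

```typescript
// A parameterized query: SQL text plus positional values.
interface ScopedQuery {
  text: string;
  values: (string | number)[];
}

// Hypothetical builder: the tenant filter is hard-coded into the SQL,
// and an empty tenantId is rejected before any query exists.
function buildTenantScopedSearch(
  tenantId: string,
  courseId: string,
  queryEmbedding: number[],
  k: number,
): ScopedQuery {
  if (!tenantId) {
    throw new Error("tenantId is required for retrieval");
  }
  const text = [
    "SELECT id, content, metadata",
    'FROM "Chunk"',
    'WHERE "tenantId" = $2 AND "courseId" = $3',
    "ORDER BY embedding <=> $1::vector",
    "LIMIT $4",
  ].join("\n");
  // pgvector accepts a text literal such as '[0.1,0.2,0.3]' cast to vector.
  const vectorLiteral = `[${queryEmbedding.join(",")}]`;
  return { text, values: [vectorLiteral, tenantId, courseId, k] };
}
```

The design choice is that isolation does not depend on callers remembering a flag: there is no code path that produces the search SQL without the tenant predicate.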
3. PedagogicalAgent: Strategy Adaptation
Pure determinism. Evaluates the student's mastery in the specific domain and selects one of 5 strategies:
| Mastery | Strategy | Behavior |
|---|---|---|
| < 0.3 | direct_instruction | Clear explanation, concrete examples, step-by-step |
| 0.3–0.5 | scaffolding | Progressive hints, simple guided questions |
| 0.5–0.7 | socratic | Questions that lead to discovery |
| 0.7–0.9 | guided_practice | Exercises with feedback, practical application |
| > 0.9 | challenge | Complex problems, connections between concepts |
Additional adjustments for quiz vs. chat divergence:
- High chat mastery + low quiz score → "surface-level understanding" → nudge DOWN
- Low mastery + high quiz score → "quiet student" → nudge UP
- Quiz pass rate < 40% → cap at scaffolding (don't advance to socratic yet)
Also adjusts for age (User.isMinor), learning style, and domain (math vs. literature have different profiles).
Zero LLM cost. Output: selected strategy + specific instructions to add to the system prompt.
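The mastery table and the pass-rate cap translate almost directly into code. The sketch below is a hypothetical rendering: the function names are invented, and the handling of exact boundaries (for example, whether 0.9 itself counts as guided_practice) is an assumption, since the table does not say. The chat-versus-quiz nudges are left out because their thresholds are not specified.

```typescript
type Strategy =
  | "direct_instruction"
  | "scaffolding"
  | "socratic"
  | "guided_practice"
  | "challenge";

// Maps mastery to a strategy following the table; boundary handling is assumed.
function baseStrategy(mastery: number): Strategy {
  if (mastery < 0.3) return "direct_instruction";
  if (mastery < 0.5) return "scaffolding";
  if (mastery < 0.7) return "socratic";
  if (mastery <= 0.9) return "guided_practice";
  return "challenge";
}

// Applies the quiz cap: a pass rate under 40% never goes beyond scaffolding.
// quizPassRate is null when the student has no quiz attempts yet.
function selectStrategy(mastery: number, quizPassRate: number | null): Strategy {
  const base = baseStrategy(mastery);
  const capped = quizPassRate !== null && quizPassRate < 0.4;
  if (capped && base !== "direct_instruction") {
    return "scaffolding";
  }
  return base;
}
```

Because the function is pure, the same inputs always yield the same strategy, which makes the selection easy to unit-test and to explain to an instructor.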
4. Orchestrator: buildEnrichedPrompt
Assembles the enriched system prompt:
You are an AI tutor for the course "Calculus I" at "XYZ Test Prep School."
STUDENT DOMAIN KNOWLEDGE:
- Limits: mastery 0.78 (high)
- Derivatives: mastery 0.42 (medium)
- Integrals: mastery 0.15 (low)
ACTIVE MISCONCEPTIONS:
- "Student confuses domain with range in functions" (3 occurrences, status: resolving)
- "Student applies the sum rule for derivatives to products" (5 occurrences, status: active)
QUIZ PERFORMANCE:
- 14 total attempts, avgScore 67%, passRate 71%
- Weak areas: integrals (avg 45%), chain rule (avg 52%)
PEDAGOGICAL STRATEGY: guided_practice
- Student has medium mastery of derivatives. Present graduated exercises.
- Reinforce the connection between limits and derivatives (they already master limits).
- Proactively address the misconception about the product rule for derivatives.
RAG CONTEXT (from course material):
[Lesson 3.2 "Product Rule"] (Module: Differential Calculus)
"The derivative of f(x)·g(x) is NOT f'(x)·g'(x). The correct rule is..."
[Lesson 3.5 "Solved Exercises"] (Module: Differential Calculus)
"Example: differentiate (x^2 + 1)·(x - 3) using the product rule..."
RECENT QUIZ (next conversation):
Student just answered an inline quiz with 2 questions, got 1 correct.
INSTRUCTIONS:
- Respond in English
- Cite course material when relevant (use [Lesson X.Y])
- Acknowledge what the student got right before pointing out errors
- For this age group (User.ageRange = "young_adult"): casual language without being too informal
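Assembling a prompt like the one above is plain string templating over the snapshot. The following sketch covers only three of the sections and is not the real buildEnrichedPrompt: the input type, the function names, and the low/medium/high cutoffs (0.3 and 0.7, chosen to match the labels in the example) are all assumptions.

```typescript
// Hypothetical input: a simplified slice of the student snapshot.
interface PromptInput {
  courseName: string;
  institutionName: string;
  mastery: Record<string, number>;
  misconceptions: {
    description: string;
    occurrences: number;
    status: "active" | "resolving";
  }[];
  strategy: string;
}

// Assumed cutoffs for the human-readable mastery label.
function masteryLabel(p: number): string {
  if (p < 0.3) return "low";
  if (p < 0.7) return "medium";
  return "high";
}

function buildEnrichedPromptSketch(input: PromptInput): string {
  const masteryLines = Object.entries(input.mastery).map(
    ([concept, p]) => `- ${concept}: mastery ${p.toFixed(2)} (${masteryLabel(p)})`,
  );
  const misconceptionLines = input.misconceptions.map(
    (m) => `- "${m.description}" (${m.occurrences} occurrences, status: ${m.status})`,
  );
  return [
    `You are an AI tutor for the course "${input.courseName}" at "${input.institutionName}."`,
    "STUDENT DOMAIN KNOWLEDGE:",
    ...masteryLines,
    "ACTIVE MISCONCEPTIONS:",
    ...misconceptionLines,
    `PEDAGOGICAL STRATEGY: ${input.strategy}`,
  ].join("\n");
}
```

Keeping this step deterministic means a given snapshot always produces the same prompt, which helps when debugging why the tutor answered the way it did.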
5. Main LLM: Streaming with Fallback
const stream = await router.stream({
  taskType: "chat_tutor",
  messages: enrichedMessages,
  options: { tenantId, userId, sessionId },
});
// This loop runs inside an async generator whose output is sent to the client as SSE
for await (const chunk of stream.textStream) {
  yield chunk;
}
The LLM Router:
- Resolves provider via TenantTaskModelConfig (admin chose Claude Sonnet, GPT-4o, etc.)
- Resolves API key via cascade: TenantApiKey → global ProviderApiKey → process.env
- Circuit breaker check (Redis state). If provider is OPEN: skip directly to fallback
- Metering middleware: rate limit + credit check + cost calculator
- Streams via Vercel AI SDK (supports tools, multimodal, structured output)
- On error: automatic fallback to next provider in chain
Fallback chain by tier:
Sonnet tier (medium): Claude Sonnet → GPT-4o → Grok-3-fast → Gemini Pro
Haiku tier (fast): Claude Haiku → GPT-4o-mini → Grok-3-mini → Gemini Flash
Opus tier (complex): Claude Opus → GPT-4.5 → Grok-3 → Gemini 2.5 Pro
Tenants NEVER lose access to their tutor. If Anthropic goes down, OpenAI takes over. If OpenAI also goes down, xAI takes over, and so on down the chain.
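The interplay between the circuit breaker and the fallback chain can be sketched as follows. This is an illustration under assumptions, not the router's real code: `planAttempts`, `callWithFallback`, and the in-memory `Map` standing in for the Redis-backed breaker state are all hypothetical. The sketch covers non-streaming calls only; falling back after a stream has already emitted tokens needs extra handling that is not shown here.

```typescript
type CircuitState = "CLOSED" | "OPEN" | "HALF_OPEN";

// Providers whose circuit is OPEN are skipped without being called.
// A provider with no recorded state is treated as available.
function planAttempts(
  chain: string[],
  circuits: Map<string, CircuitState>,
): string[] {
  return chain.filter((provider) => circuits.get(provider) !== "OPEN");
}

// Tries each remaining provider in order and returns the first success.
async function callWithFallback<T>(
  chain: string[],
  circuits: Map<string, CircuitState>,
  call: (provider: string) => Promise<T>,
): Promise<{ provider: string; result: T }> {
  const errors: string[] = [];
  for (const provider of planAttempts(chain, circuits)) {
    try {
      return { provider, result: await call(provider) };
    } catch (err) {
      // Record the failure and move on to the next provider in the chain.
      const message = err instanceof Error ? err.message : String(err);
      errors.push(`${provider}: ${message}`);
    }
  }
  throw new Error(`All providers failed or were skipped. ${errors.join("; ")}`);
}
```

Skipping OPEN providers up front avoids paying a timeout on a provider already known to be down, which is what keeps fallback latency low during an outage.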
6. EvaluationAgent: Feedback Loop
Runs in the background after the tutor responds:
after(async () => {
  const evaluation = await router.generateDirect({
    taskType: "chat_evaluation",
    messages: [
      { role: "user", content: "Student said: '...'. Tutor responded: '...'. Classify." },
    ],
  });
  // evaluation: {
  //   understanding: "partial",
  //   detectedMisconceptions: [{ description, concepts, severity }],
  //   suggestedNextStep: "..."
  // }

  // Updates ConceptMastery via Bayesian update
  await conceptMasteryEngine.updateFromTurn({
    userId, courseId, evaluation,
  });

  // Persists/updates misconceptions
  for (const misc of evaluation.detectedMisconceptions) {
    await misconceptionResolutionService.upsert({
      userId, source: "chat", ...misc,
    });
  }
});
Cost: ~$0.001 per turn (Haiku). Does NOT block the student's request.
Misconceptions have a 3-state lifecycle: active → resolving → resolved. A state machine determines transitions based on evidence (mastery update, quiz pass, tutor explicitly addressed it).
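A state machine of this kind can be written as a single pure transition function. The three states and the three evidence types come from the description above. The specific transition rules below, and the extra `recurred` evidence that sends a misconception back to active, are assumptions made for illustration.

```typescript
type MisconceptionStatus = "active" | "resolving" | "resolved";

// "recurred" is a hypothetical evidence type, not one named in the text.
type Evidence =
  | "tutor_addressed"
  | "mastery_improved"
  | "quiz_passed"
  | "recurred";

// Assumed rules: any positive evidence moves active to resolving;
// resolving needs measurable evidence (quiz or mastery) to become resolved;
// a recurrence always reopens the misconception.
function nextStatus(
  current: MisconceptionStatus,
  evidence: Evidence,
): MisconceptionStatus {
  if (evidence === "recurred") return "active";
  switch (current) {
    case "active":
      return "resolving";
    case "resolving":
      return evidence === "tutor_addressed" ? "resolving" : "resolved";
    case "resolved":
      return "resolved";
  }
}
```

Under these rules the tutor merely explaining something is never enough to mark a misconception resolved. The student has to demonstrate it through a quiz pass or a mastery improvement.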
7. ContentAgent: Proactive Pre-Generation
after(async () => {
  // Student shows weak mastery of concept X
  // Pre-generates a follow-up exercise while the student reads the current response
  const exercise = await router.generateDirect({
    taskType: "content_generation",
    messages: [...],
  });
  // Redis cache for 30 min (1800 s); the exact set() signature depends on the Redis client wrapper in use
  await redis.set(`next-exercise:${userId}:${conceptId}`, exercise, 1800);
});
When the student finishes reading the response and says "give me an exercise," Studeia serves it INSTANTLY from cache. No perceptible latency.
Cost: ~$0.001 per turn (Haiku).
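The read side of this cache might look like the sketch below, which uses an in-memory map as a stand-in for Redis so the logic can be run on its own. Everything here is hypothetical: the `TtlCache` class, the serve-once `take` semantics, and passing the clock in as a parameter. Only the key format and the 30-minute TTL come from the snippet above.

```typescript
interface CacheEntry {
  value: string;
  expiresAtMs: number;
}

// In-memory stand-in for Redis with per-key expiry.
class TtlCache {
  private entries = new Map<string, CacheEntry>();

  set(key: string, value: string, ttlSeconds: number, nowMs: number): void {
    this.entries.set(key, { value, expiresAtMs: nowMs + ttlSeconds * 1000 });
  }

  // Returns the value once and removes it; expired entries count as a miss.
  take(key: string, nowMs: number): string | null {
    const entry = this.entries.get(key);
    if (entry === undefined) return null;
    this.entries.delete(key);
    return nowMs < entry.expiresAtMs ? entry.value : null;
  }
}

// Key format taken from the ContentAgent snippet.
function exerciseKey(userId: string, conceptId: string): string {
  return `next-exercise:${userId}:${conceptId}`;
}
```

On a miss, whether from expiry or because nothing was pre-generated, the caller would fall back to generating the exercise on demand, so the cache only ever removes latency and never changes behavior.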
8. SupervisorAgent: Moderation
Runs in the background after each turn. Classifies into 5 severity levels × 8 categories.
Categories: inappropriate language, violence, illegal, sexual, off_topic, harassment, self_harm, jailbreak_attempt.
Severity: low → medium → high → critical → safety.
3 strikes (LOW/MEDIUM within 7 days) = 48-hour quarantine. CRITICAL = 7-day quarantine.
Self-harm (severity=safety) NEVER penalizes the student. Instead:
- Tutor is interrupted with a supportive message
- Crisis resources (US: 988 Suicide & Crisis Lifeline)
- 24-hour Redis cooldown (not quarantine)
- URGENT immediate email to the institutional admin
Philosophy: self-harm is a crisis, not a violation. Details in Safety Supervisor.
Cost: ~$0.001 per turn (Haiku).
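The sanction rules in this section reduce to a small decision function. The sketch below is a hypothetical encoding: the type and function names are invented, and because the text does not say what a HIGH flag triggers, the sketch returns an explicit `unspecified` result for it rather than guessing.

```typescript
type Severity = "low" | "medium" | "high" | "critical" | "safety";

interface Flag {
  severity: Severity;
  atMs: number; // epoch milliseconds
}

type Sanction =
  | { kind: "none" }
  | { kind: "unspecified" } // HIGH: policy not described in the text
  | { kind: "quarantine"; hours: number }
  | { kind: "crisis_support"; cooldownHours: number };

const HOUR_MS = 60 * 60 * 1000;
const STRIKE_WINDOW_MS = 7 * 24 * HOUR_MS;

// `history` holds earlier flags for the same student, excluding `latest`.
function decideSanction(latest: Flag, history: Flag[]): Sanction {
  // Self-harm is a crisis, not a violation: support and cooldown, never quarantine.
  if (latest.severity === "safety") {
    return { kind: "crisis_support", cooldownHours: 24 };
  }
  if (latest.severity === "critical") {
    return { kind: "quarantine", hours: 7 * 24 };
  }
  if (latest.severity === "high") {
    return { kind: "unspecified" };
  }
  // LOW/MEDIUM: count strikes in the 7 days up to and including this flag.
  const windowStart = latest.atMs - STRIKE_WINDOW_MS;
  const earlierStrikes = history.filter(
    (f) =>
      (f.severity === "low" || f.severity === "medium") &&
      f.atMs >= windowStart &&
      f.atMs <= latest.atMs,
  ).length;
  return earlierStrikes + 1 >= 3
    ? { kind: "quarantine", hours: 48 }
    : { kind: "none" };
}
```

Checking `safety` first guarantees that no combination of earlier strikes can turn a self-harm signal into a penalty.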
Production Numbers
After 6 months in production:
- ~30 ms additional latency from pre-LLM agents
- ~$0.005–$0.05 average cost per turn
- 91% student retention rate after 7 days (vs. ~40% benchmark for AI tutors without state)
- 3.2× misconception detection rate vs. single-call baseline
- 0 serious safety incidents (high/critical categories)
Honest Trade-offs
Things that did NOT work:
- We tried an LLM-driven "MasterAgent" coordinator to dynamically choose the next agent. Cost doubled, latency increased by 800 ms, and quality did NOT improve. We reverted to deterministic sequencing in the Orchestrator.
- We tried fine-tuning Llama on course material. It was expensive to repeat for each tenant. RAG works better for dynamic knowledge: institutions update material every week, so a fine-tune would go stale.
- We tried "consensus" across 3 LLMs (Claude + GPT + Gemini), picking the majority response. It cost 3× as much with no meaningful quality gain. We removed it; the fallback chain is sufficient.
Open Source?
We're evaluating open-sourcing the deterministic components (StudentModelService, RetrievalAgent, PedagogicalAgent) as an npm package. The LLM-driven agents (Evaluation, Content, Supervisor) contain prompts that are Studeia IP and will remain closed.
If you're interested: open an issue at github.com/donattocosta-lang/studeia/issues.