Why multi-agent instead of a single LLM call?

A single LLM call has NO persistent student memory, does NOT know which course materials to cite (that requires RAG), does NOT moderate inappropriate outputs, and does NOT update the Bayesian model of the student's domain knowledge. A multi-agent pipeline assigns each problem to a specialized component: StudentModelService maintains state, RetrievalAgent performs the tenant-scoped RAG lookup, PedagogicalAgent selects the strategy, EvaluationAgent classifies misconceptions, ContentAgent pre-generates follow-ups, and SupervisorAgent moderates. Each can be optimized independently.

How much does a full pipeline turn cost?

Typically $0.005–$0.05 per turn, depending on response size. Breakdown: main LLM (Sonnet) $0.005–$0.04 + EvaluationAgent (Haiku) $0.001 + ContentAgent (Haiku) $0.001 + SupervisorAgent (Haiku) $0.001. Deterministic TypeScript components such as StudentModelService, RetrievalAgent, and PedagogicalAgent have zero LLM cost.

How do you prevent costs from exploding with 1,000+ students?

Four mechanisms: (1) Background agents run via Next.js `after()`, so they do not block the request. (2) Haiku handles background tasks (~30x cheaper than Sonnet). (3) Metering middleware enforces per-user rate limiting. (4) Tenants can bring their own API key (TenantApiKey). Studeia does not charge an AI margin; costs go directly to the tenant's account at Anthropic/OpenAI.

How We Built a Multi-Agent AI Tutor Pipeline for Online Learning

Why a Multi-Agent Pipeline

When we started building Studeia's AI tutor, the temptation was obvious: call the Claude API with a long system prompt and the student's messages. It works in demos. It fails in production.

Four structural problems:

No persistent student memory — each turn treats the student as a blank slate. The student just got concept X wrong 3 times in quizzes? The LLM doesn't know. The tutor repeats a basic explanation it already gave last week.
No grounding in course material — the institution has handouts, slides, and video-lecture transcripts. A plain LLM can't access them. It makes up facts. It cites wrong concepts.
No real moderation — a system prompt saying "be safe" is weak. A student in mental distress shows up. A student attempts a jailbreak. A student uses inappropriate language. The tutor needs to react appropriately without shutting down functionality for legitimate students.
No feedback loop — the system doesn't learn. The same misconceptions repeat. The instructor has no visibility into the class's collective weak spots.

Solution: separate responsibilities into specialized agents.

The Architecture

Student message
  ↓
PRE-LLM (synchronous, zero LLM cost)
  1. StudentModelService.getSnapshot()
  2. RetrievalAgent.retrieve()
  3. PedagogicalAgent.select()
  4. buildEnrichedPrompt()
  ↓
MAIN LLM (streaming, SSE to client)
  5. router.stream() with automatic fallback
     Claude Sonnet → GPT-4o → Grok-3-fast → Gemini Pro
  ↓
POST-LLM (background via after(), fire-and-forget)
  6. EvaluationAgent (Haiku)
  7. ContentAgent (Haiku)
  8. SupervisorAgent (Haiku)

Let's walk through each one.

1. StudentModelService — the "Cognitive Profile"

Before any LLM call, we load the student's snapshot:

const snapshot = await StudentModelService.getSnapshot({
  userId,
  courseId,
});

// Returns:
{
  conceptMastery: Map<conceptId, { probability, confidenceInterval }>,
  misconceptions: Misconception[],  // active + resolving
  episodicMemory: Episode[],         // what worked before
  quizContext: {
    totalAttempts,
    avgScore,
    passRate,
    weakAreas: string[]              // concepts with mastery < 0.4
  },
  recentHistory: Message[]           // sliding window of 10 msgs
}

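ConceptMastery uses a Bayesian Beta distribution: each concept keeps two counts, alpha (successes) and beta (failures). The mastery probability is the posterior mean, alpha / (alpha + beta). The confidence interval spans the 5th to 95th percentiles of that distribution, which makes it a 90% interval.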
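
To make the update rule concrete, here is a minimal sketch of Beta-based mastery tracking. The type and function names (`BetaMastery`, `observe`, `masteryProbability`, `masteryVariance`) and the uniform Beta(1, 1) prior are assumptions for illustration, not Studeia's actual code. The exact 5th and 95th percentiles require the inverse regularized incomplete beta function, which is left out here; the variance is shown instead as a simple measure of how uncertainty shrinks with evidence.

```typescript
// Illustrative sketch only: names and the prior are assumptions.
interface BetaMastery {
  alpha: number; // pseudo-count of successes
  beta: number;  // pseudo-count of failures
}

// Assumption: a concept with no evidence starts at a uniform Beta(1, 1) prior.
const UNIFORM_PRIOR: BetaMastery = { alpha: 1, beta: 1 };

// One observation increments exactly one of the two counts.
function observe(m: BetaMastery, correct: boolean): BetaMastery {
  return correct
    ? { alpha: m.alpha + 1, beta: m.beta }
    : { alpha: m.alpha, beta: m.beta + 1 };
}

// Posterior mean: alpha / (alpha + beta).
function masteryProbability(m: BetaMastery): number {
  return m.alpha / (m.alpha + m.beta);
}

// Variance of a Beta(alpha, beta) distribution; it shrinks as evidence accumulates.
function masteryVariance(m: BetaMastery): number {
  const n = m.alpha + m.beta;
  return (m.alpha * m.beta) / (n * n * (n + 1));
}
```

Starting from the prior, three correct answers and one wrong answer give alpha = 4 and beta = 2, so the mastery estimate is 4 / 6, with a smaller variance than the prior had.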

EpisodicMemory records pedagogical insights: "the pizza analogy worked for explaining fractions," "the water-pipe metaphor failed for electricity." The system learns what works with each student.

Zero LLM cost. Everything is Prisma queries + deterministic computation.

2. RetrievalAgent — Tenant-Scoped RAG

Instead of having the LLM try to recall facts about math, biology, or history, we let it cite the institution's own material.

const chunks = await retrieve({
  query: reformulatedQuery,         // 1. reformulates query with context
  filters: { tenantId, courseId },  // 2. absolute isolation
  k: 10,
  tenantOnlyMode: true,             // 3. never cites another institution's content
  boostByWeakAreas: snapshot.quizContext.weakAreas,  // 4. prioritizes chunks from weak areas
});

Per-tenant RAG is critical. Test-prep school XYZ has its own material for college entrance exams. University ABC has its own Calculus material. The tutor cites the institution's CORRECT material, not a generic aggregate.

Each chunk has metadata: { source: "course_lesson", courseId, lessonId, lessonTitle, moduleTitle }. When the tutor responds, it cites: "As explained in the lesson 'Analytic Geometry' in module 3…"

Voyage AI generates the embeddings (1024 dimensions, with an OpenAI fallback) and pgvector stores them. Setting tenantOnlyMode: true ensures that WHERE tenantId = X is always part of the query. This is a critical project rule: zero cross-tenant leakage.
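
One way to enforce the tenant filter is to make it impossible to build a retrieval query without it. The sketch below builds a parameterized pgvector query under that assumption. The table name `CourseChunk`, the column names, and the function `buildTenantScopedQuery` are hypothetical; only the `<=>` cosine-distance operator and the `::vector` cast come from pgvector itself.

```typescript
// Illustrative sketch: table, column, and function names are assumptions.
interface TenantScopedSearch {
  tenantId: string;
  courseId: string;
  queryEmbedding: number[];
  k: number;
}

interface SqlQuery {
  text: string;
  values: unknown[];
}

function buildTenantScopedQuery(p: TenantScopedSearch): SqlQuery {
  // Refuse to build any query that is not scoped to a tenant.
  if (!p.tenantId) throw new Error("tenantId is required for every retrieval");
  if (!Number.isInteger(p.k) || p.k <= 0) throw new Error("k must be a positive integer");

  // pgvector accepts a text literal such as '[0.1,0.2]' cast to vector.
  const vectorLiteral = `[${p.queryEmbedding.join(",")}]`;

  const text = [
    'SELECT "id", "content", "lessonTitle", "moduleTitle"',
    'FROM "CourseChunk"',
    'WHERE "tenantId" = $2 AND "courseId" = $3',
    'ORDER BY "embedding" <=> $1::vector', // cosine distance, nearest first
    "LIMIT $4",
  ].join("\n");

  return { text, values: [vectorLiteral, p.tenantId, p.courseId, p.k] };
}
```

Because the tenant and course IDs travel as bound parameters rather than interpolated strings, the same function also avoids SQL injection through those fields.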

3. PedagogicalAgent — Strategy Adaptation

Pure determinism. Evaluates the student's mastery in the specific domain and selects one of 5 strategies:

Mastery	Strategy	Behavior
< 0.3	direct_instruction	Clear explanation, concrete examples, step-by-step
0.3–0.5	scaffolding	Progressive hints, simple guided questions
0.5–0.7	socratic	Questions that lead to discovery
0.7–0.9	guided_practice	Exercises with feedback, practical application
> 0.9	challenge	Complex problems, connections between concepts

Additional adjustments for quiz vs. chat divergence:

High chat mastery + low quiz score → "surface-level understanding" → nudge DOWN
Low mastery + high quiz score → "quiet student" → nudge UP
Quiz pass rate < 40% → cap at scaffolding (don't advance to socratic yet)

Also adjusts for age (User.isMinor), learning style, and domain (math vs. literature have different profiles).

Zero LLM cost. Output: selected strategy + specific instructions to add to the system prompt.
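
The mastery bands and the quiz cap can be expressed as a small pure function. This is a sketch under stated assumptions: the function names are hypothetical, a band's lower bound is treated as inclusive (the table does not say which side the boundaries fall on), and the nudge rules are omitted because the article gives no thresholds for them.

```typescript
// Illustrative sketch: names and boundary handling are assumptions.
type Strategy =
  | "direct_instruction"
  | "scaffolding"
  | "socratic"
  | "guided_practice"
  | "challenge";

// Strategies ordered from most to least supportive.
const STRATEGY_ORDER: Strategy[] = [
  "direct_instruction",
  "scaffolding",
  "socratic",
  "guided_practice",
  "challenge",
];

// Maps a mastery probability in [0, 1] to the band from the table.
function baseStrategy(mastery: number): Strategy {
  if (mastery < 0.3) return "direct_instruction";
  if (mastery < 0.5) return "scaffolding";
  if (mastery < 0.7) return "socratic";
  if (mastery <= 0.9) return "guided_practice";
  return "challenge"; // strictly above 0.9
}

// Applies the quiz cap: a pass rate below 40% never goes past scaffolding.
function selectStrategy(mastery: number, quizPassRate?: number): Strategy {
  const base = baseStrategy(mastery);
  if (quizPassRate !== undefined && quizPassRate < 0.4) {
    const cap = STRATEGY_ORDER.indexOf("scaffolding");
    return STRATEGY_ORDER.indexOf(base) > cap ? "scaffolding" : base;
  }
  return base;
}
```

For example, a student at mastery 0.8 would normally get guided_practice, but with a quiz pass rate of 30% the cap brings the selection down to scaffolding.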

4. Orchestrator — buildEnrichedPrompt

Assembles the enriched system prompt:

You are an AI tutor for the course "Calculus I" at "XYZ Test Prep School."

STUDENT DOMAIN KNOWLEDGE:
- Limits: mastery 0.78 (high)
- Derivatives: mastery 0.42 (medium)
- Integrals: mastery 0.15 (low)

ACTIVE MISCONCEPTIONS:
- "Student confuses domain with range in functions" (3 occurrences, status: resolving)
- "Student applies the sum rule for derivatives to products" (5 occurrences, status: active)

QUIZ PERFORMANCE:
- 14 total attempts, avgScore 67%, passRate 71%
- Weak areas: integrals (avg 45%), chain rule (avg 52%)

PEDAGOGICAL STRATEGY: scaffolding
- Student has medium mastery of derivatives (0.42). Use progressive hints and simple guided questions.
- Reinforce the connection between limits and derivatives (they already master limits).
- Proactively address the misconception about the product rule for derivatives.

RAG CONTEXT (from course material):
[Lesson 3.2 "Product Rule"] (Module: Differential Calculus)
"The derivative of f(x)·g(x) is NOT f'(x)·g'(x). The correct rule is..."

[Lesson 3.5 "Solved Exercises"] (Module: Differential Calculus)
"Example: differentiate (x^2 + 1)·(x - 3) using the product rule..."

RECENT QUIZ (next conversation):
Student just answered an inline quiz with 2 questions, got 1 correct.

INSTRUCTIONS:
- Respond in English
- Cite course material when relevant (use [Lesson X.Y])
- Acknowledge what the student got right before pointing out errors
- For this age group (User.ageRange = "young_adult"): casual language without being too informal
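
Each section of this prompt is plain string assembly over the snapshot. As an illustration, the mastery section could be rendered as in the sketch below. The names `ConceptEntry`, `masteryLabel`, and `renderMasterySection` are hypothetical, and the label thresholds (high at 0.7 and above, medium at 0.3 and above) are an assumption chosen only to be consistent with the three example values shown in the prompt.

```typescript
// Illustrative sketch: names and label thresholds are assumptions.
interface ConceptEntry {
  name: string;
  probability: number; // mastery estimate in [0, 1]
}

function masteryLabel(p: number): "low" | "medium" | "high" {
  if (p >= 0.7) return "high";
  if (p >= 0.3) return "medium";
  return "low";
}

// Renders one prompt section as a header followed by one line per concept.
function renderMasterySection(concepts: ConceptEntry[]): string {
  const lines = concepts.map(
    (c) => `- ${c.name}: mastery ${c.probability.toFixed(2)} (${masteryLabel(c.probability)})`
  );
  return ["STUDENT DOMAIN KNOWLEDGE:", ...lines].join("\n");
}
```

Keeping each section in its own pure function makes the prompt easy to unit-test without calling any model.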

5. Main LLM — Streaming with Fallback

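// Wrapped in an async generator so that `yield` is valid; the wrapper name is illustrative.
async function* streamTutorReply() {
  const stream = await router.stream({
    taskType: "chat_tutor",
    messages: enrichedMessages,
    options: { tenantId, userId, sessionId }
  });

  for await (const chunk of stream.textStream) {
    yield chunk;  // SSE to client
  }
}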

The LLM Router:

Resolves provider via TenantTaskModelConfig (admin chose Claude Sonnet, GPT-4o, etc.)
Resolves API key via cascade: TenantApiKey → global ProviderApiKey → process.env
Circuit breaker check (Redis state). If the provider's circuit is OPEN, skip directly to the fallback
Metering middleware: rate limit + credit check + cost calculator
Streams via Vercel AI SDK (supports tools, multimodal, structured output)
On error: automatic fallback to next provider in chain
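
The API key cascade in the second step amounts to a first-match lookup. A minimal sketch, assuming the three sources have already been fetched: the `KeySources` shape and `resolveApiKey` are hypothetical names, and the fields stand in for the TenantApiKey record, the global ProviderApiKey record, and the process.env value.

```typescript
// Illustrative sketch: the shape of the inputs is an assumption.
interface KeySources {
  tenantKey?: string; // stands in for a TenantApiKey lookup
  globalKey?: string; // stands in for a global ProviderApiKey lookup
  envKey?: string;    // stands in for a process.env value
}

interface ResolvedKey {
  key: string;
  source: "tenant" | "global" | "env";
}

// The first configured source wins, in the order tenant, global, env.
function resolveApiKey(s: KeySources): ResolvedKey {
  if (s.tenantKey) return { key: s.tenantKey, source: "tenant" };
  if (s.globalKey) return { key: s.globalKey, source: "global" };
  if (s.envKey) return { key: s.envKey, source: "env" };
  throw new Error("no API key configured for this provider");
}
```

Returning the source alongside the key lets metering attribute the cost to the tenant's own account when a tenant key was used.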

Fallback chain by tier:

Sonnet tier (medium): Claude Sonnet → GPT-4o → Grok-3-fast → Gemini Pro
Haiku tier (fast):    Claude Haiku → GPT-4o-mini → Grok-3-mini → Gemini Flash
Opus tier (complex):  Claude Opus → GPT-4.5 → Grok-3 → Gemini 2.5 Pro

Tenants keep access to their tutor through single-provider outages. If Anthropic goes down, OpenAI takes over. If OpenAI is also down, xAI takes over, and then Google.
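
The interaction between the circuit breaker and the fallback chain can be sketched as a loop over providers. This is a simplified illustration, not the router itself: the `Provider` type and `callWithFallback` are hypothetical, an in-memory Map stands in for the Redis-backed breaker state, and a single failure opens the circuit, whereas production breakers typically use failure thresholds and a half-open recovery state.

```typescript
// Illustrative sketch: an in-memory Map replaces Redis, and the breaker is deliberately naive.
type BreakerState = "CLOSED" | "OPEN";

interface Provider {
  name: string;
  call: (prompt: string) => Promise<string>;
}

async function callWithFallback(
  chain: Provider[],
  prompt: string,
  breaker: Map<string, BreakerState>
): Promise<{ provider: string; text: string }> {
  const errors: string[] = [];
  for (const p of chain) {
    // Skip providers whose circuit is open without attempting a call.
    if (breaker.get(p.name) === "OPEN") {
      errors.push(`${p.name}: circuit open`);
      continue;
    }
    try {
      const text = await p.call(prompt);
      return { provider: p.name, text };
    } catch (err) {
      // Assumption: one failure opens the circuit for this provider.
      breaker.set(p.name, "OPEN");
      errors.push(`${p.name}: ${err instanceof Error ? err.message : String(err)}`);
    }
  }
  throw new Error(`all providers failed: ${errors.join("; ")}`);
}
```

The call only rejects when every provider in the chain is either open or failing, which is the one case where a tenant would actually lose service.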

6. EvaluationAgent — Feedback Loop

Runs in the background after the tutor responds:

import { after } from "next/server";

after(async () => {
  const evaluation = await router.generateDirect({
    taskType: "chat_evaluation",
    messages: [
      { role: "user", content: "Student said: '...'. Tutor responded: '...'. Classify." }
    ]
  });

  // evaluation: {
  //   understanding: "partial",
  //   detectedMisconceptions: [{ description, concepts, severity }],
  //   suggestedNextStep: "..."
  // }

  // Updates ConceptMastery via Bayesian update
  await conceptMasteryEngine.updateFromTurn({
    userId, courseId, evaluation
  });

  // Persists/updates misconceptions
  for (const misc of evaluation.detectedMisconceptions) {
    await misconceptionResolutionService.upsert({
      userId, source: "chat", ...misc
    });
  }
});

Cost: ~$0.001 per turn (Haiku). Does NOT block the student's request.

Misconceptions have a 3-state lifecycle: active → resolving → resolved. A state machine determines transitions based on evidence (mastery update, quiz pass, tutor explicitly addressed it).
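
A lifecycle like this fits a small transition function. The sketch below uses the three evidence types named above, but the specific transition rules are assumptions made for illustration: any evidence moves an active misconception to resolving, and only a mastery update or a quiz pass moves it from resolving to resolved. The type and function names are hypothetical.

```typescript
// Illustrative sketch: the transition rules are assumptions, not the production state machine.
type MisconceptionStatus = "active" | "resolving" | "resolved";
type Evidence = "tutor_addressed" | "mastery_update" | "quiz_pass";

function nextStatus(current: MisconceptionStatus, evidence: Evidence): MisconceptionStatus {
  switch (current) {
    case "active":
      // Assumption: any positive evidence starts the resolution process.
      return "resolving";
    case "resolving":
      // Assumption: the tutor addressing it again is not proof of understanding.
      return evidence === "tutor_addressed" ? "resolving" : "resolved";
    case "resolved":
      return "resolved";
  }
}

// Folds a sequence of evidence events into a final status.
function applyEvidence(start: MisconceptionStatus, events: Evidence[]): MisconceptionStatus {
  return events.reduce(nextStatus, start);
}
```

Under these rules, an explanation from the tutor followed by a passed quiz takes a misconception from active to resolved, while repeated explanations alone leave it in resolving.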

7. ContentAgent — Proactive Pre-Generation

after(async () => {
  // Student shows weak mastery of concept X
  // Pre-generates a follow-up exercise while the student reads the current response
  const exercise = await router.generateDirect({
    taskType: "content_generation",
    messages: [...]
  });

  // Redis cache for 30 min (ioredis-style call assumed: serialized value, TTL in seconds via "EX")
  await redis.set(`next-exercise:${userId}:${conceptId}`, JSON.stringify(exercise), "EX", 1800);
});

When the student finishes reading the response and says "give me an exercise," Studeia serves it INSTANTLY from cache. No perceptible latency.

Cost: ~$0.001 per turn (Haiku).
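
The read side of this pattern is a cache lookup with an on-demand fallback. The sketch below illustrates it with an in-memory TTL cache standing in for Redis. The class `TtlCache`, the helper `takeExercise`, and the injectable clock are hypothetical; only the key format and the 30-minute TTL come from the snippet above.

```typescript
// Illustrative sketch: an in-memory cache replaces Redis so the example is self-contained.
interface CacheEntry {
  value: string;
  expiresAt: number; // epoch milliseconds
}

class TtlCache {
  private store = new Map<string, CacheEntry>();
  private now: () => number;

  constructor(now: () => number = () => Date.now()) {
    this.now = now;
  }

  set(key: string, value: string, ttlSeconds: number): void {
    this.store.set(key, { value, expiresAt: this.now() + ttlSeconds * 1000 });
  }

  get(key: string): string | undefined {
    const entry = this.store.get(key);
    if (!entry) return undefined;
    if (this.now() >= entry.expiresAt) {
      this.store.delete(key); // expired entries are dropped lazily on read
      return undefined;
    }
    return entry.value;
  }
}

const exerciseKey = (userId: string, conceptId: string): string =>
  `next-exercise:${userId}:${conceptId}`;

// Serves the pre-generated exercise if present, otherwise generates one on demand.
function takeExercise(
  cache: TtlCache,
  userId: string,
  conceptId: string,
  generateNow: () => string
): { exercise: string; fromCache: boolean } {
  const cached = cache.get(exerciseKey(userId, conceptId));
  if (cached !== undefined) return { exercise: cached, fromCache: true };
  return { exercise: generateNow(), fromCache: false };
}
```

If the student asks after the 30-minute window, the request falls back to normal generation latency rather than failing.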

8. SupervisorAgent — Moderation

Runs in the background after each turn. Classifies into 5 severity levels × 8 categories.

Categories: inappropriate_language, violence, illegal, sexual, off_topic, harassment, self_harm, jailbreak_attempt.

Severity: low → medium → high → critical → safety.

Three LOW or MEDIUM strikes within 7 days trigger a 48-hour quarantine. A CRITICAL event triggers a 7-day quarantine.

Self-harm (severity=safety) NEVER penalizes the student. Instead:

The tutor is interrupted with a supportive message
Crisis resources are shown (US: 988 Suicide & Crisis Lifeline)
A 24-hour Redis cooldown applies (not a quarantine)
An URGENT email goes immediately to the institutional admin

Philosophy: self-harm is a crisis, not a violation. Details in Safety Supervisor.

Cost: ~$0.001 per turn (Haiku).
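
The penalty rules above can be sketched as a pure decision function over a student's flag history. All names here (`Flag`, `Action`, `decideAction`) are hypothetical. The article does not state what happens for HIGH severity, so this sketch takes no automatic action for it, which is an assumption rather than the real policy.

```typescript
// Illustrative sketch: names are assumptions, and HIGH severity handling is not specified in the article.
type Severity = "low" | "medium" | "high" | "critical" | "safety";

interface Flag {
  severity: Severity;
  at: number; // epoch milliseconds
}

type Action =
  | { kind: "none" }
  | { kind: "quarantine"; hours: number }
  | { kind: "support"; cooldownHours: number; notifyAdmin: true };

const HOUR = 60 * 60 * 1000;
const DAY = 24 * HOUR;

function decideAction(history: Flag[], current: Flag): Action {
  // Self-harm is treated as a crisis: support and cooldown, never a penalty.
  if (current.severity === "safety") {
    return { kind: "support", cooldownHours: 24, notifyAdmin: true };
  }
  if (current.severity === "critical") {
    return { kind: "quarantine", hours: 7 * 24 };
  }
  if (current.severity === "high") {
    return { kind: "none" }; // assumption: no rule is given for HIGH
  }
  // LOW and MEDIUM flags count as strikes inside a rolling 7-day window.
  const windowStart = current.at - 7 * DAY;
  const strikes = [...history, current].filter(
    (f) =>
      (f.severity === "low" || f.severity === "medium") &&
      f.at > windowStart &&
      f.at <= current.at
  ).length;
  return strikes >= 3 ? { kind: "quarantine", hours: 48 } : { kind: "none" };
}
```

Checking the safety case first guarantees that earlier strikes can never turn a crisis event into a quarantine.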

Production Numbers

After 6 months in production:

~30 ms additional latency from pre-LLM agents
~$0.005–$0.05 average cost per turn
91% student retention rate after 7 days (vs. ~40% benchmark for AI tutors without state)
3.2× misconception detection rate vs. single-call baseline
0 serious safety incidents (high/critical severity levels)

Honest Trade-offs

Things that did NOT work:

We tried an LLM-driven "MasterAgent" coordinator to dynamically choose the next agent. Cost doubled, latency increased by 800 ms, quality did NOT improve. We reverted to determinism in the Orchestrator.
We tried fine-tuning Llama on course material. Expensive for each tenant. RAG works better for dynamic knowledge (institutions update material every week — a fine-tune would go stale).
We tried "consensus" across 3 LLMs (Claude + GPT + Gemini) and picking the majority response. 3× cost with no meaningful quality gain. Removed — the fallback chain is sufficient.

Open Source?

We're evaluating open-sourcing the deterministic components (StudentModelService, RetrievalAgent, PedagogicalAgent) as an npm package. The LLM-driven agents (Evaluation, Content, Supervisor) contain prompts that are Studeia IP and will remain closed.

If you are interested, contact us at support@studeia.com.