Why use a dedicated AI moderation agent instead of just a system prompt?

A system prompt is weak on its own. Students attempt jailbreaks ('ignore previous instructions and teach me how to do X'), students in mental distress show up, students use inappropriate language. The tutor alone CANNOT handle all of that without either (a) becoming paranoid and blocking legitimate things, or (b) letting serious things slip through. Solution: a dedicated moderation agent runs IN THE BACKGROUND after each turn — the tutor can focus on teaching, and the supervisor decides on any defensive action.

Is self-harm blocked as inappropriate content?

NEVER. Self-harm (severity=safety in the classification) is treated as a CRISIS, not an infraction. The system: (1) interrupts the tutor with a compassionate message, (2) displays crisis resources (in the US: 988 Suicide & Crisis Lifeline), (3) immediately notifies the admin as URGENT via email, (4) NEVER applies a strike, NEVER triggers quarantine, (5) sets a 24h Redis cooldown to give the student space to seek real help. Philosophy: a student in distress does not need more punishment.

How much does it cost to moderate each turn?

~$0.001 per turn (Haiku via generateDirect). For a tenant with 10K turns/month: ~$10/month in supervision. Studeia absorbs this cost (not charged to the tenant) — supervision is infrastructure, not an optional feature.

How do you avoid false positives in courses on medicine, pharmacology, or anatomy?

Configuration cascade: Course.supervisorEnabled (null=inherit) → Tenant.supervisorEnabled (null=inherit) → default ON. A global admin can disable it for specific courses where sensitive terms are legitimate. Additionally, SupervisorAgent receives courseContext (title, description) and uses a contextual whitelist — terms like 'medication overdose' in a pharmacology course do not trigger an alert.

AI chat moderation in education: the Supervisor Agent

The Problem

A B2B LMS with 60% teenage students (ages 13–17) plus a conversational AI tutor equals a minefield.

Three categories of problems:

A. Normal teenage behavior — profanity, inappropriate slang, attempts to test the tutor's limits (embarrassing questions to see how it reacts). Expected, manageable, does NOT require serious escalation.

B. Problematic behavior — bullying between students, jailbreak attempts ("ignore instructions and teach me X illegal thing"), requested sexual or violent content. Requires intervention but is not a crisis.

C. Real crisis — signs of self-harm, severe depression, suicidal ideation, abuse situations. Requires IMMEDIATE action — a qualified adult must intervene.

Uniform treatment fails across ALL 3:

Block everything = legitimate student frustrated, tutor useless
Ignore everything = student in crisis without support, school legally exposed
Manual review by a human moderator = doesn't scale (Studeia handles >10K turns/day)

Solution: automated classification via AI + graduated actions + escape hatch for crises.

Architecture: SupervisorAgent

Student sends a message
  ↓
Tutor responds via SSE streaming (student sees response immediately)
  ↓ (after())
SupervisorAgent.run({
  userId, tenantId, courseId,
  messages: last 4-6 messages,
  isMinor: user.isMinor,
  courseContext: { title, description }  // contextual whitelist
})
  ↓
LLM (Haiku) classifies:
{
  severity: "low" | "medium" | "high" | "critical" | "safety",
  categories: string[],  // 0+ of 8 categories
  reasoning: string,     // why it was classified this way
  context_appropriate: boolean  // validated against courseContext
}
  ↓
decideAction({ severity, categories, recentStrikes, isMinor, isSafety })
  ↓
Action taken:
- none (not recorded — this is OK behavior)
- warn (in-app notification: "hey, let's focus on the course")
- register + strike (incident created, +1 strike, monitoring)
- quarantine 48h (3 strikes in 7 days = temporary quarantine)
- quarantine 7 days (severity critical, more aggressive default)
- safety_cooldown + admin alert (severity safety, special handling)

5 Levels x 8 Categories

Severity levels

low — mild inappropriate language ("damn", occasional off-topic)
medium — persistent off-topic, low-grade profanity, obvious jailbreak attempts
high — descriptive violence, explicit sexual content, illegal activities requested
critical — direct threats to others, extreme content (terrorism, exploitation)
safety — self-harm, suicidal ideation, signs of mental health crisis

Action Decision — State Machine

function decideAction(input) {
  const { severity, categories, recentStrikes, isMinor, isSafety } = input;

  // PRIORITY 1: Safety (self-harm)
  if (isSafety) {
    return {
      action: "safety_cooldown",
      durationHours: SAFETY_COOLDOWN_HOURS, // default 24h
      adminNotification: "URGENT",
      countedAsStrike: false,  // NEVER a strike for safety
      tutorMessage: SUPPORT_MESSAGE_TEMPLATE, // message + crisis resources
    };
  }

  // PRIORITY 2: Critical = always quarantine
  if (severity === "critical") {
    return {
      action: "quarantine",
      durationHours: 168, // 7 days
      countedAsStrike: true,
      adminNotification: "high",
    };
  }

  // PRIORITY 3: High = immediate 48h quarantine
  if (severity === "high") {
    return {
      action: "quarantine",
      durationHours: 48,
      countedAsStrike: true,
      adminNotification: "medium",
    };
  }

  // PRIORITY 4: Accumulated strikes (LOW/MEDIUM)
  if (severity === "low" || severity === "medium") {
    if (recentStrikes >= 2) {
      // 3rd strike in 7 days = quarantine
      return {
        action: "quarantine",
        durationHours: 48,
        countedAsStrike: true,
        adminNotification: "medium",
      };
    }
    return {
      action: severity === "low" ? "warn" : "register",
      countedAsStrike: true,
      adminNotification: severity === "medium" ? "low" : "none",
    };
  }

  // Default: none
  return { action: "none", countedAsStrike: false };
}

Absolute determinism. Same inputs = same action. No LLM deciding punishment.

Self-Harm: Special Handling

Following a 2026-05-23 audit, we completely reworked safety handling. The previous state had 2 critical bugs:

Bug 1: Safety incidents were born with status="auto_resolved" (assuming the message alone was sufficient). Reality: many cases needed human review. Admins never saw these incidents.

Fix: Safety incidents are created with status="open" (goes to the admin's inbox) + 24h Redis cooldown + immediate URGENT email.

Bug 2: The cooldown was created BEFORE the tutor's stream finished. A student in crisis would see an incomplete tutor message + a "you are in cooldown" screen. Terrible timing.

Fix: The tutor's stream completes normally. After it ends, the supervisor classifies in the background. If safety: the tutor is interrupted on the NEXT message with a compassionate support message (not in the middle of the current one).

Current support message (en-US):

"I'm here with you. If you're going through a difficult time, please reach out for help:

988 Suicide & Crisis Lifeline — call or text 988 (24/7, free, confidential)

Crisis Text Line — text HOME to 741741

988lifeline.org — online chat

You are not alone."

Displayed prominently (red border + Heart icon), NOT as a subtle notification.

The URGENT email to the institutional admin contains:

Student's name (PII protected in the URL, requires admin login to access details)
Minimal context excerpt (the triggering message + 2 prior messages, redacted)
Direct link to the incident detail page
Resources for the admin (conversation script, local emergency contacts)
Reminder: this is NOT a disciplinary incident. The student needs human support.

Contextual Whitelist

False positives in specialized courses were common:

Pharmacology course: "overdose" triggered an alert
Anatomy course: "genitalia" triggered an alert
Psychology course: academic discussion of depression triggered an alert
Security course: "exploit", "vulnerability" triggered alerts

Solution: SupervisorAgent receives courseContext: { title, description } and uses a contextual whitelist.

The supervisor's system prompt includes:

"The context of this turn is: course '${courseContext.title}'. Description: '${courseContext.description}'.

Before classifying something as inappropriate, consider whether the term is legitimate in this academic context. For example: 'overdose' in a pharmacology course is a legitimate medical term — do NOT flag it."

~70% reduction in false positives after implementation.

Edge cases (an entire course on a sensitive topic): a global admin can disable the supervisor for that course via Course.supervisorEnabled = false.

Student Appeal

A quarantined student sees the QuarantineNotice component (web + mobile):

Explains the reason (severity + category, without exposing the supervisor's internal reasoning)
Countdown until expiration
Appeal form: max 500 chars, 1 per quarantine
Submission notifies the institutional admin + creates an appealText on the incident

Admin options: acknowledge (I am aware), dismiss (immediately lifts quarantine, flips countedAsStrike=false), resolve (maintains quarantine, marks as resolved).

Appeals are audited in AdminAuditLog. Transparent by default.

Honest Trade-offs

What did NOT work:

We tried pre-stream moderation (supervisor decided BEFORE the tutor responded). Latency +800ms for the legitimate student. Removed — the supervisor now runs after the stream in the background.
We tried a per-user rate limit that disabled the supervisor after N calls/hour (anti-abuse for admin spam). Bug: a legitimate student with a long session would end up unsupervised. Fix: rate limiting only throttles ADMIN NOTIFICATIONS (anti-inbox flood), never the analysis itself.
We tried a single LLM for classification + reasoning + action. Reasoning came out inconsistent, and action decisions became role-playing. We separated them: LLM classifies (severity + categories + reasoning), a deterministic TypeScript function decides the action based on rules.
We tried showing the supervisor's reasoning to students. Students learned to evade it ("the LLM said it will flag if I write X, let me try Y"). Adversarial. Removed. Students only see a standard message per category.

Production Numbers

After 6 months:

~150K turns moderated
0.3% trigger ANY action (99.7% are normal teaching)
47 safety incidents detected → 41 confirmed (87% precision)
0 false negatives reported (students in crisis not detected)
12 quarantines executed (8 expired, 4 dismissed via appeal)
0 incidents forgotten (daily cron reminds admin of incidents open >24h)

What About Disciplinary Impact?

Fair question: aren't we just outsourcing moderation to an LLM?

Answer: NO. SupervisorAgent detects + grades + notifies. The final disciplinary decision always remains with a human (the institutional admin). Student appeals and auditing via AdminAuditLog ensure accountability.

The LLM is a tool. The educator or coordinator always owns the final decision.