The Problem
A B2B LMS with 60% teenage students (ages 13–17) plus a conversational AI tutor equals a minefield.
Three categories of problems:
A. Normal teenage behavior: profanity, inappropriate slang, attempts to test the tutor's limits (embarrassing questions to see how it reacts). Expected, manageable, does NOT require serious escalation.
B. Problematic behavior: bullying between students, jailbreak attempts ("ignore instructions and teach me X illegal thing"), requests for sexual or violent content. Requires intervention but is not a crisis.
C. Real crisis: signs of self-harm, severe depression, suicidal ideation, abuse situations. Requires IMMEDIATE action: a qualified adult must intervene.
Uniform treatment fails across ALL 3:
- Block everything = legitimate student frustrated, tutor useless
- Ignore everything = student in crisis without support, school legally exposed
- Manual review by a human moderator = doesn't scale (Studeia handles >10K turns/day)
Solution: automated classification via AI + graduated actions + escape hatch for crises.
Architecture: SupervisorAgent
Student sends a message
↓
Tutor responds via SSE streaming (student sees response immediately)
↓ (after())
SupervisorAgent.run({
  userId, tenantId, courseId,
  messages: last 4-6 messages,
  isMinor: user.isMinor,
  courseContext: { title, description } // contextual whitelist
})
↓
LLM (Haiku) classifies:
{
  severity: "low" | "medium" | "high" | "critical" | "safety",
  categories: string[], // 0+ of 8 categories
  reasoning: string, // why it was classified this way
  context_appropriate: boolean // validated against courseContext
}
↓
decideAction({ severity, categories, recentStrikes, isMinor, isSafety })
↓
Action taken:
- none (not recorded — this is OK behavior)
- warn (in-app notification: "hey, let's focus on the course")
- register + strike (incident created, +1 strike, monitoring)
- quarantine 48h (severity high, or a 3rd strike within 7 days)
- quarantine 7 days (severity critical, more aggressive default)
- safety_cooldown + admin alert (severity safety, special handling)
5 Levels x 8 Categories
Severity levels
- low — mild inappropriate language ("damn", occasional off-topic)
- medium — persistent off-topic, low-grade profanity, obvious jailbreak attempts
- high — descriptive violence, explicit sexual content, illegal activities requested
- critical — direct threats to others, extreme content (terrorism, exploitation)
- safety — self-harm, suicidal ideation, signs of mental health crisis
Categories
- inappropriate_language
- violence
- illegal
- sexual
- off_topic (persistent)
- harassment
- self_harm (special — always severity=safety)
- jailbreak_attempt
A single turn can have MULTIPLE categories (e.g., jailbreak + violence = 2 tags).
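Because the classifier output comes from an LLM, it is worth validating before it reaches the decision logic. The following is a minimal sketch, not the production code: `normalizeClassification` is a hypothetical helper, and dropping unknown or duplicate tags (rather than rejecting the whole result) is an assumption. It also enforces the rule above that self_harm always means severity=safety.

```typescript
type Severity = "low" | "medium" | "high" | "critical" | "safety";

const SEVERITIES: readonly Severity[] = ["low", "medium", "high", "critical", "safety"];

const CATEGORIES = [
  "inappropriate_language",
  "violence",
  "illegal",
  "sexual",
  "off_topic",
  "harassment",
  "self_harm",
  "jailbreak_attempt",
] as const;
type Category = (typeof CATEGORIES)[number];

interface Classification {
  severity: Severity;
  categories: Category[];
  reasoning: string;
  context_appropriate: boolean;
}

// Hypothetical helper: validates untrusted LLM output and returns null
// when the payload does not match the expected shape.
function normalizeClassification(raw: unknown): Classification | null {
  if (typeof raw !== "object" || raw === null) return null;
  const r = raw as Record<string, unknown>;
  if (typeof r.severity !== "string") return null;
  if (SEVERITIES.indexOf(r.severity as Severity) === -1) return null;
  if (!Array.isArray(r.categories)) return null;

  // Assumption: unknown tags and duplicates are dropped, not treated as errors.
  const allowed = CATEGORIES as readonly string[];
  const categories = r.categories.filter(
    (c, i, all): c is Category =>
      typeof c === "string" && allowed.indexOf(c) !== -1 && all.indexOf(c) === i,
  );

  // self_harm always forces severity "safety", whatever the model returned.
  const severity: Severity =
    categories.indexOf("self_harm") !== -1 ? "safety" : (r.severity as Severity);

  return {
    severity,
    categories,
    reasoning: typeof r.reasoning === "string" ? r.reasoning : "",
    context_appropriate: r.context_appropriate === true,
  };
}
```

Forcing the severity in code rather than trusting the prompt means a model that tags self_harm but grades it "low" still lands in the safety path.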
Action Decision — State Machine
type Severity = "low" | "medium" | "high" | "critical" | "safety";

interface DecideActionInput {
  severity: Severity;
  categories: string[];
  recentStrikes: number; // strikes in the last 7 days
  isMinor: boolean;
  isSafety: boolean;
}

function decideAction(input: DecideActionInput) {
  const { severity, categories, recentStrikes, isMinor, isSafety } = input;

  // PRIORITY 1: Safety (self-harm)
  if (isSafety) {
    return {
      action: "safety_cooldown",
      durationHours: SAFETY_COOLDOWN_HOURS, // default 24h
      adminNotification: "URGENT",
      countedAsStrike: false, // NEVER a strike for safety
      tutorMessage: SUPPORT_MESSAGE_TEMPLATE, // message + crisis resources
    };
  }

  // PRIORITY 2: Critical = always quarantine
  if (severity === "critical") {
    return {
      action: "quarantine",
      durationHours: 168, // 7 days
      countedAsStrike: true,
      adminNotification: "high",
    };
  }

  // PRIORITY 3: High = immediate 48h quarantine
  if (severity === "high") {
    return {
      action: "quarantine",
      durationHours: 48,
      countedAsStrike: true,
      adminNotification: "medium",
    };
  }

  // PRIORITY 4: Accumulated strikes (LOW/MEDIUM)
  if (severity === "low" || severity === "medium") {
    if (recentStrikes >= 2) {
      // 3rd strike in 7 days = quarantine
      return {
        action: "quarantine",
        durationHours: 48,
        countedAsStrike: true,
        adminNotification: "medium",
      };
    }
    return {
      action: severity === "low" ? "warn" : "register",
      countedAsStrike: true,
      adminNotification: severity === "medium" ? "low" : "none",
    };
  }

  // Default: none
  return { action: "none", countedAsStrike: false };
}
Absolute determinism. Same inputs = same action. No LLM deciding punishment.
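The function takes recentStrikes as an input, but how it is computed is not shown here. A plausible sketch, assuming a rolling 7-day window over stored incidents (the `StrikeRecord` shape and `countRecentStrikes` name are hypothetical):

```typescript
// Hypothetical incident shape: only the fields needed for strike counting.
interface StrikeRecord {
  createdAt: Date;
  countedAsStrike: boolean;
}

const STRIKE_WINDOW_DAYS = 7; // from "3 strikes in 7 days"
const MS_PER_DAY = 24 * 60 * 60 * 1000;

// Counts strikes inside the rolling window that ends at `now`.
function countRecentStrikes(incidents: StrikeRecord[], now: Date): number {
  const windowEnd = now.getTime();
  const windowStart = windowEnd - STRIKE_WINDOW_DAYS * MS_PER_DAY;
  return incidents.filter((incident) => {
    const t = incident.createdAt.getTime();
    return incident.countedAsStrike && t >= windowStart && t <= windowEnd;
  }).length;
}
```

Filtering on countedAsStrike means an incident whose strike flag is later cleared drops out of the count without any extra bookkeeping.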
Self-Harm: Special Handling
Following a 2026-05-23 audit, we completely reworked safety handling. The previous state had 2 critical bugs:
Bug 1: Safety incidents were created with status="auto_resolved", on the assumption that the support message alone was sufficient. In reality, many cases needed human review, and admins never saw these incidents.
Fix: Safety incidents are now created with status="open" (so they land in the admin's inbox) + 24h Redis cooldown + immediate URGENT email.
Bug 2: The cooldown was created BEFORE the tutor's stream finished. A student in crisis would see an incomplete tutor message + a "you are in cooldown" screen. Terrible timing.
Fix: The tutor's stream completes normally. After it ends, the supervisor classifies in the background. If the result is safety, the compassionate support message takes over on the student's NEXT message, never in the middle of the current one.
Current support message (en-US):
"I'm here with you. If you're going through a difficult time, please reach out for help:
- 988 Suicide & Crisis Lifeline — call or text 988 (24/7, free, confidential)
- Crisis Text Line — text HOME to 741741
- 988lifeline.org — online chat
You are not alone."
Displayed prominently (red border + Heart icon), NOT as a subtle notification.
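The timing fix for Bug 2 boils down to "write the cooldown after the stream, read it at the start of the next turn". A minimal sketch of that ordering follows; it uses an in-memory map as a stand-in for the Redis key, and `SafetyCooldownStore` and `routeNextTurn` are hypothetical names, not the actual implementation.

```typescript
const SAFETY_COOLDOWN_HOURS = 24; // default mentioned above
const MS_PER_HOUR = 60 * 60 * 1000;

// In-memory stand-in for the Redis cooldown (assumption: production
// would use a key with a TTL instead of a Map).
class SafetyCooldownStore {
  private expiresAt = new Map<string, number>();

  // Called by the background supervisor AFTER the tutor stream has finished.
  start(userId: string, nowMs: number): void {
    this.expiresAt.set(userId, nowMs + SAFETY_COOLDOWN_HOURS * MS_PER_HOUR);
  }

  isActive(userId: string, nowMs: number): boolean {
    const expiry = this.expiresAt.get(userId);
    if (expiry === undefined) return false;
    if (nowMs >= expiry) {
      this.expiresAt.delete(userId); // expired: clean up lazily
      return false;
    }
    return true;
  }
}

type NextTurn = { kind: "tutor" } | { kind: "support_message" };

// Checked at the START of a turn, so a reply in progress is never cut off.
function routeNextTurn(store: SafetyCooldownStore, userId: string, nowMs: number): NextTurn {
  return store.isActive(userId, nowMs) ? { kind: "support_message" } : { kind: "tutor" };
}
```

The current reply is never touched: the cooldown only changes what the student gets on the following turn.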
The URGENT email to the institutional admin contains:
- Student's name (PII protected in the URL, requires admin login to access details)
- Minimal context excerpt (the triggering message + 2 prior messages, redacted)
- Direct link to the incident detail page
- Resources for the admin (conversation script, local emergency contacts)
- Reminder: this is NOT a disciplinary incident. The student needs human support.
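For the context excerpt, a sketch of how "the triggering message + 2 prior messages, redacted" could be assembled. The redaction rules here (masking email addresses and long digit runs) are purely illustrative assumptions, as are the `ChatMessage` shape and the function names; the real redaction policy is not described in this post.

```typescript
// Hypothetical message shape.
interface ChatMessage {
  role: "student" | "tutor";
  text: string;
}

const PRIOR_MESSAGES = 2; // triggering message + 2 prior messages

// Assumed redaction rules: mask emails and digit runs of 7 or more.
function redact(text: string): string {
  return text
    .replace(/[^\s@]+@[^\s@]+\.[^\s@]+/g, "[email]")
    .replace(/\d{7,}/g, "[number]");
}

// Returns the triggering message plus up to two messages before it.
function buildAlertExcerpt(messages: ChatMessage[], triggerIndex: number): ChatMessage[] {
  const start = Math.max(0, triggerIndex - PRIOR_MESSAGES);
  return messages
    .slice(start, triggerIndex + 1)
    .map((m) => ({ role: m.role, text: redact(m.text) }));
}
```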
Contextual Whitelist
False positives in specialized courses were common:
- Pharmacology course: "overdose" triggered an alert
- Anatomy course: "genitalia" triggered an alert
- Psychology course: academic discussion of depression triggered an alert
- Security course: "exploit", "vulnerability" triggered alerts
Solution: SupervisorAgent receives courseContext: { title, description } and uses a contextual whitelist.
The supervisor's system prompt includes:
"The context of this turn is: course '${courseContext.title}'. Description: '${courseContext.description}'.
Before classifying something as inappropriate, consider whether the term is legitimate in this academic context. For example: 'overdose' in a pharmacology course is a legitimate medical term — do NOT flag it."
~70% reduction in false positives after implementation.
Edge cases (an entire course on a sensitive topic): a global admin can disable the supervisor for that course via Course.supervisorEnabled = false.
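Put together, the course context and the per-course switch might be wired up as in the sketch below. `buildSupervisorContextPrompt` is a hypothetical name, and stripping quotes and newlines from course metadata is an extra precaution added for this example, not something the post describes.

```typescript
interface CourseContext {
  title: string;
  description: string;
}

interface Course extends CourseContext {
  supervisorEnabled: boolean;
}

// Assumption: quotes and line breaks are flattened so course metadata
// cannot break out of the quoted slots in the prompt.
function cleanForPrompt(value: string): string {
  return value.replace(/['"\r\n]+/g, " ").trim();
}

// Returns null when supervision is disabled for the course.
function buildSupervisorContextPrompt(course: Course): string | null {
  if (!course.supervisorEnabled) return null;
  const title = cleanForPrompt(course.title);
  const description = cleanForPrompt(course.description);
  return [
    `The context of this turn is: course '${title}'. Description: '${description}'.`,
    "Before classifying something as inappropriate, consider whether the term is legitimate in this academic context.",
  ].join("\n");
}
```

Returning null for a disabled course lets the caller skip the LLM call entirely instead of classifying and then discarding the result.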
Student Appeal
A quarantined student sees the QuarantineNotice component (web + mobile):
- Explains the reason (severity + category, without exposing the supervisor's internal reasoning)
- Countdown until expiration
- Appeal form: max 500 chars, 1 per quarantine
- Submission notifies the institutional admin + creates an appealText on the incident
Admin options: acknowledge (I am aware), dismiss (immediately lifts quarantine, flips countedAsStrike=false), resolve (maintains quarantine, marks as resolved).
Appeals are audited in AdminAuditLog. Transparent by default.
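The appeal rules are simple enough to express as two pure functions. This is a sketch under assumptions: the `Incident` shape, the status names, and the `quarantineActive` flag are hypothetical; only the 500-character limit, the one-appeal rule, and the effect of each admin option come from the description above.

```typescript
type AdminDecision = "acknowledge" | "dismiss" | "resolve";

// Hypothetical incident shape for illustration.
interface Incident {
  status: "open" | "acknowledged" | "dismissed" | "resolved";
  countedAsStrike: boolean;
  quarantineActive: boolean;
  appealText?: string;
}

const MAX_APPEAL_CHARS = 500;

// One appeal per quarantine, max 500 characters.
function submitAppeal(incident: Incident, text: string): Incident {
  if (incident.appealText !== undefined) {
    throw new Error("An appeal was already submitted for this quarantine");
  }
  const trimmed = text.trim();
  if (trimmed.length === 0 || trimmed.length > MAX_APPEAL_CHARS) {
    throw new Error("Appeal must be between 1 and 500 characters");
  }
  return { ...incident, appealText: trimmed };
}

function applyAdminDecision(incident: Incident, decision: AdminDecision): Incident {
  switch (decision) {
    case "acknowledge":
      // Admin is aware; nothing else changes.
      return { ...incident, status: "acknowledged" };
    case "dismiss":
      // Lifts the quarantine immediately and removes the strike.
      return { ...incident, status: "dismissed", quarantineActive: false, countedAsStrike: false };
    case "resolve":
      // Quarantine stays in place; incident is closed.
      return { ...incident, status: "resolved" };
  }
}
```

Both functions return a new object instead of mutating, which makes it straightforward to log the before and after states to an audit trail.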
Honest Trade-offs
What did NOT work:
- We tried pre-stream moderation (supervisor decided BEFORE the tutor responded). Latency +800ms for the legitimate student. Removed: the supervisor now runs after the stream in the background.
- We tried a per-user rate limit that disabled the supervisor after N calls/hour (anti-abuse for admin spam). Bug: a legitimate student with a long session would end up unsupervised. Fix: rate limiting only throttles ADMIN NOTIFICATIONS (anti-inbox flood), never the analysis itself.
- We tried a single LLM for classification + reasoning + action. Reasoning came out inconsistent, and action decisions became role-playing. We separated them: LLM classifies (severity + categories + reasoning), a deterministic TypeScript function decides the action based on rules.
- We tried showing the supervisor's reasoning to students. Students learned to evade it ("the LLM said it will flag if I write X, let me try Y"). Adversarial. Removed. Students only see a standard message per category.
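The rate-limit lesson above (throttle notifications, never analysis) can be sketched as follows. The class and function names, the sliding one-hour window, and the per-user keying are assumptions made for illustration.

```typescript
const HOUR_MS = 60 * 60 * 1000;

// Sliding-window throttle applied ONLY to admin notifications.
class NotificationThrottle {
  private sent = new Map<string, number[]>();
  private maxPerHour: number;

  constructor(maxPerHour: number) {
    this.maxPerHour = maxPerHour;
  }

  // Returns true and records the send if the user is under the limit.
  allow(userId: string, nowMs: number): boolean {
    const cutoff = nowMs - HOUR_MS;
    const recent = (this.sent.get(userId) ?? []).filter((t) => t > cutoff);
    const ok = recent.length < this.maxPerHour;
    if (ok) recent.push(nowMs);
    this.sent.set(userId, recent);
    return ok;
  }
}

interface TurnOutcome {
  analyzed: true; // analysis is never skipped
  adminNotified: boolean;
}

// By the time this runs, classification has already happened unconditionally;
// the throttle only decides whether the admin gets another notification.
function handleFlaggedTurn(throttle: NotificationThrottle, userId: string, nowMs: number): TurnOutcome {
  return { analyzed: true, adminNotified: throttle.allow(userId, nowMs) };
}
```

The key property is structural: the throttle is consulted after classification, so there is no code path where hitting the limit can leave a student unsupervised.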
Production Numbers
After 6 months:
- ~150K turns moderated
- 0.3% trigger ANY action (99.7% are normal teaching)
- 47 safety incidents detected → 41 confirmed (87% precision)
- 0 false negatives reported (no known case of a student in crisis going undetected)
- 12 quarantines executed (8 expired, 4 dismissed via appeal)
- 0 incidents forgotten (daily cron reminds admin of incidents open >24h)
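The "0 incidents forgotten" number depends on the daily reminder job. Its selection logic could look like the sketch below; the `TrackedIncident` shape and `findStaleIncidents` name are hypothetical, and the scheduling and email sending are left out.

```typescript
// Hypothetical incident shape for the reminder job.
interface TrackedIncident {
  id: string;
  status: "open" | "acknowledged" | "dismissed" | "resolved";
  createdAt: Date;
}

const REMINDER_THRESHOLD_HOURS = 24; // "incidents open >24h"

// Selects incidents that are still open and older than the threshold.
function findStaleIncidents(incidents: TrackedIncident[], now: Date): TrackedIncident[] {
  const cutoff = now.getTime() - REMINDER_THRESHOLD_HOURS * 60 * 60 * 1000;
  return incidents.filter(
    (incident) => incident.status === "open" && incident.createdAt.getTime() < cutoff,
  );
}
```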
What About Disciplinary Impact?
Fair question: aren't we just outsourcing moderation to an LLM?
Answer: NO. SupervisorAgent detects + grades + notifies. The final disciplinary decision always remains with a human (the institutional admin). Student appeals and auditing via AdminAuditLog ensure accountability.
The LLM is a tool. The educator or coordinator always owns the final decision.