Philosophy
- Background moderation, not gatekeeping — supervisor analyzes after response, doesn't block stream.
- Self-harm as crisis, not infraction — never punish a suffering student.
- Configuration cascading — admin can disable per tenant or course when context requires.
- Complete audit — every incident, status transition, quarantine, appeal logged in AdminAuditLog.
Severity rules
| Severity | Typical category | 1st infraction action | 2nd+ action |
|---|---|---|---|
| low | mild inappropriate language | warn | strike +1; 3 strikes = 48h quarantine |
| medium | persistent off-topic, jailbreak | warn + register | strike +1; 3 strikes = 48h quarantine |
| high | violence, sexual, illegal | 48h quarantine | 7-day quarantine |
| critical | threats, extreme content | 7-day quarantine | indefinite + admin review |
| safety | self_harm | NEVER quarantine: 24h cooldown + welcoming (supportive) message + URGENT admin alert | same |
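The table above can be read as a small decision function. The following is a minimal sketch, not the actual implementation: the names `decideAction`, `priorInfractions`, and `currentStrikes` are hypothetical, and it assumes strikes accumulate across low and medium incidents and that `hours: null` stands for an indefinite quarantine.

```typescript
type Severity = "low" | "medium" | "high" | "critical" | "safety";

type Action =
  | { kind: "warn"; register: boolean }
  | { kind: "strike"; strikes: number; quarantineHours: number | null }
  | { kind: "quarantine"; hours: number | null; adminReview: boolean }
  | { kind: "cooldown"; hours: number; welcomingMessage: true; notifyAdminUrgent: true };

// priorInfractions: earlier incidents for this student (hypothetical counter).
// currentStrikes: strikes held before this incident (hypothetical counter).
function decideAction(
  severity: Severity,
  priorInfractions: number,
  currentStrikes: number
): Action {
  switch (severity) {
    case "safety":
      // Self-harm is treated as a crisis: never quarantine, same action every time.
      return { kind: "cooldown", hours: 24, welcomingMessage: true, notifyAdminUrgent: true };
    case "low":
    case "medium": {
      if (priorInfractions === 0) {
        // First infraction: warn; medium additionally registers the incident.
        return { kind: "warn", register: severity === "medium" };
      }
      const strikes = currentStrikes + 1;
      // Third strike triggers a 48h quarantine.
      return { kind: "strike", strikes, quarantineHours: strikes >= 3 ? 48 : null };
    }
    case "high":
      // 48h on the first infraction, 7 days afterwards.
      return { kind: "quarantine", hours: priorInfractions === 0 ? 48 : 24 * 7, adminReview: false };
    case "critical":
      return priorInfractions === 0
        ? { kind: "quarantine", hours: 24 * 7, adminReview: false }
        : { kind: "quarantine", hours: null, adminReview: true }; // indefinite + admin review
  }
}
```

Keeping the safety branch first makes it hard to accidentally route a self_harm incident into a quarantine path when the other rules change.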
Quarantine appeal
A student in quarantine sees the QuarantineNotice component with a countdown and an appeal form (max 500 chars, 1 appeal per quarantine). An admin can acknowledge, dismiss (which releases the quarantine), resolve, or ignore the appeal (the quarantine then auto-expires).
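The two appeal limits (500 characters, one appeal per quarantine) could be enforced server-side roughly as follows. This is a sketch under assumptions: the `Quarantine` shape, the `appealSubmitted` flag, and `validateAppeal` are hypothetical names, and length is counted in UTF-16 code units, as JavaScript's `length` does.

```typescript
// Hypothetical shape of a quarantine record; only the fields this sketch needs.
interface Quarantine {
  id: string;
  expiresAt: Date | null; // null = indefinite
  appealSubmitted: boolean;
}

type AppealResult = { ok: true; text: string } | { ok: false; reason: string };

const MAX_APPEAL_CHARS = 500;

function validateAppeal(q: Quarantine, text: string, now: Date = new Date()): AppealResult {
  // An expired quarantine has nothing left to appeal.
  if (q.expiresAt !== null && q.expiresAt.getTime() <= now.getTime()) {
    return { ok: false, reason: "quarantine already expired" };
  }
  // Only one appeal is allowed per quarantine.
  if (q.appealSubmitted) {
    return { ok: false, reason: "an appeal was already submitted for this quarantine" };
  }
  const trimmed = text.trim();
  if (trimmed.length === 0) {
    return { ok: false, reason: "appeal text is empty" };
  }
  if (trimmed.length > MAX_APPEAL_CHARS) {
    return { ok: false, reason: "appeal exceeds 500 characters" };
  }
  return { ok: true, text: trimmed };
}
```

Validating on the server as well as in the form keeps the limit meaningful even if the client-side check is bypassed.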
Configuration
Enablement cascade
Course.supervisorEnabled (null = inherit)
↓
Tenant.supervisorEnabled (null = inherit)
↓
default = true for B2B
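The cascade above resolves to the first non-null value, checking the course first and the tenant second. A minimal sketch, assuming both flags are stored as `boolean | null`; the function name `resolveSupervisorEnabled` and the `defaultEnabled` parameter are illustrative, and only the B2B default (`true`) comes from this document.

```typescript
type Flag = boolean | null; // null = inherit from the next level

function resolveSupervisorEnabled(
  courseFlag: Flag,
  tenantFlag: Flag,
  defaultEnabled: boolean = true // B2B default; other defaults are not specified here
): boolean {
  if (courseFlag !== null) return courseFlag; // most specific setting wins
  if (tenantFlag !== null) return tenantFlag;
  return defaultEnabled;
}
```

For example, a pharmacology course with `courseFlag = false` stays disabled even when its tenant has the supervisor explicitly enabled.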
Versioned Redis cache: supervisor-flag-version:{tenantId}. Every mutation calls bumpSupervisorFlagVersion(tenantId).
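Versioned caching of this kind typically embeds the current version number in each cache key, so bumping the version makes every older entry unreachable without deleting it. The sketch below illustrates the idea with an in-memory stand-in for Redis so it runs on its own. Only the `supervisor-flag-version:{tenantId}` key and the `bumpSupervisorFlagVersion` name come from this document; `FakeRedis`, `getCachedFlag`, and the `supervisor-flag:...` key layout are assumptions, and a real Redis client would be asynchronous.

```typescript
// In-memory stand-in for Redis (hypothetical); real clients return promises.
class FakeRedis {
  private store: { [key: string]: string } = {};
  get(key: string): string | null {
    return Object.prototype.hasOwnProperty.call(this.store, key) ? this.store[key] : null;
  }
  set(key: string, value: string): void {
    this.store[key] = value;
  }
  incr(key: string): number {
    const current = this.get(key);
    const next = (current === null ? 0 : Number(current)) + 1;
    this.store[key] = String(next);
    return next;
  }
}

const flagCache = new FakeRedis();

function versionKey(tenantId: string): string {
  return `supervisor-flag-version:${tenantId}`;
}

// Called on every mutation of a tenant or course flag.
function bumpSupervisorFlagVersion(tenantId: string): number {
  return flagCache.incr(versionKey(tenantId));
}

// load() stands for the database lookup that resolves the cascade.
function getCachedFlag(tenantId: string, courseId: string, load: () => boolean): boolean {
  const version = flagCache.get(versionKey(tenantId));
  // Assumed key layout: the version is part of the key, so a bump orphans old entries.
  const key = `supervisor-flag:${tenantId}:v${version === null ? "0" : version}:${courseId}`;
  const cached = flagCache.get(key);
  if (cached !== null) return cached === "1";
  const fresh = load();
  flagCache.set(key, fresh ? "1" : "0");
  return fresh;
}
```

In a real deployment the orphaned entries would need a TTL so they eventually expire, since nothing deletes them explicitly.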
Only global admin edits
- PATCH /api/admin/tenants/[id]/supervisor: toggle per tenant
- PATCH /api/admin/courses/[id]/supervisor: toggle per course
- Both require role === "admin" (global) and are audited
Known limitations
- False positives in medical/pharmacology contexts: course context is sent to the supervisor so it can whitelist legitimate subject matter, but misclassifications can still occur. Workaround: disable the supervisor for the affected courses.
- Language: the supervisor prompt is localized (4 languages), but classification quality may vary slightly between PT-BR and EN-US.
- Sophisticated jailbreaks: very elaborate prompt injection attacks may go undetected. Mitigation: layered defenses.
- Privacy vs safety tradeoff: messagesSnapshot contains PII. It is retained for at most 2 years, and global admins view it through an audited UI.