Safety Supervisor: AI moderation of tutor chat

Philosophy

Background moderation, not gatekeeping: the supervisor analyzes each exchange after the response is sent and never blocks the stream.
Self-harm is a crisis, not an infraction: never punish a suffering student.
Cascading configuration: an admin can disable the supervisor per tenant or per course when the context requires it.
Complete audit: every incident, status transition, quarantine, and appeal is logged in AdminAuditLog.

Severity rules

Severity	Typical category	1st infraction action	2nd+ action
low	mild inappropriate language	warn	strike +1; 3 strikes = 48h quarantine
medium	persistent off-topic, jailbreak	warn + register	strike +1; 3 strikes = 48h quarantine
high	violence, sexual, illegal	48h quarantine	7-day quarantine
critical	threats, extreme content	7-day quarantine	indefinite + admin review
safety	self_harm	NEVER quarantine: 24h cooldown + supportive message + URGENT admin alert	same
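
The table above can be read as a pure decision function. The following is a minimal sketch, not the actual implementation: the type names, the `InfractionHistory` shape, and the `decideAction` helper are hypothetical. It also assumes that the first low/medium infraction does not count as a strike and that the 48h quarantine fires once the strike count reaches 3; the source does not say whether strikes reset afterwards.

```typescript
type Severity = "low" | "medium" | "high" | "critical" | "safety";

// Hypothetical shapes, introduced only for this illustration.
interface InfractionHistory {
  priorInfractions: number; // infractions recorded before this one
  strikes: number;          // strikes accumulated so far
}

type Action =
  | { kind: "warn"; register: boolean }
  | { kind: "strike"; strikes: number; quarantineHours: number | null }
  | { kind: "quarantine"; hours: number | null; adminReview: boolean }
  | { kind: "cooldown"; hours: number; supportMessage: true; urgentAdminAlert: true };

function decideAction(severity: Severity, history: InfractionHistory): Action {
  const first = history.priorInfractions === 0;

  // Self-harm is handled as a crisis: never a quarantine, same action every time.
  if (severity === "safety") {
    return { kind: "cooldown", hours: 24, supportMessage: true, urgentAdminAlert: true };
  }

  if (severity === "low" || severity === "medium") {
    // First infraction: warn (medium also registers the incident).
    if (first) return { kind: "warn", register: severity === "medium" };
    // Assumption: each later infraction adds one strike; 3 strikes = 48h quarantine.
    const strikes = history.strikes + 1;
    return { kind: "strike", strikes, quarantineHours: strikes >= 3 ? 48 : null };
  }

  if (severity === "high") {
    return { kind: "quarantine", hours: first ? 48 : 7 * 24, adminReview: false };
  }

  // critical: 7 days first, then indefinite (hours = null) plus admin review.
  return first
    ? { kind: "quarantine", hours: 7 * 24, adminReview: false }
    : { kind: "quarantine", hours: null, adminReview: true };
}
```

Keeping the rule table in one side-effect-free function makes it easy to unit-test each row independently of storage or notification code.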

Quarantine appeal

A quarantined student sees the QuarantineNotice component with a countdown and an appeal form (max 500 characters, one appeal per quarantine). An admin can then acknowledge, dismiss (releases the quarantine), resolve, or ignore (the quarantine expires on its own).
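
The appeal constraints can be sketched as a small validator. This is an illustration under assumptions: `validateAppeal`, `releasesQuarantine`, and the result shape are hypothetical names, rejecting empty appeals is an added assumption, and length is measured with JavaScript's `string.length` (UTF-16 code units), which may differ from how the real form counts characters.

```typescript
const MAX_APPEAL_CHARS = 500;

type AppealResult =
  | { ok: true; text: string }
  | { ok: false; reason: "empty" | "too_long" | "already_appealed" };

// appealsForThisQuarantine: how many appeals this quarantine already has.
function validateAppeal(text: string, appealsForThisQuarantine: number): AppealResult {
  if (appealsForThisQuarantine >= 1) return { ok: false, reason: "already_appealed" };
  const trimmed = text.trim();
  if (trimmed.length === 0) return { ok: false, reason: "empty" }; // assumption, not in the source
  if (trimmed.length > MAX_APPEAL_CHARS) return { ok: false, reason: "too_long" };
  return { ok: true, text: trimmed };
}

type AdminAction = "acknowledge" | "dismiss" | "resolve" | "ignore";

// Per the description, only "dismiss" lifts the quarantine early;
// with "ignore" the quarantine simply runs until it expires.
function releasesQuarantine(action: AdminAction): boolean {
  return action === "dismiss";
}
```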

Configuration

Enablement cascade

Course.supervisorEnabled (null = inherit)
  ↓
Tenant.supervisorEnabled (null = inherit)
  ↓
default = true for B2B
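
A minimal sketch of the cascade above, using nullish coalescing so that `null` means "inherit" while an explicit `false` is respected. The `resolveSupervisorEnabled` helper and the `Flagged` shape are hypothetical; the default for non-B2B tenants is not stated here, so it is left as a parameter.

```typescript
interface Flagged {
  supervisorEnabled: boolean | null; // null = inherit from the next level
}

function resolveSupervisorEnabled(
  course: Flagged,
  tenant: Flagged,
  defaultEnabled: boolean = true, // documented default for B2B
): boolean {
  // ?? falls through only on null/undefined, so an explicit false wins.
  return course.supervisorEnabled ?? tenant.supervisorEnabled ?? defaultEnabled;
}
```

Using `??` rather than `||` matters here: with `||`, a course that explicitly sets `false` would incorrectly inherit the tenant's value.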

Versioned Redis cache: supervisor-flag-version:{tenantId}. Every mutation calls bumpSupervisorFlagVersion(tenantId).
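
One common way to build a versioned cache is to embed the version counter in every cached key, so that bumping the counter makes old entries unreachable. The sketch below assumes that pattern; it may not match the real implementation. An in-memory `MemoryKV` class stands in for Redis (a real client would be asynchronous), the `supervisor-flag:...` key layout and `getSupervisorEnabled` are hypothetical, and only the `supervisor-flag-version:{tenantId}` key and the `bumpSupervisorFlagVersion` name come from the text above.

```typescript
// In-memory stand-in for Redis GET/SET/INCR, for illustration only.
class MemoryKV {
  private data = new Map<string, string>();
  get(key: string): string | undefined {
    return this.data.get(key);
  }
  set(key: string, value: string): void {
    this.data.set(key, value);
  }
  incr(key: string): number {
    const next = Number(this.data.get(key) ?? "0") + 1;
    this.data.set(key, String(next));
    return next;
  }
}

const kv = new MemoryKV();
const versionKey = (tenantId: string): string => `supervisor-flag-version:${tenantId}`;

// Assumed behavior: increment the tenant's version counter on every mutation.
function bumpSupervisorFlagVersion(tenantId: string): number {
  return kv.incr(versionKey(tenantId));
}

// Hypothetical cache key that embeds the current version.
function flagCacheKey(tenantId: string, courseId: string): string {
  const version = kv.get(versionKey(tenantId)) ?? "0";
  return `supervisor-flag:${tenantId}:${courseId}:v${version}`;
}

// Read-through lookup: `load` resolves the flag from the database on a miss.
function getSupervisorEnabled(tenantId: string, courseId: string, load: () => boolean): boolean {
  const key = flagCacheKey(tenantId, courseId);
  const hit = kv.get(key);
  if (hit !== undefined) return hit === "1";
  const value = load();
  kv.set(key, value ? "1" : "0");
  return value;
}
```

With this design no explicit deletion is needed on mutation; in a real Redis deployment the orphaned entries would typically carry a TTL so they age out.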

Editing restricted to global admins

PATCH /api/admin/tenants/[id]/supervisor — toggle per tenant
PATCH /api/admin/courses/[id]/supervisor — toggle per course
Both require the global role === "admin", and every call is audited
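
The guard-then-audit flow for the tenant endpoint could look roughly like the framework-free sketch below. Everything here is an assumption for illustration: the `AdminSession` and `AuditEntry` shapes, the `{ supervisorEnabled: boolean | null }` request body, the status codes, and the `patchTenantSupervisor` helper. The array passed as `auditLog` stands in for AdminAuditLog.

```typescript
interface AdminSession {
  userId: string;
  role: string;
}

interface AuditEntry {
  actorId: string;
  action: string;
  targetId: string;
  value: boolean | null;
}

type PatchResult =
  | { status: 200; supervisorEnabled: boolean | null }
  | { status: 400 | 403; error: string };

function patchTenantSupervisor(
  session: AdminSession | null,
  tenantId: string,
  body: unknown,
  auditLog: AuditEntry[],
): PatchResult {
  // Only the global admin role may change the flag.
  if (!session || session.role !== "admin") {
    return { status: 403, error: "global admin role required" };
  }

  // Accept true, false, or null (null = inherit); reject everything else.
  const raw = (body as { supervisorEnabled?: unknown } | null)?.supervisorEnabled;
  if (!(raw === null || typeof raw === "boolean")) {
    return { status: 400, error: "supervisorEnabled must be boolean or null" };
  }

  // Persisting the flag and bumping the cache version would happen here.
  auditLog.push({
    actorId: session.userId,
    action: "tenant.supervisor.update",
    targetId: tenantId,
    value: raw,
  });
  return { status: 200, supervisorEnabled: raw };
}
```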

Known limitations

False positives in medical/pharmacology contexts: course context is sent to the supervisor so it can whitelist legitimate topics, but misclassifications still occur. Workaround: disable the supervisor for the affected courses.
Language: the supervisor prompt is localized (4 languages), but classification quality may vary slightly between PT-BR and EN-US.
Sophisticated jailbreaks: very elaborate prompt-injection attacks may slip through. Mitigation: defense in depth (multiple layers).
Privacy vs. safety tradeoff: messagesSnapshot contains PII. It is retained for at most 2 years, and global admins view it through an audited UI.
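
The 2-year retention limit can be expressed as a simple deadline check. This is a sketch under assumptions: `retentionDeadline` and `mustPurge` are hypothetical helpers, and "2 years" is interpreted as two calendar years in UTC from the snapshot's creation time.

```typescript
const RETENTION_YEARS = 2;

// Returns the instant after which the snapshot may no longer be kept.
function retentionDeadline(createdAt: Date): Date {
  const deadline = new Date(createdAt.getTime()); // copy, do not mutate the input
  deadline.setUTCFullYear(deadline.getUTCFullYear() + RETENTION_YEARS);
  return deadline;
}

function mustPurge(createdAt: Date, now: Date): boolean {
  return now.getTime() >= retentionDeadline(createdAt).getTime();
}
```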