Studeia Docs

Safety Supervisor: AI moderation of tutor chat

An AI moderation agent runs after each tutor turn and classifies it into 5 severity levels and 8 categories. 3 strikes = 48h quarantine. Self-harm triggers a welcoming response, crisis resources, and an URGENT admin alert.

2026-06-03 8 min
Short answer

The Studeia AI Supervisor Agent moderates the AI tutor chat in the background after each turn, using Claude Haiku (~$0.001/turn). It classifies each turn into one of 5 severity levels (low/medium/high/critical/safety) and 8 categories (inappropriate language, violence, illegal, sexual, off-topic, harassment, self_harm, jailbreak_attempt). 3 strikes in 7 days = 48h quarantine. Self-harm (severity=safety) is NEVER punished: the student sees a welcoming message and crisis resources, and an URGENT alert goes to the admin.

Philosophy

  1. Background moderation, not gatekeeping: the supervisor analyzes each turn after the tutor responds and does not block the stream.
  2. Self-harm is a crisis, not an infraction: never punish a suffering student.
  3. Cascading configuration: an admin can disable the supervisor per tenant or per course when the context requires it.
  4. Complete audit: every incident, status transition, quarantine, and appeal is logged in AdminAuditLog.

Severity rules

| Severity | Typical category | 1st infraction action | 2nd+ action |
|---|---|---|---|
| low | mild inappropriate language | warn | strike +1; 3 strikes = 48h quarantine |
| medium | persistent off-topic, jailbreak | warn + register | strike +1; 3 strikes = 48h quarantine |
| high | violence, sexual, illegal | 48h quarantine | 7-day quarantine |
| critical | threats, extreme content | 7-day quarantine | indefinite + admin review |
| safety | self_harm | NEVER quarantine: 24h cooldown + welcoming message + URGENT admin alert | same |
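
The severity rules can be read as a small decision function. The sketch below is illustrative only: `decideAction`, `strikesInWindow`, the `Action` shape, and the `isFirstInfraction` / `priorStrikes` inputs are hypothetical names, not Studeia's actual code. It assumes that a first low or medium infraction only warns, and that each later one adds a strike counted over the 7-day window.

```typescript
type Severity = "low" | "medium" | "high" | "critical" | "safety";

// Hypothetical result shape; `hours: null` stands for an indefinite quarantine.
type Action =
  | { kind: "warn"; registerIncident: boolean }
  | { kind: "strike"; strikes: number }
  | { kind: "quarantine"; hours: number | null; adminReview: boolean }
  | { kind: "support"; cooldownHours: number; urgentAdminAlert: true };

const STRIKE_WINDOW_MS = 7 * 24 * 60 * 60 * 1000; // "3 strikes in 7 days"

// Count strikes recorded within the 7 days before `now`.
function strikesInWindow(strikeTimes: Date[], now: Date): number {
  const cutoff = now.getTime() - STRIKE_WINDOW_MS;
  return strikeTimes.filter(
    (t) => t.getTime() > cutoff && t.getTime() <= now.getTime(),
  ).length;
}

// Map one classified turn to an action, following the severity table.
function decideAction(
  severity: Severity,
  isFirstInfraction: boolean,
  priorStrikes: number, // strikes already inside the 7-day window
): Action {
  switch (severity) {
    case "safety":
      // Self-harm is never punished: no strike, no quarantine.
      return { kind: "support", cooldownHours: 24, urgentAdminAlert: true };
    case "low":
    case "medium": {
      if (isFirstInfraction) {
        // "warn" for low, "warn + register" for medium.
        return { kind: "warn", registerIncident: severity === "medium" };
      }
      const strikes = priorStrikes + 1;
      return strikes >= 3
        ? { kind: "quarantine", hours: 48, adminReview: false }
        : { kind: "strike", strikes };
    }
    case "high":
      return {
        kind: "quarantine",
        hours: isFirstInfraction ? 48 : 7 * 24,
        adminReview: false,
      };
    case "critical":
      return isFirstInfraction
        ? { kind: "quarantine", hours: 7 * 24, adminReview: false }
        : { kind: "quarantine", hours: null, adminReview: true };
  }
}
```

Under these assumptions, `decideAction("low", false, 2)` yields a 48h quarantine because the new strike is the third in the window, while `decideAction("safety", false, 9)` still returns the support action regardless of history.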

Quarantine appeal

A quarantined student sees the QuarantineNotice component with a countdown and an appeal form (max 500 chars, 1 appeal per quarantine). The admin can acknowledge, dismiss (releases the quarantine), resolve, or ignore the appeal (the quarantine then auto-expires).
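
The two appeal limits (500 characters, one appeal per quarantine) could be enforced with a check like the one below. This is a minimal sketch: `validateAppeal`, the `Quarantine` shape, and the rejection of empty text are assumptions made for illustration, not the platform's actual validation.

```typescript
// Hypothetical shape; only the fields this check needs.
interface Quarantine {
  id: string;
  appealSubmitted: boolean;
}

type AppealResult =
  | { ok: true; text: string }
  | { ok: false; reason: "empty" | "too_long" | "already_appealed" };

const MAX_APPEAL_CHARS = 500;

function validateAppeal(quarantine: Quarantine, text: string): AppealResult {
  // Only one appeal is allowed per quarantine.
  if (quarantine.appealSubmitted) {
    return { ok: false, reason: "already_appealed" };
  }
  const trimmed = text.trim();
  // Assumption: a blank appeal is rejected.
  if (trimmed.length === 0) {
    return { ok: false, reason: "empty" };
  }
  // Note: `length` counts UTF-16 code units, so some emoji count as 2.
  if (trimmed.length > MAX_APPEAL_CHARS) {
    return { ok: false, reason: "too_long" };
  }
  return { ok: true, text: trimmed };
}
```

Running the same check on the server as well as in the form keeps the limits from being bypassed by a hand-crafted request.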

Configuration

Enablement cascade

Course.supervisorEnabled (null = inherit)
  ↓
Tenant.supervisorEnabled (null = inherit)
  ↓
default = true for B2B
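
Read top to bottom, the cascade is a chain of null-fallbacks. A minimal sketch, assuming each flag is stored as `boolean | null`; the function name `resolveSupervisorEnabled` is hypothetical:

```typescript
type Flag = boolean | null; // null = inherit from the next level

// The course setting wins, then the tenant setting, then the default.
// The source states the default is true for B2B; other cases are not covered here.
function resolveSupervisorEnabled(
  course: Flag,
  tenant: Flag,
  platformDefault: boolean = true,
): boolean {
  return course ?? tenant ?? platformDefault;
}
```

Using `??` rather than `||` matters here: an explicit `false` at the course level must override a tenant-level `true`, and `||` would skip over it.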

The flag is cached in Redis with a per-tenant version key, supervisor-flag-version:{tenantId}. Every mutation calls bumpSupervisorFlagVersion(tenantId), which invalidates the cached values for that tenant.
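
One common way to build a versioned cache is to embed the current version in every cached key, so that bumping the version makes old entries unreachable. The sketch below shows the idea with an in-memory `Map` standing in for Redis. Only the `supervisor-flag-version:{tenantId}` key and the `bumpSupervisorFlagVersion` name come from this page; the `supervisor-flag:...` key layout and `getCachedFlag` are assumptions.

```typescript
// In-memory stand-in for Redis string keys (illustration only).
const store = new Map<string, string>();

const versionKey = (tenantId: string): string =>
  `supervisor-flag-version:${tenantId}`;

function getVersion(tenantId: string): number {
  return Number(store.get(versionKey(tenantId)) ?? "0");
}

// Called on every mutation. With real Redis, an atomic INCR would fit here.
function bumpSupervisorFlagVersion(tenantId: string): number {
  const next = getVersion(tenantId) + 1;
  store.set(versionKey(tenantId), String(next));
  return next;
}

// Hypothetical key layout: the version is part of the cached key.
function flagCacheKey(tenantId: string, courseId: string): string {
  return `supervisor-flag:${tenantId}:v${getVersion(tenantId)}:${courseId}`;
}

// Return the cached flag, or compute and store it under the current version.
function getCachedFlag(
  tenantId: string,
  courseId: string,
  compute: () => boolean,
): boolean {
  const key = flagCacheKey(tenantId, courseId);
  const hit = store.get(key);
  if (hit !== undefined) return hit === "1";
  const value = compute();
  store.set(key, value ? "1" : "0");
  return value;
}
```

With this layout nothing has to be deleted on mutation. Entries written under an old version are simply never read again, so in a real deployment they would need a TTL to be reclaimed.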

Only global admins can edit

  • PATCH /api/admin/tenants/[id]/supervisor — toggle per tenant
  • PATCH /api/admin/courses/[id]/supervisor — toggle per course
  • Both require the global role === "admin", and both are audited
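
The role check and audit requirement might look like the guard below. This is a sketch under stated assumptions, not the real route handler: the `supervisorEnabled` body field, the status codes, the `Session` and `AuditEntry` shapes, and the in-memory `auditLog` array (standing in for AdminAuditLog) are all hypothetical.

```typescript
interface Session {
  userId: string;
  role: string;
}

interface AuditEntry {
  actorId: string;
  action: string;
  targetId: string;
  value: boolean | null;
}

// Stand-in for AdminAuditLog (illustration only).
const auditLog: AuditEntry[] = [];

// Accept only true, false, or null (null = inherit).
function parseFlag(
  body: unknown,
): { ok: true; value: boolean | null } | { ok: false } {
  if (typeof body !== "object" || body === null) return { ok: false };
  const v = (body as Record<string, unknown>)["supervisorEnabled"];
  if (typeof v === "boolean" || v === null) return { ok: true, value: v };
  return { ok: false };
}

function handleTenantToggle(
  session: Session | null,
  tenantId: string,
  body: unknown,
): { status: number } {
  if (session === null) return { status: 401 };
  // Only the global admin role may change the flag.
  if (session.role !== "admin") return { status: 403 };
  const parsed = parseFlag(body);
  if (!parsed.ok) return { status: 400 };
  // Persisting the flag and bumping the cache version are omitted here.
  auditLog.push({
    actorId: session.userId,
    action: "tenant.supervisor.toggle",
    targetId: tenantId,
    value: parsed.value,
  });
  return { status: 200 };
}
```

Rejected requests write no audit entry in this sketch; whether denied attempts should also be recorded is a policy choice the source does not specify.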

Known limitations

  • False positives in medical/pharmacology contexts: the course context is sent to the supervisor so it can whitelist legitimate terms, but misclassifications can still occur. Workaround: disable the supervisor for specific courses.
  • Language: the supervisor prompt is localized (4 languages), but classification quality may vary slightly between PT-BR and EN-US.
  • Sophisticated jailbreaks: very elaborate prompt-injection attacks may get through. Mitigation: layered defenses.
  • Privacy vs. safety tradeoff: messagesSnapshot contains PII. Retention is capped at 2 years. Global admins view it only through an audited UI.

FAQ

How does Studeia protect students in AI chat?

Three layers: (1) The tutor system prompt includes guardrails. (2) The Supervisor Agent (Haiku, background, ~$0.001/turn) classifies each turn into 5 severity levels x 8 categories. (3) For self-harm (severity=safety), the tutor is interrupted with a welcoming message + crisis resources (in Brazil: CVV 188, SAMU 192) + an URGENT admin notification.

If a student writes something inappropriate, what happens?

It depends on severity. LOW (1st infraction): warning. MEDIUM (1st infraction): warning + incident logged. 3 strikes (LOW or MEDIUM in 7 days): 48h quarantine. HIGH: 48h quarantine. CRITICAL: 7-day quarantine + admin notification. SAFETY (self-harm): NEVER punished. The student gets a welcoming message + crisis resources, and the admin gets an URGENT alert.

Does quarantine prevent the student from using the platform?

No, it blocks only the AI tutor chat. The student keeps access to courses, lessons, materials, the gradebook, and messages with the teacher. The student can submit an appeal (max 500 chars, 1 per quarantine), which is reviewed by an institutional admin.

Can I disable the supervisor for a specific course?

Yes. The setting cascades: Course.supervisorEnabled (null = inherit) → Tenant.supervisorEnabled → default ON. A global admin edits it via /admin/ai-supervisor/tenants. This is useful for medical/pharmacology/anatomy courses where sensitive terms are legitimate.

Is self-harm treated as an infraction?

NEVER. Self-harm (severity=safety) is a crisis, not an infraction. The system: (1) interrupts the tutor with a welcoming message; (2) shows crisis resources; (3) sends an URGENT admin email immediately; (4) creates an incident in 'open' status for human review; (5) NEVER issues strikes or quarantines.
