Avatar & TTS: a real-time talking AI tutor

How it works (single-provider)

The avatar is opt-in per course and uses one provider for both voice and video, so there is no audio/lip desync and no separate TTS step:

HeyGen → LiveAvatar API (FULL mode): the backend creates a session token and starts a LiveKit room; the client connects to LiveKit for video and speaks by publishing a speak_text event on the LiveKit data channel. HeyGen does TTS + video.
D-ID → clips/streams: the backend proxies the SDP/ICE exchange; the tutor's reply is sent as text and D-ID performs the TTS.

The client connects over WebRTC directly to the provider, so the video never passes through Studeia's server. The backend only creates the session, proxies speak/SDP/ICE requests (D-ID), and records usage on stop.
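
As a rough illustration of the HeyGen path, the sketch below shows how a client might publish a speak_text event on the LiveKit data channel. The payload shape ({ type, text }) is an assumption, since only the event name is given here, and DataPublisher is a hypothetical minimal interface. The publishData(data, options) method of LocalParticipant in recent livekit-client versions should be compatible with it.

```typescript
// Hypothetical payload: only the "speak_text" event name is known;
// the field names are assumptions, not HeyGen's documented schema.
interface SpeakTextEvent {
  type: "speak_text";
  text: string;
}

// Minimal slice of a LiveKit local participant (assumed shape).
interface DataPublisher {
  publishData(data: Uint8Array, options?: { reliable?: boolean }): Promise<void>;
}

// Serialize the event as UTF-8 JSON, rejecting empty text.
function encodeSpeakText(text: string): Uint8Array {
  const trimmed = text.trim();
  if (trimmed.length === 0) {
    throw new Error("speak_text requires non-empty text");
  }
  const event: SpeakTextEvent = { type: "speak_text", text: trimmed };
  return new TextEncoder().encode(JSON.stringify(event));
}

// Publish reliably so a tutor reply is not silently dropped.
async function speak(publisher: DataPublisher, text: string): Promise<void> {
  await publisher.publishData(encodeSpeakText(text), { reliable: true });
}
```

In a real client, publisher would be room.localParticipant once the LiveKit room is connected.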

Configuration

Per tenant: connect a HeyGen or D-ID key (encrypted with AES-256-GCM), test it, and set a monthly minute cap.
Per course: Course.avatarProvider, avatarId, avatarVoiceId, avatarQuality and the avatarEnabled flag. One key → many avatars (avatar is a per-session parameter).
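
To make "avatar is a per-session parameter" concrete, here is a minimal sketch that turns the per-course fields into session parameters. The field names come from the list above. Their types, the provider literals "heygen" and "did", the "medium" default quality and the toSessionParams helper are all assumptions for illustration.

```typescript
// Provider literals are assumed; the real enum values may differ.
type AvatarProvider = "heygen" | "did";

// Per-course fields named in the configuration list (types assumed).
interface CourseAvatarConfig {
  avatarEnabled: boolean;
  avatarProvider: AvatarProvider | null;
  avatarId: string | null;
  avatarVoiceId: string | null;
  avatarQuality: string | null;
}

// Hypothetical parameters sent when a session is created.
interface SessionParams {
  provider: AvatarProvider;
  avatarId: string;
  voiceId?: string;
  quality: string;
}

// Returns null when the course is not ready to start an avatar session.
function toSessionParams(
  course: CourseAvatarConfig,
  defaultQuality = "medium",
): SessionParams | null {
  if (!course.avatarEnabled || !course.avatarProvider || !course.avatarId) {
    return null;
  }
  return {
    provider: course.avatarProvider,
    avatarId: course.avatarId,
    // Only include a voice when the course overrides it.
    ...(course.avatarVoiceId ? { voiceId: course.avatarVoiceId } : {}),
    quality: course.avatarQuality ?? defaultQuality,
  };
}
```

Because the tenant key is not part of these parameters, two courses under the same tenant can use different avatars with the same key.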

Security & quota

The master API key never goes to the client; only ephemeral session/LiveKit tokens do. Speak/SDP/ICE are proxied server-side with AvatarSession.userId ownership checks.
The monthly cap (monthlyMinuteCap) is checked before starting a session (fail-closed → quota_exceeded). Usage and cost are written to AvatarUsageLog.
Gating: avatarEnabled + a configured provider/avatar on the course + an active enrollment + the student's opt-in.
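
The gating and quota rules above can be sketched as a single pre-start check. Only the quota_exceeded code and the monthlyMinuteCap name appear in this document. The other reason codes, the StartContext shape and the choice to treat an unknown cap or unknown usage as a denial (one reading of "fail-closed") are assumptions.

```typescript
// Only "quota_exceeded" is named in the source; the rest are hypothetical.
type DenyReason =
  | "avatar_disabled"
  | "not_configured"
  | "not_enrolled"
  | "not_opted_in"
  | "quota_exceeded";

type StartDecision = { ok: true } | { ok: false; reason: DenyReason };

interface StartContext {
  avatarEnabled: boolean;
  providerAndAvatarConfigured: boolean;
  activeEnrollment: boolean;
  studentOptedIn: boolean;
  monthlyMinuteCap: number | null;     // null: cap could not be read
  minutesUsedThisMonth: number | null; // null: usage lookup failed
}

function canStartSession(ctx: StartContext): StartDecision {
  // Gating: every condition must hold before quota is even considered.
  if (!ctx.avatarEnabled) return { ok: false, reason: "avatar_disabled" };
  if (!ctx.providerAndAvatarConfigured) return { ok: false, reason: "not_configured" };
  if (!ctx.activeEnrollment) return { ok: false, reason: "not_enrolled" };
  if (!ctx.studentOptedIn) return { ok: false, reason: "not_opted_in" };

  // Fail closed: unknown or non-finite values deny, as does reaching the cap.
  const cap = ctx.monthlyMinuteCap;
  const used = ctx.minutesUsedThisMonth;
  if (
    cap === null ||
    used === null ||
    !Number.isFinite(cap) ||
    !Number.isFinite(used) ||
    used >= cap
  ) {
    return { ok: false, reason: "quota_exceeded" };
  }
  return { ok: true };
}
```

Running the check before the provider session is created means a denied request never consumes provider minutes.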

Graceful degradation

full_avatar → audio_only (TTS + static image) → text_only, so the tutor always responds even if the avatar provider fails.
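
One way to express this fallback chain is a small loop that tries each mode in order and moves to the next one on failure. The mode names come from the line above. The Runner type and the respond and nextMode helpers are hypothetical, and how each mode is actually rendered is left to the caller.

```typescript
type TutorMode = "full_avatar" | "audio_only" | "text_only";

// Ordered from most to least capable.
const FALLBACK_ORDER: readonly TutorMode[] = ["full_avatar", "audio_only", "text_only"];

// Returns the next, less capable mode, or null when already at text_only.
function nextMode(mode: TutorMode): TutorMode | null {
  return FALLBACK_ORDER[FALLBACK_ORDER.indexOf(mode) + 1] ?? null;
}

// A runner delivers the tutor's reply in one mode and rejects on failure.
type Runner = (text: string) => Promise<void>;

// Tries modes in order and resolves with the mode that succeeded.
async function respond(
  text: string,
  runners: Record<TutorMode, Runner>,
  start: TutorMode = "full_avatar",
): Promise<TutorMode> {
  let mode: TutorMode | null = start;
  let lastError: unknown;
  while (mode !== null) {
    try {
      await runners[mode](text);
      return mode;
    } catch (err) {
      lastError = err; // remember why this mode failed, then degrade
      mode = nextMode(mode);
    }
  }
  // Reached only if every mode, including text_only, failed.
  throw lastError;
}
```

Since text_only has no dependency on the avatar provider, it acts as the floor of the chain in this sketch.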

Mobile

On mobile the avatar runs in a WebView that loads the same /avatar-embed page used on web (no native WebRTC modules in Expo); a React Native bridge forwards control messages.
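
A minimal sketch of the control-message side of such a bridge follows. The message kinds (speak, stop, status) and their fields are hypothetical, as the actual protocol is not described here. The sketch covers only encoding and validation, so malformed input from the WebView is dropped rather than trusted.

```typescript
// Hypothetical control messages exchanged between the app and /avatar-embed.
type BridgeMessage =
  | { kind: "speak"; text: string }
  | { kind: "stop" }
  | { kind: "status"; state: "connecting" | "ready" | "failed" };

function encodeBridgeMessage(msg: BridgeMessage): string {
  return JSON.stringify(msg);
}

// Parses and validates a raw string; returns null for anything unexpected.
function parseBridgeMessage(raw: string): BridgeMessage | null {
  let value: unknown;
  try {
    value = JSON.parse(raw);
  } catch {
    return null;
  }
  if (typeof value !== "object" || value === null) return null;
  const fields = value as Record<string, unknown>;
  const kind = fields["kind"];
  if (kind === "speak") {
    const text = fields["text"];
    return typeof text === "string" ? { kind, text } : null;
  }
  if (kind === "stop") {
    return { kind };
  }
  if (kind === "status") {
    const state = fields["state"];
    return state === "connecting" || state === "ready" || state === "failed"
      ? { kind, state }
      : null;
  }
  return null;
}
```

With react-native-webview, the native side would typically pass event.nativeEvent.data from the onMessage handler to parseBridgeMessage, and the page would send strings with window.ReactNativeWebView.postMessage.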

Not yet (roadmap)

Voice input, where the student speaks to the tutor (speech → STT → chat), is already implemented as a separate feature (B2B, dictation): the speech is transcribed to text and fills the message field for the student to review and send (no auto-send). The avatar, however, is output-only: two-way voice conversation, in which the avatar replies to speech in a continuous loop, is not yet implemented.