RAG Ingestion: AI tutor with course material

How to ingest course content into per-tenant RAG for the AI tutor. Slides, video transcripts, PDFs, quizzes. Voyage AI embeddings (1024 dims). Full & incremental modes. Auto-sync on lesson edit.

By Studeia Team 2026-05-23 6 min

Short answer

RAG ingestion in Studeia lets the AI tutor cite your course material. POST /api/institution/courses/[id]/rag-ingest extracts text from lessons (slides, videos with transcription, PDFs, quizzes), splits it into 800-token chunks with a 200-token overlap, generates embeddings via Voyage AI (1024 dims, with OpenAI as fallback), and stores them filtered by tenant and course. Ingestion runs in full or incremental mode. With autoSyncRag=true, a lesson edit triggers reingestion automatically.

How it works

POST /api/institution/courses/[courseId]/rag-ingest
Body: { "mode": "full" | "incremental" }
  ↓
1. List published lessons in the course
2. Extract text per type (rich_text → strip HTML, slides → text+notes, quiz → Q+A, pdf → OCR, video → transcription)
3. Chunk: 800 tokens, 200-token overlap
4. Generate embeddings via Voyage AI (1024 dims, OpenAI fallback)
5. Create ContentBlock + ContentEmbedding with metadata
6. Record final status in CourseRagIngestion
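
Step 3 can be illustrated with a minimal sliding-window sketch. It is not Studeia's implementation: the function name chunkTokens is hypothetical, and whitespace-style placeholder tokens stand in for real tokenizer output, which the article does not specify. Only the 800/200 figures come from the text above.

```typescript
// Sketch only: the article gives chunk size and overlap, not the tokenizer
// or the actual chunking code. Names here are hypothetical.
const CHUNK_SIZE = 800; // tokens per chunk (from the article)
const CHUNK_OVERLAP = 200; // tokens shared by neighbouring chunks (from the article)

function chunkTokens(
  tokens: string[],
  size: number = CHUNK_SIZE,
  overlap: number = CHUNK_OVERLAP,
): string[][] {
  if (size <= 0 || overlap < 0 || overlap >= size) {
    throw new Error("expected size > 0 and 0 <= overlap < size");
  }
  const stride = size - overlap; // 600 with the defaults
  const chunks: string[][] = [];
  for (let start = 0; start < tokens.length; start += stride) {
    chunks.push(tokens.slice(start, start + size));
    // Stop once a chunk reaches the end, so no trailing chunk is pure overlap.
    if (start + size >= tokens.length) break;
  }
  return chunks;
}

// Usage: 2,000 placeholder tokens produce windows starting at 0, 600 and 1200.
const demoTokens = Array.from({ length: 2000 }, (_, i) => `t${i}`);
const demoChunks = chunkTokens(demoTokens);
```

The overlap means the last 200 tokens of one chunk reappear at the start of the next, so a sentence that straddles a boundary is still retrievable from at least one chunk.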

Modes

mode: "full": Deletes ALL ContentBlock and ContentEmbedding records for the course and re-ingests everything. Use it for the first ingestion or after a major reorganization.

mode: "incremental": Identifies lessons modified after the last ingestion, deletes only their chunks, and reingests them. Recommended for production.
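
The selection step in incremental mode can be sketched as a timestamp comparison. This is an illustration under assumptions: LessonStub, its updatedAt field, and lessonsToReingest are hypothetical names, and treating a missing previous ingestion as "reingest everything" is a guess, not documented behavior.

```typescript
// Hypothetical shape: the article does not describe the lesson schema.
interface LessonStub {
  id: string;
  updatedAt: Date;
}

// Returns the lessons whose chunks would be deleted and rebuilt.
function lessonsToReingest(
  lessons: LessonStub[],
  lastIngestedAt: Date | null,
): LessonStub[] {
  // Assumption: with no earlier ingestion, every lesson is treated as changed.
  if (lastIngestedAt === null) return lessons;
  return lessons.filter(
    (lesson) => lesson.updatedAt.getTime() > lastIngestedAt.getTime(),
  );
}

// Usage: only lesson "b" was edited after the last ingestion on 2026-05-10.
const lessonsDemo: LessonStub[] = [
  { id: "a", updatedAt: new Date("2026-05-01T00:00:00Z") },
  { id: "b", updatedAt: new Date("2026-05-20T00:00:00Z") },
];
const changedIds = lessonsToReingest(
  lessonsDemo,
  new Date("2026-05-10T00:00:00Z"),
).map((lesson) => lesson.id);
```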

tenantOnlyMode

RetrievalAgent runs with tenantOnlyMode: true, so the tutor NEVER cites content from another tenant, even if similar content exists in the database.
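
The effect of tenant-only retrieval can be shown with a small in-memory sketch: filter by tenant and course first, then rank by similarity. This is a stand-in for illustration, not RetrievalAgent's code. StoredChunk, retrieveForTenant, and the two-dimensional toy embeddings are all hypothetical; a production system would apply the same filter inside the vector query.

```typescript
// Hypothetical record shape for illustration only.
interface StoredChunk {
  tenantId: string;
  courseId: string;
  text: string;
  embedding: number[];
}

function cosineSimilarity(a: number[], b: number[]): number {
  let dot = 0;
  let normA = 0;
  let normB = 0;
  for (let i = 0; i < Math.min(a.length, b.length); i++) {
    const x = a[i] ?? 0;
    const y = b[i] ?? 0;
    dot += x * y;
    normA += x * x;
    normB += y * y;
  }
  const denom = Math.sqrt(normA) * Math.sqrt(normB);
  return denom === 0 ? 0 : dot / denom;
}

// Filter BEFORE ranking, so another tenant's chunk can never be returned,
// however similar its embedding is to the query.
function retrieveForTenant(
  chunks: StoredChunk[],
  query: { tenantId: string; courseId: string; embedding: number[] },
  topK: number = 3,
): StoredChunk[] {
  return chunks
    .filter((c) => c.tenantId === query.tenantId && c.courseId === query.courseId)
    .map((c) => ({ chunk: c, score: cosineSimilarity(c.embedding, query.embedding) }))
    .sort((x, y) => y.score - x.score)
    .slice(0, topK)
    .map((scored) => scored.chunk);
}

// Usage: tenant-b holds an identical embedding but is excluded by the filter.
const demoStore: StoredChunk[] = [
  { tenantId: "tenant-a", courseId: "c1", text: "A: photosynthesis notes", embedding: [1, 0] },
  { tenantId: "tenant-b", courseId: "c1", text: "B: photosynthesis notes", embedding: [1, 0] },
  { tenantId: "tenant-a", courseId: "c1", text: "A: unrelated", embedding: [0, 1] },
];
const hits = retrieveForTenant(
  demoStore,
  { tenantId: "tenant-a", courseId: "c1", embedding: [1, 0] },
  2,
);
```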

Limitations

Images: not converted to embeddings. Roadmap: automatic description via a vision LLM.
Math equations: extracted as LaTeX-like text. Quality depends on the original markup.
Videos without transcription: not ingested. Configure auto-transcription first.
Max size per course: pgvector supports millions of vectors, but retrieval latency grows with the number of chunks. Courses with more than 10K chunks can be slow.

See also

AI Tutor Overview

FAQ

Does the AI tutor cite my course material?

Yes, but you need to ingest the course into RAG first: POST /api/institution/courses/[id]/rag-ingest (modes: full | incremental). The system extracts text from all published lessons, chunks it, generates embeddings via Voyage AI, and stores them with a tenantId+courseId filter. After this, all tutor conversations about the course cite the correct material.

How to update RAG when I edit a lesson?

Set Course.autoSyncRag=true. Every lesson edit via the API then triggers an incremental reingest via after() (it runs in the background and doesn't block the admin). Alternatively, trigger it manually: POST /api/institution/courses/[id]/rag-ingest with body {mode: 'incremental'}.
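
A manual trigger could be assembled roughly as follows. The path and body come from this article; the helper name buildRagIngestRequest, the base URL, and the course id are placeholders, and authentication is left out because the article does not describe it.

```typescript
type IngestMode = "full" | "incremental";

interface IngestRequest {
  url: string;
  init: { method: "POST"; headers: Record<string, string>; body: string };
}

// Hypothetical helper: builds the request without sending it.
function buildRagIngestRequest(
  baseUrl: string,
  courseId: string,
  mode: IngestMode,
): IngestRequest {
  const root = baseUrl.replace(/\/+$/, ""); // drop trailing slashes
  return {
    url: `${root}/api/institution/courses/${encodeURIComponent(courseId)}/rag-ingest`,
    init: {
      method: "POST",
      headers: { "Content-Type": "application/json" },
      body: JSON.stringify({ mode }),
    },
  };
}

// Usage with placeholder values. Sending it would look like:
//   const response = await fetch(ingestRequest.url, ingestRequest.init);
// Auth headers or cookies are an open assumption and are omitted here.
const ingestRequest = buildRagIngestRequest(
  "https://example.invalid/",
  "course_123",
  "incremental",
);
```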

How much does it cost to ingest a course?

Embeddings via Voyage AI cost ~$0.00005/1K tokens. A medium course (30 lessons, 50K words, ~70K tokens) costs ~$0.004 in embeddings on first ingestion. Incremental cost is proportional to the delta.
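
The estimate above is simple arithmetic: 70K tokens / 1K × $0.00005 = $0.0035, which rounds to the quoted ~$0.004. A small sketch, with a hypothetical function name and the price taken from this article rather than from a current price list:

```typescript
// Price quoted in this article; check the provider's current pricing before relying on it.
const PRICE_PER_1K_TOKENS_USD = 0.00005;

function embeddingCostUsd(
  tokens: number,
  pricePer1kUsd: number = PRICE_PER_1K_TOKENS_USD,
): number {
  return (tokens / 1000) * pricePer1kUsd;
}

// Medium course from the example: ~70K tokens, about $0.0035.
const mediumCourseCostUsd = embeddingCostUsd(70000);
```

If the 70K figure counts source tokens before chunking, the 200-token overlap means some tokens are embedded twice, so the embedded total could be up to roughly a third higher. The order of magnitude stays the same.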

What lesson types are ingested?

rich_text (HTML stripped), slides (text elements + speaker notes), quiz (question + explanation), pdf (document-extractor), video (LiveClassTranscription when available), assignment (instructions). external_link and live_class without transcription are not ingested.
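
The rules in this answer can be restated as a small predicate. This is one reading of the text, not the actual ingestion code: the function name isIngestible is hypothetical, and treating a live_class with a transcription as ingestible is an interpretation of the sentence above.

```typescript
// Lesson type identifiers as they appear in this article.
type LessonType =
  | "rich_text"
  | "slides"
  | "quiz"
  | "pdf"
  | "video"
  | "assignment"
  | "external_link"
  | "live_class";

function isIngestible(type: LessonType, hasTranscription: boolean): boolean {
  switch (type) {
    case "external_link":
      return false; // never ingested
    case "video":
    case "live_class":
      // Assumption: both depend on a transcription being available.
      return hasTranscription;
    default:
      return true; // rich_text, slides, quiz, pdf, assignment
  }
}

// Usage: a video without a transcription is skipped, a PDF is always taken.
const skipsSilentVideo = !isIngestible("video", false);
const takesPdf = isIngestible("pdf", false);
```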
