How it works
POST /api/institution/courses/[courseId]/rag-ingest
Body: { "mode": "full" | "incremental" }
↓
1. List the published lessons in the course
2. Extract text per lesson type (rich_text → strip HTML, slides → text + notes, quiz → questions + answers, pdf → OCR, video → transcription)
3. Chunk the text into 800-token windows with a 200-token overlap
4. Generate embeddings with Voyage AI (1024 dimensions; OpenAI as fallback)
5. Create ContentBlock and ContentEmbedding records with metadata
6. Record the final status in CourseRagIngestion
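The chunking in step 3 can be pictured as a sliding window over the token stream. The sketch below is only an illustration, not the service's actual code: chunkTokens is a hypothetical helper, and because the tokenizer is not specified here, it works on any pre-tokenized array.

```typescript
// Hypothetical helper: split a token array into overlapping windows.
// Defaults mirror the documented settings (800 tokens, 200 overlap).
function chunkTokens<T>(tokens: T[], size = 800, overlap = 200): T[][] {
  if (size <= 0 || overlap < 0 || overlap >= size) {
    throw new RangeError("need size > 0 and 0 <= overlap < size");
  }
  const stride = size - overlap; // with the defaults, each window starts 600 tokens after the previous one
  const chunks: T[][] = [];
  for (let start = 0; start < tokens.length; start += stride) {
    chunks.push(tokens.slice(start, start + size));
    // Stop once a window reaches the end, so the tail is not emitted twice.
    if (start + size >= tokens.length) break;
  }
  return chunks;
}

// Usage: 2000 placeholder tokens produce windows starting at 0, 600 and 1200.
const demoTokens = Array.from({ length: 2000 }, (_, i) => i);
const demoChunks = chunkTokens(demoTokens);
```

With these settings, consecutive chunks share 200 tokens, so a sentence that straddles a boundary still appears whole in at least one chunk as long as it is shorter than the overlap.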
Modes
mode: "full": Deletes ALL ContentBlock and ContentEmbedding records for the course and re-ingests everything. Use it for the first ingestion or after a major reorganization.
mode: "incremental": Identifies lessons modified after the last ingestion, deletes only their chunks, and re-ingests those lessons. Recommended for production.
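The difference between the two modes comes down to which lessons are selected for deletion and re-ingestion. A minimal sketch, assuming each lesson exposes an updatedAt timestamp and the last ingestion time is known; LessonRef, selectLessonsToIngest and the fallback for a missing previous ingestion are all hypothetical, not taken from the implementation.

```typescript
type IngestMode = "full" | "incremental";

// Hypothetical shape: only the fields this sketch needs.
interface LessonRef {
  id: string;
  updatedAt: Date;
}

// Returns the lessons whose chunks would be deleted and rebuilt.
// Assumption: with no previous ingestion, "incremental" falls back to everything.
function selectLessonsToIngest(
  lessons: LessonRef[],
  mode: IngestMode,
  lastIngestedAt: Date | null,
): LessonRef[] {
  if (mode === "full" || lastIngestedAt === null) {
    return lessons;
  }
  const cutoff = lastIngestedAt.getTime();
  // Keep only lessons modified strictly after the last ingestion.
  return lessons.filter((lesson) => lesson.updatedAt.getTime() > cutoff);
}

const lastRun = new Date("2024-05-01T00:00:00Z");
const sampleLessons: LessonRef[] = [
  { id: "intro", updatedAt: new Date("2024-04-20T10:00:00Z") },
  { id: "quiz-1", updatedAt: new Date("2024-05-03T09:30:00Z") },
];
const changedIds = selectLessonsToIngest(sampleLessons, "incremental", lastRun).map((l) => l.id);
// changedIds contains only "quiz-1"; "full" would select both lessons.
```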
tenantOnlyMode
RetrievalAgent runs with tenantOnlyMode: true, so the tutor NEVER cites content from another tenant, even if similar content exists in the database.
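The effect of tenantOnlyMode can be illustrated with a simple filter over retrieved chunks. This is a sketch under stated assumptions: RetrievalAgent's internals are not shown here, RetrievedChunk and filterByTenant are hypothetical names, and a real system would more likely apply the tenant constraint inside the vector query than after it.

```typescript
// Hypothetical shape of a retrieval hit; assumes each chunk carries its tenant.
interface RetrievedChunk {
  tenantId: string;
  text: string;
  score: number;
}

// With tenantOnlyMode on, drop every hit that belongs to a different tenant,
// no matter how high its similarity score is.
function filterByTenant(
  hits: RetrievedChunk[],
  requestTenantId: string,
  tenantOnlyMode: boolean,
): RetrievedChunk[] {
  if (!tenantOnlyMode) {
    return hits;
  }
  return hits.filter((hit) => hit.tenantId === requestTenantId);
}

const sampleHits: RetrievedChunk[] = [
  { tenantId: "school-a", text: "Photosynthesis overview", score: 0.91 },
  { tenantId: "school-b", text: "Photosynthesis overview", score: 0.97 },
];
const visibleHits = filterByTenant(sampleHits, "school-a", true);
// visibleHits keeps only the "school-a" hit, even though the other one scores higher.
```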
Limitations
- Images: not converted to embeddings. On the roadmap: automatic descriptions generated by a vision LLM.
- Math equations: extracted as LaTeX-like text. Quality depends on the original markup.
- Videos without transcription: not ingested. Configure auto-transcription first.
- Max size per course: pgvector supports millions of vectors, but retrieval latency grows with the number of vectors. Courses with more than 10K chunks can be slow to query.
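To judge whether a course is likely to cross the 10K-chunk mark, the chunk count can be estimated from per-lesson token counts. The sketch below assumes each lesson is chunked independently with the 800/200 sliding window described earlier; estimateChunks and estimateCourseChunks are hypothetical helpers, and real counts depend on the tokenizer actually used.

```typescript
const CHUNK_SIZE = 800;
const CHUNK_OVERLAP = 200;
const SLOW_THRESHOLD = 10000; // the ">10K chunks" figure from the limitation above

// Number of sliding windows needed to cover tokenCount tokens.
function estimateChunks(tokenCount: number): number {
  if (tokenCount <= 0) return 0;
  if (tokenCount <= CHUNK_SIZE) return 1;
  const stride = CHUNK_SIZE - CHUNK_OVERLAP; // 600 new tokens per extra chunk
  return 1 + Math.ceil((tokenCount - CHUNK_SIZE) / stride);
}

// Assumption: lessons are chunked one by one, so the counts simply add up.
function estimateCourseChunks(lessonTokenCounts: number[]): number {
  return lessonTokenCounts.reduce((sum, n) => sum + estimateChunks(n), 0);
}

// Usage: 1300 lessons of about 5000 tokens each.
const perLesson = estimateChunks(5000); // 1 + ceil(4200 / 600) = 8
const courseTotal = estimateCourseChunks(new Array<number>(1300).fill(5000)); // 1300 * 8 = 10400
const likelySlow = courseTotal > SLOW_THRESHOLD;
```

In this example the estimate lands slightly above the threshold, which suggests splitting the course or checking retrieval latency before relying on it in production.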