System Block Diagram

Component Descriptions

Frontend

Child Interface - The screen the child uses to log in, pick their companion, and interact with quizzes. Captures the child's voice answer through the browser microphone and sends it to the backend.
Video Player & Companion - Plays the YouTube video and displays the chosen companion character alongside it. The companion reacts and speaks using a voice provided by Hume AI.
Parent / Admin Dashboard - Lets parents set the interaction mode (Flexible, Strict, or Passive), manage their child's profile, review progress reports, and manage video activities.

Backend (FastAPI)

Auth & Permissions - Handles access-code login for children and parents, and role-based access for admins.
Quiz Logic - Controls when questions appear during the video, enforces the interaction mode rules (pause, rewind, or continue), and records results.
Speech Recognition - Receives the child's recorded voice answer from the frontend, transcribes it to text, and passes it to Quiz Logic for evaluation.
AI Question Generation - Sends video frames and transcripts to an AI provider to generate quiz questions. Supports OpenAI, Anthropic, and Gemini.
Video Processing - Downloads videos from YouTube and prepares them for use in activities.
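
As a rough illustration of how Quiz Logic might decide when a question appears, the sketch below checks which questions fall inside the stretch of video the player has just played. It assumes each question carries a playback timestamp; the `Question` class, its `at_seconds` field, and the `due_questions` helper are hypothetical names, not taken from the actual backend.

```python
from bisect import bisect_right
from dataclasses import dataclass


@dataclass(frozen=True)
class Question:
    at_seconds: float  # hypothetical: playback time at which the question fires
    prompt: str


def due_questions(questions, last_position, position):
    """Return the questions whose timestamp lies in (last_position, position].

    Assumes `questions` is already sorted by `at_seconds`.
    """
    times = [q.at_seconds for q in questions]
    start = bisect_right(times, last_position)  # skip questions already passed
    end = bisect_right(times, position)         # include questions reached by now
    return questions[start:end]


questions = [
    Question(12.0, "What color was the ball?"),
    Question(45.5, "How many ducks did you see?"),
]

# The player reports that playback moved from 10 s to 13 s,
# so the question at 12 s is due.
for q in due_questions(questions, 10.0, 13.0):
    print(q.prompt)
```

Using a half-open interval means a question fires once as playback crosses its timestamp and is not repeated on the next position update.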

External Services

YouTube - Source of all video content.
OpenAI / Anthropic / Gemini - AI providers used to generate quiz questions from video content and evaluate child responses.
Hume AI - Provides the expressive companion voices that respond to the child during the quiz.

Data

SQLite Database - Stores user accounts, quiz results, progress history, and video metadata.
Local File Storage - Holds downloaded videos, extracted video frames, and generated question files.
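
To make the storage layout more concrete, here is a minimal sketch of what the SQLite schema could look like, using Python's built-in `sqlite3` module. The table and column names (`users`, `videos`, `quiz_results`, `local_path`, and so on) are assumptions chosen for illustration; the real schema may differ. Progress history is assumed to be derived from the stored quiz results.

```python
import sqlite3

# Hypothetical schema: names and columns are illustrative only.
SCHEMA = """
CREATE TABLE IF NOT EXISTS users (
    id          INTEGER PRIMARY KEY,
    role        TEXT NOT NULL CHECK (role IN ('child', 'parent', 'admin')),
    access_code TEXT UNIQUE
);
CREATE TABLE IF NOT EXISTS videos (
    id         INTEGER PRIMARY KEY,
    youtube_id TEXT NOT NULL UNIQUE,
    title      TEXT,
    local_path TEXT  -- where the downloaded file sits in Local File Storage
);
CREATE TABLE IF NOT EXISTS quiz_results (
    id          INTEGER PRIMARY KEY,
    user_id     INTEGER NOT NULL REFERENCES users(id),
    video_id    INTEGER NOT NULL REFERENCES videos(id),
    question    TEXT NOT NULL,
    transcript  TEXT,              -- the child's transcribed answer
    correct     INTEGER NOT NULL,  -- 1 for correct, 0 for incorrect
    answered_at TEXT DEFAULT CURRENT_TIMESTAMP
);
"""


def open_db(path=":memory:"):
    """Open the database and create the tables if they do not exist yet."""
    conn = sqlite3.connect(path)
    conn.execute("PRAGMA foreign_keys = ON")
    conn.executescript(SCHEMA)
    return conn


conn = open_db()
conn.execute("INSERT INTO users (role, access_code) VALUES ('child', 'ABC123')")
conn.execute(
    "INSERT INTO videos (youtube_id, title) VALUES ('placeholder-id', 'Example video')"
)
conn.execute(
    "INSERT INTO quiz_results (user_id, video_id, question, transcript, correct) "
    "VALUES (1, 1, 'What color was the ball?', 'red', 1)"
)
conn.commit()

# A simple progress query: answers given and answers correct for one child.
row = conn.execute(
    "SELECT COUNT(*), SUM(correct) FROM quiz_results WHERE user_id = 1"
).fetchone()
print(row)
```

The in-memory path keeps the example self-contained; a file path would be passed to `open_db` for persistent storage.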

A question appears and the video pauses (in Strict or Flexible mode).
The child speaks their answer aloud into the device microphone.
The browser captures the audio and sends it to the backend.
The backend transcribes the audio to text using Speech Recognition.
Quiz Logic evaluates the transcribed answer.
The result is sent back to the frontend. The companion reacts with spoken feedback via Hume AI, and the video either continues or rewinds depending on the interaction mode.
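
The steps above can be sketched as a single backend function. This is an illustration under stated assumptions, not the actual implementation: the transcription step is a stand-in (the speech recognition engine is not specified), answer evaluation is reduced to a keyword match, and the rule that only Strict mode rewinds after a wrong answer is a guess at how the modes differ. All function names are hypothetical.

```python
from enum import Enum


class Mode(Enum):
    FLEXIBLE = "flexible"
    STRICT = "strict"
    PASSIVE = "passive"


def fake_transcribe(audio: bytes) -> str:
    """Stand-in for Speech Recognition: treats the bytes as UTF-8 text."""
    return audio.decode("utf-8")


def evaluate(transcript: str, expected: str) -> bool:
    """Assumed evaluation: case-insensitive keyword match on the transcript."""
    return expected.lower() in transcript.lower()


def video_action(mode: Mode, correct: bool) -> str:
    """Assumed rule: only Strict mode rewinds after a wrong answer."""
    if mode is Mode.STRICT and not correct:
        return "rewind"
    return "continue"


def handle_answer(audio: bytes, expected: str, mode: Mode,
                  transcribe=fake_transcribe) -> dict:
    """Transcribe the audio, evaluate it, and build the frontend response."""
    transcript = transcribe(audio)
    correct = evaluate(transcript, expected)
    return {
        "transcript": transcript,
        "correct": correct,
        # Text the companion would speak; voice synthesis is out of scope here.
        "feedback": "Great job!" if correct else "Not quite.",
        "video_action": video_action(mode, correct),
    }


result = handle_answer(b"I think it was red", "red", Mode.STRICT)
print(result["correct"], result["video_action"])
```

Passing the transcriber in as a parameter keeps the flow testable without a microphone or an external speech service.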