A modular multimodal AI system that combines real-time computer vision and microphone audio to infer:
- Emotion: happy, sad, angry, neutral, frustrated
- Cognitive state: focused, confused, overloaded, fatigued
- Intent: problem-solving, asking for help, idle, exploring
- Behaviour trends: improving, deteriorating, disengaging
- Stress / engagement / attention: continuous-valued, with anomaly + event detection
It exposes results both as a structured JSON stream and through a live React dashboard with charts, gauges, the webcam feed, and an event log.
- Three-level fusion: feature-level, decision-level (calibrated, learnable weights) and context-level (temporal + behaviour-aware).
- Temporal LSTM/GRU over the last N seconds to predict trend direction and stress trajectory.
- Hybrid reasoning engine: transparent rule-based layer + optional learned classifier, with explanations attached to every output.
- Per-user calibration: rolling baseline of pitch, energy, neutral face, gaze, and head pose. All affect scores are z-normalized against this baseline.
- Event detection: frustration spike, attention drop, disengagement, fatigue onset, anomaly burst.
- Adaptive responder: the system reacts to user state (e.g. suggests a break, lowers UI density, surfaces help).
- Self-supervised refinement: rolling pseudo-label memory used to fine-tune the decision fuser online.
- Anomaly detection: streaming z-score + IsolationForest over the fused feature vector.
- Edge-ready: PyTorch → ONNX → TFLite conversion script, INT8 quantization, Jetson notes.
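
The per-user calibration bullet above describes z-normalizing affect features against a rolling baseline. A minimal sketch of that idea for a single scalar feature follows. The class name `RollingBaseline`, the window length, and the `min_samples` warm-up threshold are illustrative assumptions, not the project's actual API.

```python
from collections import deque
from statistics import fmean, pstdev


class RollingBaseline:
    """Rolling baseline for one scalar feature (hypothetical helper)."""

    def __init__(self, window: int = 300, min_samples: int = 30) -> None:
        # Only the most recent `window` observations are kept.
        self.values: deque[float] = deque(maxlen=window)
        self.min_samples = min_samples

    @property
    def ready(self) -> bool:
        """True once enough samples exist to trust the baseline."""
        return len(self.values) >= self.min_samples

    def update(self, value: float) -> None:
        self.values.append(value)

    def zscore(self, value: float) -> float:
        """Return the z-score of `value`, or 0.0 while uncalibrated."""
        if not self.ready:
            return 0.0
        std = pstdev(self.values)
        if std < 1e-8:  # flat baseline: avoid dividing by ~0
            return 0.0
        return (value - fmean(self.values)) / std


# Usage: build a short pitch baseline (Hz), then score a new reading.
baseline = RollingBaseline(window=300, min_samples=5)
for pitch_hz in [118.0, 122.0, 120.0, 119.0, 121.0]:
    baseline.update(pitch_hz)
print(round(baseline.zscore(126.0), 2))  # 4.24
```

Returning 0.0 during warm-up keeps downstream fusion neutral until the baseline is usable; a real implementation might instead expose the `ready` flag, as the `calibration` field of the output frame suggests.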
| Layer | Technologies |
|---|---|
| Runtime | Python 3.10+ |
| ML / inference | PyTorch, torchvision, torchaudio, scikit-learn, NumPy, SciPy |
| Vision | OpenCV, MediaPipe, Pillow |
| Audio | librosa, sounddevice, soundfile |
| API | FastAPI, Uvicorn, WebSockets, Pydantic, PyYAML |
| Frontend | React 18, TypeScript, Vite, Recharts |
| Edge export (optional) | ONNX, ONNX Runtime; TensorFlow / TFLite for Jetson deployment |
| Dev / test | pytest, rich, loguru, tqdm |
┌──────────────────────────┐        ┌──────────────────────────┐
│     Webcam (OpenCV)      │        │ Microphone (sounddevice) │
└────────────┬─────────────┘        └────────────┬─────────────┘
             │ frames                            │ audio blocks
             ▼                                   ▼
  ┌─────────────────────┐             ┌─────────────────────┐
  │   Vision Pipeline   │             │   Audio Pipeline    │
  │ MediaPipe FaceMesh  │             │ librosa: MFCC, F0,  │
  │ EAR, brow, mouth,   │             │ RMS, speech rate,   │
  │ head pose, gaze     │             │ pause ratio         │
  │ FER CNN (PyTorch)   │             │ SER CNN (PyTorch)   │
  └──────────┬──────────┘             └──────────┬──────────┘
             │ emotion + features                │ emotion + features
             ▼                                   ▼
            ┌──────────────────────────────────────┐
            │ Level-1 Feature Fusion (concat+norm) │
            └──────────────────┬───────────────────┘
                               │
            ┌──────────────────▼───────────────────┐
            │ Level-2 Decision Fusion (weighted +  │
            │ calibrated, optionally learned)      │
            └──────────────────┬───────────────────┘
                               ▼
            ┌──────────────────────────────────────┐
            │ Temporal Model (LSTM, 5–10 s window) │
            │ stress / engagement / attention      │
            └──────────────────┬───────────────────┘
                               ▼
            ┌──────────────────────────────────────┐
            │ Level-3 Context Fusion + Reasoning   │
            │ rule engine + ML classifier + XAI    │
            └──────┬────────────────────────┬──────┘
                   │                        │
                   ▼                        ▼
            Event Detector         Adaptive Responder
                   │                        │
                   └───────────┬────────────┘
                               ▼
                       FastAPI WebSocket
                               ▼
                        React Dashboard
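
To illustrate the Level-2 decision fusion stage in the diagram, the sketch below blends audio and vision class probabilities using normalized per-modality weights. The function name `fuse_decisions` and the choice to use each modality's confidence as its weight are assumptions made for this example; the engine's calibrated or learned weighting may work differently.

```python
EMOTIONS = ("happy", "sad", "angry", "neutral", "frustrated")


def fuse_decisions(
    audio_probs: dict[str, float],
    vision_probs: dict[str, float],
    audio_conf: float,
    vision_conf: float,
) -> tuple[dict[str, float], dict[str, float]]:
    """Blend two class distributions; returns (fused_probs, source_weights)."""
    total = audio_conf + vision_conf
    if total <= 0:
        # No usable confidence from either modality: weight them equally.
        w_audio = w_vision = 0.5
    else:
        w_audio, w_vision = audio_conf / total, vision_conf / total

    fused = {
        label: w_audio * audio_probs.get(label, 0.0)
        + w_vision * vision_probs.get(label, 0.0)
        for label in EMOTIONS
    }
    # Renormalize in case an input distribution did not sum to 1.
    norm = sum(fused.values()) or 1.0
    fused = {label: p / norm for label, p in fused.items()}
    return fused, {"audio": w_audio, "vision": w_vision}


# Usage with made-up per-modality outputs.
audio = {"happy": 0.1, "sad": 0.1, "angry": 0.2, "neutral": 0.2, "frustrated": 0.4}
vision = {"happy": 0.0, "sad": 0.1, "angry": 0.1, "neutral": 0.1, "frustrated": 0.7}
fused, weights = fuse_decisions(audio, vision, audio_conf=0.42, vision_conf=0.58)
print(max(fused, key=fused.get))  # frustrated
```

The returned weights dictionary mirrors the shape of the `source_weights` field in the output frame, which makes the contribution of each modality easy to surface in an explanation.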
cognitive-state-engine/
├── backend/                # Python: capture, models, fusion, reasoning, API
│   ├── pipelines/          # audio + vision capture & feature extraction
│   ├── models/             # CNNs, LSTM, wrappers
│   ├── fusion/             # 3-level fusion
│   ├── reasoning/          # rules, ML classifier, explainability
│   ├── calibration/        # per-user baseline
│   ├── events/             # event/anomaly detection
│   ├── adaptive/           # adaptive responder
│   ├── state/              # rolling state store
│   ├── utils/              # timing, logging, schemas
│   └── main.py             # FastAPI + WebSocket entry point
├── frontend/               # React + Vite + TypeScript dashboard
├── training/               # training scripts for audio / vision / temporal
├── scripts/                # TFLite export, dev launchers
├── config/                 # YAML configuration
├── tests/                  # unit tests
├── requirements.txt
└── README.md
cd cognitive-state-engine
python -m venv .venv
# Windows
.venv\Scripts\Activate.ps1
# macOS/Linux
source .venv/bin/activate
pip install -r requirements.txt

Run the engine + API server (auto-opens webcam & microphone):
python -m backend.main
# server on http://localhost:8000
# websocket on ws://localhost:8000/ws

Run headless (no API, console only) for quick smoke-testing:
python -m backend.main --headless

Run the React dashboard:
cd frontend
npm install
npm run dev
# dashboard on http://localhost:5173

The system ships with lightweight pretrained-style heads (initialized with sensible weights), so it runs end-to-end immediately. To improve accuracy on your domain, train your own:
# Vision emotion (FER-2013 style 48x48 grayscale)
python -m training.train_vision_emotion --data ./data/fer2013
# Audio emotion (RAVDESS / CREMA-D / IEMOCAP)
python -m training.train_audio_emotion --data ./data/ravdess
# Temporal trend model
python -m training.train_temporal --data ./logs/sessions

Trained weights are saved into backend/models/weights/ and picked up automatically.
python -m scripts.convert_to_tflite --model vision_emotion --quantize int8
python -m scripts.convert_to_tflite --model audio_emotion --quantize int8

See scripts/README_jetson.md for Jetson Nano notes.
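
For readers unfamiliar with what INT8 quantization does to weights, here is a small worked sketch of generic affine (scale and zero-point) quantization. It is not taken from the conversion script, whose internals are not described here; the function names `quantize_int8` and `dequantize_int8` are hypothetical and the per-tensor min/max range is an assumption.

```python
def quantize_int8(values: list[float]) -> tuple[list[int], float, int]:
    """Affine-quantize floats to int8; returns (quantized, scale, zero_point)."""
    # The representable range must include 0.0 so that zero maps exactly.
    lo = min(min(values), 0.0)
    hi = max(max(values), 0.0)
    scale = (hi - lo) / 255.0 or 1.0  # fall back to 1.0 for an all-zero tensor
    zero_point = round(-128 - lo / scale)
    quantized = [
        max(-128, min(127, round(v / scale) + zero_point)) for v in values
    ]
    return quantized, scale, zero_point


def dequantize_int8(q: list[int], scale: float, zero_point: int) -> list[float]:
    """Map int8 codes back to approximate float values."""
    return [(x - zero_point) * scale for x in q]


# Usage: the round trip loses at most about half a quantization step per value.
weights = [-1.0, 0.0, 0.5, 2.0]
q, scale, zp = quantize_int8(weights)
restored = dequantize_int8(q, scale, zp)
print(q, zp)
print([round(r, 3) for r in restored])
```

Each weight is stored in one byte instead of four, at the cost of a small rounding error bounded by the scale; whether that error is acceptable for the emotion models is something to check empirically after export.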
Every ~100 ms the backend emits a structured frame on the WebSocket:
{
  "ts": 1717000000.123,
  "emotion": {
    "label": "frustrated",
    "confidence": 0.78,
    "probs": {"happy":0.02,"sad":0.05,"angry":0.10,"neutral":0.05,"frustrated":0.78},
    "source_weights": {"audio": 0.42, "vision": 0.58}
  },
  "cognitive_state": "overloaded",
  "intent": "problem-solving",
  "attention": "low",
  "stress": 0.71,
  "engagement": 0.34,
  "fatigue": 0.22,
  "trend": "deteriorating",
  "events": [{"type":"frustration_spike","severity":0.8,"ts":1717000000.0}],
  "anomaly_score": 0.61,
  "explanation": [
    "pitch z-score +1.9 (above baseline)",
    "brow tension +1.2 (above baseline)",
    "rule: pitch↑ + brow↑ → frustration"
  ],
  "adaptive_action": "suggest_short_break",
  "calibration": {"samples": 312, "ready": true}
}

config/default.yaml controls cameras, sample rates, model paths, thresholds, fusion weights and adaptive responses. CLI flags and environment variables override the values in this file.
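
A client that consumes the WebSocket stream will typically want to pull a few typed fields out of each frame. The sketch below parses one frame of the shape documented above using only the standard library. The `FrameSummary` dataclass and `parse_frame` function are hypothetical names for this example, and treating `events` and `calibration` as optional is an assumption.

```python
import json
from dataclasses import dataclass


@dataclass(frozen=True)
class FrameSummary:
    """A small typed view over one output frame (illustrative only)."""
    ts: float
    emotion: str
    confidence: float
    cognitive_state: str
    stress: float
    event_types: tuple[str, ...]
    calibrated: bool


def parse_frame(raw: str) -> FrameSummary:
    """Parse one JSON frame; raises KeyError if a required field is missing."""
    msg = json.loads(raw)
    return FrameSummary(
        ts=float(msg["ts"]),
        emotion=msg["emotion"]["label"],
        confidence=float(msg["emotion"]["confidence"]),
        cognitive_state=msg["cognitive_state"],
        stress=float(msg["stress"]),
        # Assumed optional: default to "no events" and "not calibrated".
        event_types=tuple(e["type"] for e in msg.get("events", [])),
        calibrated=bool(msg.get("calibration", {}).get("ready", False)),
    )


# Usage with a trimmed-down frame.
sample = json.dumps({
    "ts": 1717000000.123,
    "emotion": {"label": "frustrated", "confidence": 0.78},
    "cognitive_state": "overloaded",
    "stress": 0.71,
    "events": [{"type": "frustration_spike", "severity": 0.8, "ts": 1717000000.0}],
    "calibration": {"samples": 312, "ready": True},
})
frame = parse_frame(sample)
print(frame.emotion, frame.event_types)  # frustrated ('frustration_spike',)
```

Ignoring scores until `calibrated` is true is one reasonable client-side policy, since z-normalized values are not meaningful before the baseline is ready.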
- Vision pipeline: ~25–30 FPS on a modern laptop CPU (MediaPipe).
- Audio pipeline: ~10 Hz feature extraction on a 1 s rolling buffer.
- Fusion + temporal model: <5 ms per tick.
- End-to-end latency target: <200 ms.
- All pipelines run on dedicated threads with bounded queues, so no stage blocks another.
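
The last point above (dedicated threads with bounded queues) can be sketched as follows. This is one common way to keep a producer from blocking: when the queue is full, the oldest item is dropped so consumers always see recent data. The helper name `put_latest` and the drop-oldest policy are assumptions for illustration; the backend may use a different overflow strategy.

```python
import queue
import threading


def put_latest(q: "queue.Queue", item) -> None:
    """Enqueue without blocking; if the queue is full, drop the oldest item."""
    while True:
        try:
            q.put_nowait(item)
            return
        except queue.Full:
            try:
                q.get_nowait()  # discard the stalest item to make room
            except queue.Empty:
                pass  # a consumer emptied the queue first; retry the put


# Usage: a capture thread produces 10 frames into a queue that holds 3.
frames: queue.Queue = queue.Queue(maxsize=3)
producer = threading.Thread(
    target=lambda: [put_latest(frames, i) for i in range(10)]
)
producer.start()
producer.join()

latest = [frames.get_nowait() for _ in range(frames.qsize())]
print(latest)  # [7, 8, 9]
```

Dropping stale frames trades completeness for latency, which fits a real-time target such as the <200 ms end-to-end figure; a logging or training pipeline would more likely prefer to block or buffer instead.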
This is a professional prototype: every module is in place, runs end-to-end, and the model heads are designed to be replaced with your own trained weights. See training/README.md for what to train next.
Created and maintained by Arshia Keshvari (@TeslaNeuro).