NudgeMath

NudgeMath is a math-hint generator that nudges students toward the correct answer without revealing it, paired with an evaluation harness that scores hint quality against pedagogical rubrics. The project is built contract-first: core data shapes are defined in Python and propagated through typed boundaries to the API and frontend.

Architecture (contract-first story for reviewers): docs/ARCHITECTURE.md

First live eval (portfolio artifact): docs/FIRST_EVAL.md. An offline llama3.2 run confirmed that model self-reports are unreliable, overturned the predicted judge parse failures, and, with a neutral mistral judge, showed that self-judging compresses scores.

Hint view (live)

Screenshot (Wrong_Answer.png): wrong student answer, hint generated (answer-blind LLM; seed case Solve for x: 2x - 5 = 9).

Screenshot (Correct_Answer.png): correct student answer, no hint; gating only (matches seed case or optional teacher correctAnswer).

Architecture & stack rationale

| Layer | Choice | Why |
| --- | --- | --- |
| Hint logic | Python | Rich LLM ecosystem, easy dataclass contracts, natural home for eval scripts. |
| API boundary | Strawberry GraphQL + FastAPI | Schema-first types that mirror Python models; FastAPI gives async, OpenAPI, and easy local dev. |
| Frontend | Vite + React + TypeScript + Tailwind + Apollo | Typed components and GraphQL client codegen keep the UI aligned with the same contracts end to end. |

The through-line is a typed, contract-first stack: Python dataclasses → GraphQL SDL → codegen → TypeScript client — shape drift is caught at build time, and the answer-blind boundary reaches the browser by construction on the generation path.

Evaluation

Evaluation has two layers: deterministic gates (must-pass, fast, reproducible, CI-blocking) and LLM-judge scoring (qualitative rubric, non-deterministic — run with --judge).

Important boundary: generate_hint() never sees the correct answer. The judge (judge_hint()) receives EvalCase.correct_answer — that is intentional so it can score against truth.
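A minimal sketch of that boundary, assuming plain dataclasses. Only the names HintRequest, EvalCase, and correct_answer come from this README; the other field names are snake_case guesses based on the GraphQL example further down, and field_names is a hypothetical helper.

```python
from dataclasses import dataclass, fields


@dataclass(frozen=True)
class HintRequest:
    """What generate_hint() is allowed to see (field names are assumptions)."""
    problem: str
    student_answer: str
    grade_level: str


@dataclass(frozen=True)
class EvalCase:
    """What the judge may see; correct_answer lives only on this type."""
    request: HintRequest
    correct_answer: str


def field_names(cls) -> set:
    """Return the declared dataclass field names of cls (hypothetical helper)."""
    return {f.name for f in fields(cls)}


# The generation-side type has no answer field, so it cannot leak one.
assert "correct_answer" not in field_names(HintRequest)
assert "correct_answer" in field_names(EvalCase)
```

Keeping the answer off the request type means the boundary is enforced by the shape of the data rather than by discipline at each call site.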

Deterministic gates (`hint_engine/evaluation.py`)

does_not_reveal_answer — normalized correct-answer value must not appear in hint text; numeric answers also checked via word-boundary regex (documented false positives: e.g. "step 7"); fraction literals checked when applicable
reveals_answer_flag — hint must not self-report reveals_answer=True
non_empty — hint text is non-empty after strip
within_max_length — hint length ≤ 500 characters
no_banned_phrases — no "the answer is" / "the correct answer" / "the solution is"
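The gates above could be sketched roughly as follows. This is an illustration, not the contents of hint_engine/evaluation.py: the function names and signatures are assumptions, fraction handling is omitted, and answer_value is assumed to be a bare non-negative number such as "7".

```python
import re

MAX_HINT_LENGTH = 500
BANNED_PHRASES = ("the answer is", "the correct answer", "the solution is")


def does_not_reveal_answer(hint_text: str, answer_value: str) -> bool:
    """True when the answer value does not appear as a standalone token."""
    pattern = r"\b" + re.escape(answer_value.strip()) + r"\b"
    return re.search(pattern, hint_text) is None


def run_gates(hint_text: str, reveals_answer: bool, answer_value: str) -> dict:
    """Evaluate each gate; every value must be True for the hint to pass."""
    lowered = hint_text.lower()
    return {
        "does_not_reveal_answer": does_not_reveal_answer(hint_text, answer_value),
        "reveals_answer_flag": not reveals_answer,
        "non_empty": bool(hint_text.strip()),
        "within_max_length": len(hint_text) <= MAX_HINT_LENGTH,
        "no_banned_phrases": not any(p in lowered for p in BANNED_PHRASES),
    }


good = run_gates("What happens if you add 5 to both sides first?", False, "7")
# "step 7" trips the word-boundary check even though it is not the answer:
# this is the documented false positive.
leaky = run_gates("In step 7, check your sign.", False, "7")
```

The word-boundary match keeps "17" from counting as a leak of "7", at the cost of flagging incidental uses of the same number.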

LLM-judge rubric (`hint_engine/judge.py`)

addresses_specific_error (must-pass) — targets the student's actual mistake, not generic advice
no_semantic_answer_leak (must-pass) — no paraphrased answer leakage (e.g. "you'll end up with seven")
appropriate_for_level (advisory) — tone and vocabulary fit the problem level
guides_without_solving (advisory) — points at the next step without working through to the result

The judge's passed flag is true only when both must-pass items pass; score is the fraction of all four rubric items that pass, advisory items included.
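As a worked illustration of that aggregation rule, here is a short sketch. The rubric item names come from the list above; the aggregate function and the dict-of-booleans input are assumptions made for the example.

```python
MUST_PASS = ("addresses_specific_error", "no_semantic_answer_leak")
ADVISORY = ("appropriate_for_level", "guides_without_solving")


def aggregate(results: dict) -> tuple:
    """Return (passed, score) from per-item rubric verdicts."""
    items = MUST_PASS + ADVISORY
    passed = all(results[name] for name in MUST_PASS)
    score = sum(results[name] for name in items) / len(items)
    return passed, score


verdicts = {
    "addresses_specific_error": True,
    "no_semantic_answer_leak": True,
    "appropriate_for_level": False,
    "guides_without_solving": True,
}
print(aggregate(verdicts))  # (True, 0.75)
```

A hint can therefore pass with a score below 1.0 (an advisory miss), and can fail with a score as high as 0.75 (a single must-pass miss).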

LLM generation

Generation and judge use a provider-agnostic LLMClient Protocol (hint_engine/llm_client.py) with an OpenAICompatibleClient implementation (openai==2.43.0). Model and provider are config, not hardcoded — resolved from environment variables via hint_engine/config.py.
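To show what a provider-agnostic Protocol buys, here is a sketch with a canned test double. LLMClient is the README's name; the method name complete, its signature, CannedClient, and generate_hint_text are all hypothetical and may not match hint_engine/llm_client.py.

```python
from typing import Protocol


class LLMClient(Protocol):
    """Structural interface; the method name `complete` is an assumption."""

    def complete(self, system: str, user: str) -> str: ...


class CannedClient:
    """Test double that satisfies LLMClient without any network call."""

    def __init__(self, reply: str) -> None:
        self.reply = reply
        self.calls = []

    def complete(self, system: str, user: str) -> str:
        self.calls.append((system, user))
        return self.reply


def generate_hint_text(client: LLMClient, problem: str, student_answer: str) -> str:
    """Build an answer-blind prompt and return the model's reply."""
    user = f"Problem: {problem}\nStudent answer: {student_answer}"
    return client.complete("Give one hint. Do not state the final answer.", user)


client = CannedClient("Check what you did with the -5.")
hint = generate_hint_text(client, "Solve for x: 2x - 5 = 9", "x = 2")
```

Because the Protocol is structural, CannedClient needs no inheritance, which is one way CI could mock generation and judge calls without touching a real endpoint.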

Offline by default (Ollama)

ollama pull llama3.2
# No API key required for Ollama; the client defaults to http://localhost:11434/v1
python -m hint_engine.run_eval
# Comparing against sonnet-4.6 additionally requires ANTHROPIC_API_KEY
python -m hint_engine.model_comparison --models llama3.2,sonnet-4.6 --judge

Environment variables

| Variable | Purpose | Default |
| --- | --- | --- |
| `LLM_DEFAULT_PROVIDER` | `ollama` or `anthropic` | `ollama` |
| `LLM_GEN_NAME` / `LLM_GEN_MODEL` / `LLM_GEN_BASE_URL` / `LLM_GEN_PROVIDER` | Generation endpoint | Ollama `llama3.2` |
| `LLM_GEN_API_KEY_ENV` | Name of the env var that holds the API key (if needed) | none for Ollama |
| `LLM_JUDGE_*` | Judge endpoint (defaults to gen config when unset) | same as gen |

Anthropic example (PowerShell):

$env:LLM_DEFAULT_PROVIDER="anthropic"
$env:ANTHROPIC_API_KEY="your-key"
$env:LLM_GEN_MODEL="claude-sonnet-4-6"

Generation and judge default to the same model but can diverge via separate LLM_GEN_* and LLM_JUDGE_* settings.
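The fallback from judge settings to generation settings might look like the sketch below. The variable names and Ollama defaults are taken from the table above; EndpointConfig, resolve, and the per-field fallback behavior are assumptions, and the real hint_engine/config.py may resolve things differently (for example, the default model for a non-Ollama provider is not specified here).

```python
from dataclasses import dataclass
from typing import Mapping, Optional

OLLAMA_BASE_URL = "http://localhost:11434/v1"


@dataclass(frozen=True)
class EndpointConfig:
    provider: str
    model: str
    base_url: Optional[str]


def resolve(role: str, env: Mapping[str, str],
            fallback: Optional[EndpointConfig] = None) -> EndpointConfig:
    """Read LLM_<ROLE>_* variables, falling back field by field when unset."""
    default_provider = env.get("LLM_DEFAULT_PROVIDER", "ollama")
    base = fallback or EndpointConfig(
        provider=default_provider,
        model="llama3.2",  # assumed default; only documented for Ollama
        base_url=OLLAMA_BASE_URL if default_provider == "ollama" else None,
    )
    prefix = f"LLM_{role}_"
    return EndpointConfig(
        provider=env.get(prefix + "PROVIDER", base.provider),
        model=env.get(prefix + "MODEL", base.model),
        base_url=env.get(prefix + "BASE_URL", base.base_url),
    )


env = {"LLM_JUDGE_MODEL": "mistral"}
gen = resolve("GEN", env)
judge = resolve("JUDGE", env, fallback=gen)  # differs from gen only in model
```

Passing the environment in as a mapping, rather than reading os.environ inside the function, keeps the resolution logic testable without mutating process state.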

generate_hint() still sees only HintRequest — answer-blind boundary unchanged. Provider/model appear in Hint.meta and JudgeResult.meta (name, model, provider).

Running the eval

$env:ANTHROPIC_API_KEY="your-key-here"  # only needed when the provider is anthropic; Ollama needs no key
python -m hint_engine.run_eval          # deterministic only (fast, free)
python -m hint_engine.run_eval --judge  # + LLM-judge scoring per case

Prints a one-line PASS/FAIL summary per seed case (with judge score when --judge), plus deterministic and overall tallies. CI tests mock all API calls; these commands hit the real LLM when configured.

Cross-model comparison

python -m hint_engine.model_comparison --models llama3.2,sonnet-4.6 --judge

Produces a cases × models table (deterministic pass / judge score per cell) plus per-model aggregate tallies including judge_ok and parse_fail rates. With --judge, the runner pins a neutral external judge (sonnet-4.6 by default) so rubric scores are comparable across generation models; override via LLM_JUDGE_*. Cells where generation and judge share the same model are marked * (self-judged — not comparable). EvalReport is unchanged — comparison is an aggregation layer on top.
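The self-judged marker can be illustrated with a small cell formatter. The exact cell layout printed by model_comparison is not shown in this README, so the "PASS / 0.75" format and the format_cell helper are assumptions; only the rule (mark with * when generation and judge share a model) comes from the text above.

```python
from typing import Optional


def format_cell(det_pass: bool, judge_score: Optional[float],
                gen_model: str, judge_model: str) -> str:
    """Render one cases-by-models cell; '*' flags a self-judged score."""
    det = "PASS" if det_pass else "FAIL"
    if judge_score is None:  # run without --judge
        return det
    marker = "*" if gen_model == judge_model else ""
    return f"{det} / {judge_score:.2f}{marker}"


print(format_cell(True, 0.75, "sonnet-4.6", "sonnet-4.6"))  # PASS / 0.75*
print(format_cell(True, 0.5, "llama3.2", "sonnet-4.6"))     # PASS / 0.50
```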

GraphQL API

Stack: Strawberry GraphQL 0.319.0 + FastAPI 0.138.0 + Uvicorn 0.49.0 (hint_engine/api/).

$env:ANTHROPIC_API_KEY="your-key-here"  # only needed when the provider is anthropic; Ollama needs no key
uvicorn hint_engine.api.app:app --reload

GraphiQL at http://127.0.0.1:8000/graphql. CORS allows http://localhost:5173 (Vite default).

Answer gating (before the LLM)

Hints appear only when the student answer is wrong. The API compares the student submission to a known correct answer before calling generate_hint() — the LLM never receives correctAnswer.

Seed problems — correct answer resolved automatically when the problem text matches a seed eval case.
Custom problems — optional Correct answer (teacher only) on the Hint form, or correctAnswer on HintRequestInput.
Equivalent forms accepted: 7, =7, x = 7 when the answer is x = 7.
Rejected as wrong: conflicting multi-value input such as =2 =3.

When the answer matches, the response has answerCorrect: true and empty hintText — no LLM call.
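The equivalence and rejection rules above could be sketched as follows. This is a simplified string comparison under stated assumptions: extract_value and is_correct are hypothetical names, and the real gate may normalize further (for example numeric forms such as 7.0).

```python
from typing import Optional


def extract_value(text: str) -> Optional[str]:
    """Return the single stated value, or None for empty or conflicting input."""
    parts = [p.strip() for p in text.split("=")]
    # With no "=", the whole input is the value; otherwise everything after
    # the first "=" is a candidate and the left side ("x" or "") is ignored.
    candidates = parts if len(parts) == 1 else parts[1:]
    distinct = {c for c in candidates if c}
    return distinct.pop() if len(distinct) == 1 else None


def is_correct(student_answer: str, correct_answer: str) -> bool:
    """Gate check that runs before any LLM call."""
    expected = extract_value(correct_answer)
    return expected is not None and extract_value(student_answer) == expected


for submission in ("7", "=7", "x = 7", "=2 =3", "x = 2"):
    print(repr(submission), is_correct(submission, "x = 7"))
```

Treating conflicting input as wrong, rather than picking one of the values, avoids rewarding a student who hedges with several answers.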

Answer confidentiality (two surfaces)

Generation is answer-blind by construction (schema-enforced, tested). HintType has no correctAnswer field — the model never sees the known answer. HintRequestInput may include optional correctAnswer for teacher-side gating only; it is not passed to generate_hint(). A CI introspection test fails if correctAnswer appears on generation response types.
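One way such an introspection check could work is to scan the exported SDL for the forbidden field. The sketch below uses a regex over a hand-written SDL sample; the sample text and the fields_by_type helper are illustrative assumptions, and the actual CI test may introspect the Strawberry schema object instead.

```python
import re

# Illustrative SDL only; the committed schema.graphql will contain more.
SDL = """
type HintType {
  hintText: String!
  revealsAnswer: Boolean!
}

input HintRequestInput {
  problem: String!
  studentAnswer: String!
  correctAnswer: String
}
"""


def fields_by_type(sdl: str) -> dict:
    """Map each `type` or `input` block name to the set of its field names."""
    blocks = re.findall(r"\b(?:type|input)\s+(\w+)[^{]*\{([^}]*)\}", sdl)
    return {
        name: set(re.findall(r"^\s*(\w+)\s*[(:]", body, flags=re.MULTILINE))
        for name, body in blocks
    }


fields = fields_by_type(SDL)
# Response type must stay answer-free; the input type may carry the gate value.
assert "correctAnswer" not in fields["HintType"]
assert "correctAnswer" in fields["HintRequestInput"]
```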

The eval/admin surface is answer-aware by design and is not access-controlled in this demo. The hints query returns EvalCaseType.correctAnswer for all callers; evaluateCase runs against seed cases that include the known answer server-side. Anyone with the endpoint can query hints { correctAnswer }. That is a deliberate acceptance for a portfolio demo — not a silent gap. In production, the student-facing generation API and the eval/admin API would be separated behind auth.

EvalReportType mirrors EvalReport.to_dict() field-for-field at the top level (hintText, revealsAnswer, and meta are report-level mirrors of the generated hint, not a nested hint object). Typed HintMetaType expands the JSON meta dict for codegen. Source of truth for the envelope shape is EvalReport.to_dict() in Python; GraphQL is derived from it.
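An envelope-agreement check of this kind might compare the keys of the Python dict with the GraphQL field names. The sketch assumes to_dict() uses snake_case keys and that the GraphQL names are their camelCase forms (Strawberry's default naming); to_camel and envelope_mismatch are hypothetical helpers, and the sample keys cover only the three fields named above.

```python
def to_camel(name: str) -> str:
    """Convert snake_case to camelCase, e.g. hint_text -> hintText."""
    head, *rest = name.split("_")
    return head + "".join(part.capitalize() for part in rest)


def envelope_mismatch(report_dict: dict, graphql_fields: set) -> set:
    """Return names present on only one side after camel-casing dict keys."""
    return {to_camel(key) for key in report_dict} ^ graphql_fields


report = {"hint_text": "Check the sign.", "reveals_answer": False, "meta": {}}
print(envelope_mismatch(report, {"hintText", "revealsAnswer", "meta"}))  # set()
```

An empty result means the two shapes agree at the top level; any name in the result points at the side that drifted.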

| Operation | Purpose |
| --- | --- |
| `generateHint(request)` | Student-facing hint generation (answer-blind) |
| `evaluateCase(caseId, withJudge?)` | Eval harness: runs generation + gates (+ optional judge) |
| `hints` | Eval/admin: lists seed cases including `correctAnswer` |

Example:

mutation {
  generateHint(request: {
    problem: "Solve for x: 2x - 5 = 9"
    studentAnswer: "x = 2"
    gradeLevel: "8"
  }) {
    hintText
    revealsAnswer
    meta { model latencyMs }
  }
}
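The same mutation can be sent from Python with only the standard library. This is a sketch under assumptions: the URL is the local GraphiQL address given above, the variable type HintRequestInput! assumes the argument is non-null, and build_request and send are hypothetical helpers.

```python
import json
import urllib.request

GRAPHQL_URL = "http://127.0.0.1:8000/graphql"

MUTATION = """
mutation Hint($request: HintRequestInput!) {
  generateHint(request: $request) {
    hintText
    revealsAnswer
    meta { model latencyMs }
  }
}
"""


def build_request(problem: str, student_answer: str,
                  grade_level: str) -> urllib.request.Request:
    """Build the POST request; note that no correctAnswer is included."""
    payload = {
        "query": MUTATION,
        "variables": {"request": {
            "problem": problem,
            "studentAnswer": student_answer,
            "gradeLevel": grade_level,
        }},
    }
    return urllib.request.Request(
        GRAPHQL_URL,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def send(request: urllib.request.Request) -> dict:
    """POST the request and decode the JSON reply (needs a running server)."""
    with urllib.request.urlopen(request, timeout=30) as response:
        return json.load(response)


req = build_request("Solve for x: 2x - 5 = 9", "x = 2", "8")
# Call send(req) once uvicorn is running to get the GraphQL response.
```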

Frontend

Stack: Vite 8.0.12 + React 19.2.6 + TypeScript 6.0 + Tailwind CSS 4.3 + Apollo Client 4.2.3 + GraphQL Code Generator 7.1.3 (frontend/).

Types flow schema → SDL → codegen → client — no hand-written interfaces mirroring the server. Codegen reads the committed schema.graphql at the repo root (not the live endpoint), so the repo builds on clone without the server running or an API key.

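Re-export SDL when the Python schema changes (this form writes UTF-8 without a BOM from any shell, including PowerShell):

python -c "from pathlib import Path; from strawberry.printer import print_schema; from hint_engine.api.schema import schema; Path('schema.graphql').write_text(print_schema(schema) + '\n', encoding='utf-8')"
cd frontend ; npm run codegen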

Run locally (two terminals)

With the venv activated (.\.venv\Scripts\Activate.ps1):

# Terminal 1 — from repo root
uvicorn hint_engine.api.app:app --reload

# Terminal 2 — frontend (venv not required)
cd frontend
npm install    # first time only
npm run dev

Open http://localhost:5173. The frontend talks to http://localhost:8000/graphql.

For live hints, run Ollama and pull the default model first:

ollama pull llama3.2

Try it: problem Solve for x: 2x - 5 = 9, student answer x = 2 → hint; x = 7 → “Your answer looks correct — no hint needed.”

Views

| View | GraphQL | Boundary |
| --- | --- | --- |
| Hint (student) | `generateHint` only | LLM answer-blind; optional `correctAnswer` on input gates hints only |
| Eval (admin/portfolio) | `hints` + `evaluateCase` | Answer-aware: shows seed cases, full `EvalReportType` report card |

The eval report card uses one CheckResultRow component for both deterministic checks and judge rubric items (uniform { name, passed, detail } shape). Pass/fail is color-coded; advisory signals (flagDisagreement, modelAnswerDisagreement) are visible on the eval view.

cd frontend ; npm test    # Vitest + React Testing Library, mocked Apollo

CI

GitHub Actions workflow (.github/workflows/ci.yml) runs on every push and PR — fully offline, no ANTHROPIC_API_KEY. Generation and judge are mocked in tests; the deterministic gate (pytest), not the LLM-judge, blocks the build.

| Job | What it guards |
| --- | --- |
| Python tests | Deterministic gates, answer-blind introspection, envelope agreement, mocked API |
| SDL drift check | Committed `schema.graphql` matches `python -m hint_engine.api.export_schema` |
| Frontend | Codegen output committed (`git diff src/generated/`), then `npm run build` + `npm test` |

Jobs run in parallel so the Actions tab shows three named checks — easy for a portfolio reviewer to see which layer broke.

Re-export SDL when the Python schema changes:

python -c "from pathlib import Path; from strawberry.printer import print_schema; from hint_engine.api.schema import schema; Path('schema.graphql').write_text(print_schema(schema) + '\n', encoding='utf-8')"
cd frontend ; npm run codegen

On Linux/macOS: python -m hint_engine.api.export_schema > schema.graphql. Avoid PowerShell Out-File -Encoding utf8: in Windows PowerShell 5.1 it prepends a UTF-8 BOM, so the committed file no longer matches the export on Linux CI and the drift check fails.
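If a BOM has already slipped into the committed file, a small script can detect and remove it. This is a convenience sketch, not part of the repo; strip_utf8_bom is a hypothetical helper.

```python
import codecs
from pathlib import Path


def strip_utf8_bom(path: Path) -> bool:
    """Remove a leading UTF-8 BOM in place; return True if one was found."""
    data = path.read_bytes()
    if not data.startswith(codecs.BOM_UTF8):
        return False
    path.write_bytes(data[len(codecs.BOM_UTF8):])
    return True


if __name__ == "__main__":
    target = Path("schema.graphql")
    if target.exists():
        print("BOM removed" if strip_utf8_bom(target) else "no BOM found")
```

Working on raw bytes avoids any newline translation, so the rest of the file is left exactly as it was.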

Setup

Use a project virtual environment so Python deps stay isolated from system Python (CI always does a clean pip install on a fresh runner; locally, a venv is the equivalent).

python -m venv .venv
.\.venv\Scripts\Activate.ps1   # prompt should show (.venv)
pip install -r requirements.txt
pytest -q

Verify you're in the venv before installing or running:

$env:VIRTUAL_ENV                    # should print ...\NudgeMath\.venv
(Get-Command python).Source        # should point inside .venv\Scripts\

If Activate.ps1 fails with an execution-policy error: Set-ExecutionPolicy -Scope CurrentUser RemoteSigned once, then re-run activate.

Python version: 3.11+ (developed on 3.11.9).

The frontend (frontend/) keeps its own dependencies in node_modules, installed by npm install; they are isolated from the Python venv and do not require it to be active.

Project layout

hint_engine/
  ...
  api/
    schema.py
    app.py
    export_schema.py   # SDL export for frontend codegen
schema.graphql         # committed SDL (codegen source of truth)
frontend/
  src/
    generated/         # GraphQL Code Generator output (committed)
    components/
    graphql/operations.graphql
tests/                 # Python tests
