I wanted to share a summary of the recent work in our Innovation Lab using LLMs for item generation, in cooperation with LIST, the Luxembourg Institute of Science and Technology.
What we tested
- Had modern LLMs draft multiple-choice questions from short texts (a science set and a CS set).
- Judged each draft against a clear checklist of 14 quality traits (clarity, difficulty, bias/inclusivity, etc.).
- Used LLMs as validators (LLM-as-a-Judge): they don't write items, they score them against the checklist above.
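To make the validator setup concrete, here is a minimal sketch of how an LLM-as-a-Judge call can be framed: build a scoring prompt over the trait checklist and parse the model's JSON verdict. The trait names, field names, and prompt wording are illustrative assumptions, not our production setup (the real checklist has 14 traits).

```python
import json

# Hypothetical subset of the 14-trait checklist; names are illustrative only.
TRAITS = ["clarity", "difficulty", "correctness", "bias_inclusivity"]

def build_judge_prompt(item: dict, traits: list[str]) -> str:
    """Ask a validator LLM to score an item against each trait, returning JSON."""
    trait_lines = "\n".join(f"- {t}: pass or fail" for t in traits)
    return (
        "You are a validator. Do not rewrite the item; only score it.\n"
        f"Question: {item['question']}\n"
        f"Options: {item['options']}\n"
        f"Answer key: {item['answer']}\n"
        'Return a JSON object mapping each trait to "pass" or "fail":\n'
        f"{trait_lines}"
    )

def parse_verdict(raw: str, traits: list[str]) -> dict:
    """Parse the model's JSON reply; malformed output or missing traits count as 'fail'."""
    try:
        verdict = json.loads(raw)
    except json.JSONDecodeError:
        return {t: "fail" for t in traits}
    return {t: verdict.get(t, "fail") for t in traits}
```

Treating unparseable output as "fail" is a deliberately conservative choice: anything the validator cannot score cleanly gets routed to human review rather than silently passing.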
Headline results
- Strong agreement across validators in our controlled runs (~96% global accuracy).
- Paneling multiple models as validators improves reliability: for example, MCC rose into the 90%+ range on traits like Correctness and Difficulty when models were combined.
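For readers unfamiliar with the metric: MCC (Matthews correlation coefficient) summarizes a binary confusion matrix in one number from -1 to +1, and unlike plain accuracy it stays honest on imbalanced trait data. A small self-contained sketch of the per-trait computation (standard formula, not our evaluation code):

```python
from math import sqrt

def mcc(y_true: list[int], y_pred: list[int]) -> float:
    """Matthews correlation coefficient for binary labels (1 = trait satisfied)."""
    tp = sum(t == 1 and p == 1 for t, p in zip(y_true, y_pred))
    tn = sum(t == 0 and p == 0 for t, p in zip(y_true, y_pred))
    fp = sum(t == 0 and p == 1 for t, p in zip(y_true, y_pred))
    fn = sum(t == 1 and p == 0 for t, p in zip(y_true, y_pred))
    denom = sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    # Convention: return 0.0 when any marginal is empty (undefined denominator).
    return 0.0 if denom == 0 else (tp * tn - fp * fn) / denom
```

A perfect validator scores 1.0, random guessing hovers near 0.0, which is why MCC in the 90%+ range is a strong result.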
Important limits & risks (what didn’t work yet)
- Grade-level alignment is fragile. When we intentionally "mutated" the student grade (e.g., Grade 2 ↔︎ Grade 12), most validators missed the issue; in one summary table, only ~5% of such mutants were caught by at least one LLM.
- On the same grade-mutation test, recall for key traits like Scope and Background was low for several models (some recalls in the 0.04–0.40 range), showing under-detection even where precision looked high.
- Workload (is the task too long or too short for the level?) is hard to judge: recalls around 0.11–0.36 for several validators.
- Model performance is uneven: some families needed formatting cleanup before scoring, which adds operational friction for productization (not a blocker, but a cost).
- Takeaway: validators are helpful but not sufficient on their own; teacher review remains essential.
What we’re changing next (in flight)
- More context in prompts (explicit grade/subject/skill) so validators can judge against the right target profile, not just the text.
- Stronger stress tests via metamorphic "mutations" (e.g., swap the correct answer, shift the grade) to harden validators and track MCC/F1 by trait.
- Diverse validator panels (mixing Llama, GPT, DeepSeek, Mistral) with agreement/voting to raise recall on tricky traits.
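The two mechanisms above can be sketched together: a metamorphic mutation deliberately breaks one property of an item (here, the target grade), and a panel vote aggregates several validators' verdicts. Field names and the tie-breaking rule are illustrative assumptions, not our production pipeline.

```python
from collections import Counter

def mutate_grade(item: dict, new_grade: int) -> dict:
    """Metamorphic mutation: shift the target grade while leaving the text
    unchanged. A good validator should now flag grade-sensitive traits
    (e.g., Scope, Background) as failing."""
    mutated = dict(item)  # copy so the original item is untouched
    mutated["grade"] = new_grade
    return mutated

def panel_verdict(votes: list[str]) -> str:
    """Majority vote across a panel of validator models. Ties count as
    'fail' so ambiguous items are routed to teacher review."""
    counts = Counter(votes)
    return "pass" if counts["pass"] > counts["fail"] else "fail"
```

The point of the mutation harness is that ground truth is known by construction: every mutated item *should* fail, so any "pass" verdict from the panel is a measurable miss, which is exactly how the low recalls above were surfaced.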
How this affects you
- In the tools you use, you'll see traffic-light-style scores per trait and a consensus verdict from multiple LLM validators, but your approval stays final.
- We'll invite teachers to review edge cases (especially grade-level fit and workload). Your feedback directly tunes the validators.
Bottom line
LLM drafting + LLM validators already help on correctness/difficulty and agree well overall, but grade-level fit and workload detection are current weaknesses. We're addressing these with richer prompts, diversified validator panels, and systematic stress tests, always keeping teacher judgment in the loop.