Community update: AI-generated items (OAT × LIST)

I wanted to share a summary of recent work in our Innovation Lab, where we have been using LLMs to generate assessment items in cooperation with LIST, the Luxembourg Institute of Science and Technology.

What we tested

  • Had modern LLMs draft multiple-choice questions from short texts (science + a CS set).

  • Judged each draft against a clear checklist of 14 quality traits (clarity, difficulty, bias/inclusivity, etc.).

  • Used LLMs as validators (LLM-as-a-Judge): they don’t write items, they score them against the checklist above.
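To make the validator step concrete, here is a minimal sketch of how a judge prompt could be assembled and its verdict parsed. The trait names and the pass/fail scheme are illustrative assumptions, not our production code:

```python
import json

# Illustrative subset of the checklist (the real list has 14 traits).
TRAITS = ["clarity", "difficulty", "bias_inclusivity", "correctness"]

def build_judge_prompt(item: dict, traits=TRAITS) -> str:
    """Assemble a validator prompt asking an LLM to score one item per trait."""
    lines = [
        "You are a validator. Score the multiple-choice item below",
        "against each quality trait as 'pass' or 'fail'. Reply as JSON.",
        f"Traits: {', '.join(traits)}",
        f"Item: {json.dumps(item)}",
    ]
    return "\n".join(lines)

def parse_verdict(raw: str, traits=TRAITS) -> dict:
    """Parse the JSON verdict; missing or malformed answers default to 'fail'."""
    try:
        data = json.loads(raw)
    except json.JSONDecodeError:
        return {t: "fail" for t in traits}
    return {t: data.get(t, "fail") for t in traits}
```

Defaulting unknown traits to "fail" keeps the pipeline conservative: an item only passes a trait when the validator explicitly says so.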

Headline results

  • Strong agreement across validators in our controlled runs (~96% global accuracy).

    Paneling multiple models as validators improves reliability; for example, MCC rose into the 90%+ range on traits like Correctness and Difficulty when models were combined.
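For reference, the MCC metric we report can be computed directly from confusion-matrix counts. A minimal sketch (not tied to our evaluation code):

```python
import math

def mcc(tp: int, tn: int, fp: int, fn: int) -> float:
    """Matthews correlation coefficient from confusion-matrix counts.

    Ranges from -1 (total disagreement) through 0 (chance) to +1 (perfect),
    and, unlike accuracy, stays honest when classes are imbalanced.
    """
    num = tp * tn - fp * fn
    den = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return num / den if den else 0.0

# Example with made-up counts: 45 true positives/negatives, 5 errors each way.
# mcc(45, 45, 5, 5) -> 0.8
```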

Important limits & risks (what didn’t work yet)

  • Grade-level alignment is fragile. When we intentionally “mutated” the student grade (e.g., Grade 2 ↔︎ Grade 12), most validators missed the issue; in one summary table we saw only ~5% of such mutants caught by at least one LLM.

  • On the same grade mutation test, recall for key traits like Scope and Background was low for several models (e.g., some recalls near 0.04–0.40), showing under-detection even when precision looked high.

  • Workload (is the task too long/short for the level?) is hard: recalls around 0.11–0.36 on several validators.

  • Model performance is uneven: some model families needed formatting cleanup before their output could be scored, which adds operational friction for productization (not a blocker, but a cost).

  • Takeaway: validators are helpful but not sufficient on their own; teacher review remains essential.
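The precision/recall gap above is worth spelling out: a validator that flags almost nothing can still look precise. A minimal sketch with hypothetical counts:

```python
def precision(tp: int, fp: int) -> float:
    """Of the items the validator flagged, what fraction were truly problematic?"""
    return tp / (tp + fp) if (tp + fp) else 0.0

def recall(tp: int, fn: int) -> float:
    """Of the truly problematic items, what fraction did the validator flag?"""
    return tp / (tp + fn) if (tp + fn) else 0.0

# Hypothetical: 25 grade-mismatched mutants, the validator flags only 1,
# with no false alarms. precision(1, 0) -> 1.0, yet recall(1, 24) -> 0.04:
# the validator looks trustworthy while missing almost every real problem.
```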

What we’re changing next (in flight)

  • More context in prompts (explicit grade/subject/skill) so validators can judge with the right target profile, not just the text.

  • Stronger stress-tests via metamorphic “mutations” (e.g., swap correct answer, shift grade) to harden validators and track MCC/F1 by trait.

  • Diverse validator panels (mixing Llama, GPT, DeepSeek, Mistral) with agreement/voting to raise recall on tricky traits.
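Two of the mechanisms above, metamorphic mutations and panel voting, can be sketched in a few lines. These helpers are hypothetical, assuming items are plain dicts with a "grade" field:

```python
import copy
from collections import Counter

def mutate_grade(item: dict, new_grade: int) -> dict:
    """Metamorphic mutation: shift the target grade without touching the text.

    A reliable validator should flag the mutant on grade-fit traits.
    """
    mutant = copy.deepcopy(item)
    mutant["grade"] = new_grade
    return mutant

def panel_verdict(verdicts: list) -> str:
    """Majority vote across validator verdicts ('pass'/'fail'); ties fail."""
    counts = Counter(verdicts)
    return "pass" if counts["pass"] > counts["fail"] else "fail"
```

Resolving ties as "fail" is a deliberate design choice: when validators disagree, the item is routed to teacher review rather than waved through.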

How this affects you

  • In the tools you use, you’ll see traffic-light-style scores per trait and a consensus verdict from multiple LLM validators, but your approval stays final.

  • We’ll invite teachers to review edge cases (especially grade-level fit & workload). Your feedback directly tunes the validators.

Bottom line
LLM drafting + LLM validators already help on correctness/difficulty and agree well overall, but grade-level fit and workload detection are current weaknesses. We’re addressing these with richer prompts, diversified validator panels, and systematic stress tests, always keeping teacher judgment in the loop.