I wanted to share a summary of the recent work in our Innovation Lab using LLMs for item generation, in cooperation with LIST, the Luxembourg Institute of Science and Technology.
What we tested
- Had modern LLMs draft multiple-choice questions from short texts (a science set and a CS set).
- Judged each draft against a clear checklist of 14 quality traits (clarity, difficulty, bias/inclusivity, etc.).
- Used LLMs as validators (LLM-as-a-Judge): they don't write items, they score them against the checklist above.
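To make the validator setup concrete, here is a minimal sketch of how an LLM-as-a-Judge call can be framed: build a scoring prompt over the trait checklist and parse the model's JSON verdict. The trait names, field names, and prompt wording are illustrative assumptions, not our production setup (the real checklist has 14 traits).

```python
import json

# Hypothetical subset of the 14-trait checklist; names are illustrative only.
TRAITS = ["clarity", "difficulty", "correctness", "bias_inclusivity"]

def build_judge_prompt(item: dict, traits: list[str]) -> str:
    """Ask a validator LLM to score an item against each trait, returning JSON."""
    trait_lines = "\n".join(f"- {t}: pass or fail" for t in traits)
    return (
        "You are a validator. Do not rewrite the item; only score it.\n"
        f"Question: {item['question']}\n"
        f"Options: {item['options']}\n"
        f"Answer key: {item['answer']}\n"
        'Return a JSON object mapping each trait to "pass" or "fail":\n'
        f"{trait_lines}"
    )

def parse_verdict(raw: str, traits: list[str]) -> dict:
    """Parse the model's JSON reply; malformed output or missing traits count as 'fail'."""
    try:
        verdict = json.loads(raw)
    except json.JSONDecodeError:
        return {t: "fail" for t in traits}
    return {t: verdict.get(t, "fail") for t in traits}
```

Treating unparseable output as "fail" is a deliberately conservative choice: anything the validator cannot score cleanly gets routed to human review rather than silently passing.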
Headline results
- Strong agreement across validators in our controlled runs (~96% global accuracy).
- Paneling multiple models as validators improves reliability: for example, MCC rose into the 90%+ range on traits like Correctness and Difficulty when models were combined.
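For readers unfamiliar with the metric: MCC (Matthews correlation coefficient) summarizes a binary confusion matrix in one number from -1 to +1, and unlike plain accuracy it stays honest on imbalanced trait data. A small self-contained sketch of the per-trait computation (standard formula, not our evaluation code):

```python
from math import sqrt

def mcc(y_true: list[int], y_pred: list[int]) -> float:
    """Matthews correlation coefficient for binary labels (1 = trait satisfied)."""
    tp = sum(t == 1 and p == 1 for t, p in zip(y_true, y_pred))
    tn = sum(t == 0 and p == 0 for t, p in zip(y_true, y_pred))
    fp = sum(t == 0 and p == 1 for t, p in zip(y_true, y_pred))
    fn = sum(t == 1 and p == 0 for t, p in zip(y_true, y_pred))
    denom = sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    # Convention: return 0.0 when any marginal is empty (undefined denominator).
    return 0.0 if denom == 0 else (tp * tn - fp * fn) / denom
```

A perfect validator scores 1.0, random guessing hovers near 0.0, which is why MCC in the 90%+ range is a strong result.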
Important limits & risks (what didn’t work yet)
- Grade-level alignment is fragile. When we intentionally "mutated" the student grade (e.g., Grade 2 ↔︎ Grade 12), most validators missed the issue; in one summary table, only ~5% of such mutants were caught by at least one LLM.
- On the same grade-mutation test, recall for key traits like Scope and Background was low for several models (some recalls in the 0.04–0.40 range), showing under-detection even where precision looked high.
- Workload (is the task too long or too short for the level?) is hard to judge: recalls around 0.11–0.36 for several validators.
- Model performance is uneven: some families needed formatting cleanup before scoring, which adds operational friction for productization (not a blocker, but a cost).
- Takeaway: validators are helpful but not sufficient on their own; teacher review remains essential.
What we’re changing next (in flight)
- More context in prompts (explicit grade/subject/skill) so validators can judge against the right target profile, not just the text.
- Stronger stress tests via metamorphic "mutations" (e.g., swap the correct answer, shift the grade) to harden validators and track MCC/F1 by trait.
- Diverse validator panels (mixing Llama, GPT, DeepSeek, Mistral) with agreement/voting to raise recall on tricky traits.
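The two mechanisms above can be sketched together: a metamorphic mutation deliberately breaks one property of an item (here, the target grade), and a panel vote aggregates several validators' verdicts. Field names and the tie-breaking rule are illustrative assumptions, not our production pipeline.

```python
from collections import Counter

def mutate_grade(item: dict, new_grade: int) -> dict:
    """Metamorphic mutation: shift the target grade while leaving the text
    unchanged. A good validator should now flag grade-sensitive traits
    (e.g., Scope, Background) as failing."""
    mutated = dict(item)  # copy so the original item is untouched
    mutated["grade"] = new_grade
    return mutated

def panel_verdict(votes: list[str]) -> str:
    """Majority vote across a panel of validator models. Ties count as
    'fail' so ambiguous items are routed to teacher review."""
    counts = Counter(votes)
    return "pass" if counts["pass"] > counts["fail"] else "fail"
```

The point of the mutation harness is that ground truth is known by construction: every mutated item *should* fail, so any "pass" verdict from the panel is a measurable miss, which is exactly how the low recalls above were surfaced.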
How this affects you
- In the tools you use, you'll see traffic-light-style scores per trait and a consensus verdict from multiple LLM validators, but your approval stays final.
- We'll invite teachers to review edge cases (especially grade-level fit and workload). Your feedback directly tunes the validators.
Bottom line
LLM drafting + LLM validators already help on correctness/difficulty and agree well overall, but grade-level fit and workload detection are current weaknesses. We're addressing these with richer prompts, diversified validator panels, and systematic stress tests, always keeping teacher judgment in the loop.