TAO Community Labs: AI-Enhanced Authoring for Education

At TAO, we are exploring how Large Language Models (LLMs) can transform the way assessments are created and shared. Our Innovation Lab initiative, TAO Studio, focuses on providing an accessible authoring environment that meets the needs of both professional assessment experts and everyday educators. By combining streamlined user experiences with interoperability through open standards like QTI 3.0, we aim to make assessment authoring more intuitive, inclusive, and scalable across classrooms, institutions, and national programs.

A key area of research is Automated Item Generation (AIG). Using prompt templates, fine-tuned LLMs, and human-in-the-loop review, TAO Studio explores how teachers and authors can generate high-quality assessment items faster, while ensuring relevance and validity. Sponsored by the Ministry of the Economy in Luxembourg and developed in cooperation with the Luxembourg Institute of Science and Technology (LIST), this work underscores TAO’s commitment to open innovation: developing AI-driven tools that empower the global education community, reduce barriers to content creation, and advance the open source assessment ecosystem.

I wanted to share a summary of recent work in our lab leveraging LLMs, carried out in cooperation with LIST, the Luxembourg Institute of Science and Technology.

What we tested

  • Had modern LLMs draft multiple-choice questions from short texts (a science set and a computer-science set).

  • Judged each draft against a clear checklist of 14 quality traits (clarity, difficulty, bias/inclusivity, etc.).

  • Used LLMs as validators (LLMs-as-a-Judge): they don’t write, they score items against the checklist above.
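
As a rough sketch, the validator setup above could look like the following. The trait list and the JSON reply format are assumptions for illustration; only a few of the 14 traits are named in this post, and the prompt wording is not our production template.

```python
import json

# Illustrative subset of the quality checklist; the real rubric has 14 traits.
TRAITS = ["clarity", "difficulty", "bias_inclusivity", "correctness",
          "scope", "background", "workload"]

def build_judge_prompt(item: str, traits=TRAITS) -> str:
    """Assemble an LLM-as-a-Judge prompt: the model scores, it never writes."""
    rubric = "\n".join(f"- {t}: pass or fail" for t in traits)
    return (
        "You are a validator for assessment items. Do not rewrite the item.\n"
        "Score it against each trait below and reply with JSON only.\n\n"
        f"Item:\n{item}\n\nTraits:\n{rubric}\n"
    )

def parse_verdict(raw: str, traits=TRAITS) -> dict:
    """Parse the judge's JSON reply into {trait: bool}; missing traits fail."""
    scores = json.loads(raw)
    return {t: bool(scores.get(t, False)) for t in traits}
```

Keeping drafting and judging in separate calls (and ideally separate models) is what lets the judge's verdict be compared against the checklist rather than against its own output.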

Headline results

  • Strong agreement across validators in our controlled runs (~96% global accuracy).

    Paneling multiple models as validators improves reliability; for example, MCC rose into the 90%+ range on traits like Correctness and Difficulty when combining models.
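
The panel combination can be sketched as simple majority voting, with MCC computed from the resulting confusion matrix. This is the generic formulation of both, not our exact pipeline:

```python
from math import sqrt

def panel_verdict(votes):
    """Majority vote across a panel of validator verdicts (True = trait passes)."""
    return sum(votes) > len(votes) / 2

def mcc(tp, tn, fp, fn):
    """Matthews correlation coefficient from a 2x2 confusion matrix."""
    denom = sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return 0.0 if denom == 0 else (tp * tn - fp * fn) / denom
```

MCC is a useful headline metric here because, unlike raw accuracy, it stays honest when flawed items are rare in the evaluation set.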

Important limits & risks (what didn’t work yet)

  • Grade-level alignment is fragile. When we intentionally “mutated” the student grade (e.g., Grade 2 ↔︎ Grade 12), most validators missed the issue; in one summary table we saw only ~5% of such mutants caught by at least one LLM.

  • On the same grade mutation test, recall for key traits like Scope and Background was low for several models (e.g., some recalls near 0.04–0.40), showing under-detection even when precision looked high.

  • Workload (is the task too long/short for the level?) is hard: recalls around 0.11–0.36 on several validators.

  • Model performance is uneven: some families needed formatting cleanup before scoring, adding operational friction for productization (not a blocker, but a cost).

  • Takeaway: validators are helpful but not sufficient on their own; teacher review remains essential.
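
To make the precision/recall gap described above concrete, here is a minimal illustration. The counts are invented for the example, not our measured data:

```python
def precision_recall(tp, fp, fn):
    """Precision and recall for flaw detection (positive = flawed item flagged)."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall

# Hypothetical: a validator flags only 4 of 100 grade-mutated items, but every
# flag is correct. Precision looks perfect (1.0) while recall is 0.04 -- the
# under-detection pattern described in the bullets above.
```

This is why we report recall per trait rather than precision alone: a validator that is quiet but always right still misses most of the flawed items.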

What we’re changing next (in flight)

  • More context in prompts (explicit grade/subject/skill) so validators can judge with the right target profile, not just the text.

  • Stronger stress-tests via metamorphic “mutations” (e.g., swap correct answer, shift grade) to harden validators and track MCC/F1 by trait.

  • Diverse validator panels (mixing Llama, GPT, DeepSeek, Mistral) with agreement/voting to raise recall on tricky traits.
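
A metamorphic mutation can be sketched as a small transformation of an item's metadata or key, after which a robust validator should change its verdict. The item schema (`stem`, `options`, `answer`, `grade`) is hypothetical, for illustration only:

```python
import copy

def mutate_grade(item: dict, new_grade: int) -> dict:
    """Metamorphic mutation: shift the target grade while leaving the text
    intact. A robust validator should now fail the item on grade-fit traits."""
    mutant = copy.deepcopy(item)
    mutant["grade"] = new_grade
    return mutant

def mutate_swap_answer(item: dict) -> dict:
    """Metamorphic mutation: key a distractor as the correct answer.
    A robust validator should now fail the item on Correctness."""
    mutant = copy.deepcopy(item)
    wrong = next(o for o in mutant["options"] if o != mutant["answer"])
    mutant["answer"] = wrong
    return mutant
```

Running a validator on both the original and the mutant gives a labeled pair for free, which is what lets us track MCC/F1 by trait without hand-annotating every item.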

How this affects you

  • In tools you use, you’ll see traffic-light style scores per trait and a consensus verdict from multiple LLM validators, but your approval stays final.

  • We’ll invite teachers to review edge cases (especially grade-level fit & workload). Your feedback directly tunes the validators.
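
As an illustration of what a traffic-light consensus could look like (the thresholds here are invented, not the shipped ones):

```python
def traffic_light(pass_rate: float) -> str:
    """Map a panel's per-trait pass rate to a traffic-light label.
    Thresholds are illustrative, not the production values."""
    if pass_rate >= 0.8:
        return "green"
    if pass_rate >= 0.5:
        return "amber"
    return "red"

def consensus(per_trait_votes: dict) -> dict:
    """Per-trait traffic lights from panel votes; the teacher still decides."""
    return {t: traffic_light(sum(v) / len(v)) for t, v in per_trait_votes.items()}
```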

Bottom line
LLM drafting + LLM validators already help on correctness/difficulty and agree well overall, but grade-level fit and workload detection are current weaknesses. We’re addressing these with richer prompts, diversified validator panels, and systematic stress-tests, always keeping teacher judgment in the loop.