
AI Judges in Design: Statistical Perspectives on Achieving Human Expert Equivalence With Vision-Language Models
Interview with Kristen Edwards – MIT
With generative AI poised to accelerate the creation of an overwhelming number of design concepts, the challenge of evaluating those outputs becomes increasingly important.
Kristen Edwards, a researcher at MIT’s DeCoDE Lab, returns to CDFAM Amsterdam this year to present her latest work on using vision-language models (VLMs) as evaluators, or AI judges, for engineering design.
Building on her previous research into multimodal models for design exploration, she now shifts focus to the statistical and methodological tools needed to assess whether AI-generated evaluations align with, or even exceed, human expert judgment.
In this interview, Kristen shares insights into the evolution of AI models over the past year, from agentic systems to improved multimodal reasoning, and details how she structures small, high-value datasets to enable meaningful evaluation.
She also discusses practical implications for early-stage design workflows, techniques for measuring AI reliability, and what she hopes to learn from the computational design community at CDFAM Amsterdam.
You presented at CDFAM Berlin last year on Concept to Manufacturing: Evaluating Vision-Language Models for Engineering Design. What have you been working on since then, and how have the tools or models available for this type of design evaluation evolved?
My research is focused on utilizing multimodal machine learning models for two main tasks: design exploration and design evaluation.
Since my presentation at CDFAM last year, I have been working on generating and validating 3D meshes directly from sketches during the conceptual design stage, in order to enhance design exploration. Here is a link to this work.

Most recently, however, my primary focus has been on how AI might serve as an evaluator in engineering design, especially important now that generative AI has made it easier than ever to produce large volumes of design concepts.
In the past year, there have been huge algorithmic strides in multimodal models, like large pre-trained vision-language models (VLMs), and big changes in the way AI is being utilized in engineering and manufacturing workflows. To name a few of the advancements and trends from the past year, we’ve seen:
- Agentic AI systems: systems that can act autonomously, making decisions and taking actions to achieve goals based on their own reasoning rather than direct human intervention.
- More powerful open-source models: Llama 3, DeepSeek-R1, Qwen 2.5, and Mistral, to name a few.
- Powerful reasoning models, including open-source ones, that show improved performance on many tasks using chain-of-thought answering.
- Improved multimodal models: For example, gpt-4o and gpt-image-1 have brought significant performance improvements on tasks that require understanding both image and text inputs.
- Continued buzz around generative AI, but with a shift from generation alone toward evaluation: we’re now seeing increased attention on how to systematically evaluate these outputs to guide downstream decisions.
- LLM-as-a-Judge: Over the past year, there’s been a rise in research and real-world experimentation with large models, primarily just language models, as evaluators. These include both benchmarking tools and in-the-loop evaluators for human or AI-generated content. I’m interested in how multimodal models can serve as evaluators in the context of engineering design evaluation, and how to evaluate their performance.
That’s what I’ll be discussing at this year’s CDFAM: a statistical perspective on measuring whether AI-judges’ evaluations align with experts’ evaluations.

What kinds of tools or capabilities would you like to see become available to support this kind of AI-driven design evaluation? And in the meantime, what can engineers do to get the most out of the tools we have today?
Design evaluation varies widely depending on the stage of the design process and the nature of the metrics involved. In early-stage conceptual design, representations may be rough sketches, and evaluations tend to focus on subjective criteria such as creativity or novelty. Later in the process, during detailed design, evaluations may rely on rich 3D CAD models and focus on objective metrics like manufacturability or structural performance via simulation tools like CFD or FEA.
To support AI-driven design evaluation across this spectrum, I’d like to see tools that:
- Intake a variety of design representations, from sketches and natural language descriptions to full CAD assemblies
- Provide reliable evaluations of both subjective and objective metrics, ideally validated against expert or real-world outcomes
- Enable lightweight customization or fine-tuning, so engineers can align evaluations with their specific domain or design context without massive datasets
- Explain their reasoning, especially for subjective or qualitative metrics, to foster trust and insight
- Integrate with existing metrics and benchmarks to easily assess their results against a “ground-truth” evaluator, like a human expert.
There are already impressive AI tools in this space — for instance, SimScale (co-founded by David Heiny) allows engineers to run simulations in the cloud with AI-assisted preprocessing and setup. As more tools emerge, they’ll help streamline performance-driven evaluation. But for subjective or creative metrics, the AI tooling is still developing.
In the meantime, engineers can get the most out of existing AI tools by:
- Validating AI evaluators before relying on them. Ensure LLMs or VLMs are accurate by comparing their outputs to expert judgments or known benchmarks
- Using in-context learning to align pre-trained models with “ground truth” examples, which is often more feasible than training a model from scratch
- Treating AI as a collaborator, not a final rater. AI models can surface ideas, rank options, or provide second opinions, but human oversight remains critical, especially in high-stakes or novel design contexts
AI-based evaluators have the potential to reduce the evaluative load on experts and allow design evaluation at large scale. But they should be employed as part of a human-AI team, with measures in place to ensure that the AI’s evaluations match those of the ground-truth evaluator.
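As a rough illustration of that human-AI teaming point, here is a minimal sketch in Python. The `vlm_rate` and `request_human_review` callables are hypothetical stand-ins for a team’s actual model call and expert-review process; the idea is simply to validate the AI judge on a small expert-labeled set before letting it rate anything on its own.

```python
# Minimal sketch of an AI judge used inside a human-AI team.
# `vlm_rate` and `request_human_review` are hypothetical stand-ins for a
# team's actual model call and expert-review process.
from scipy.stats import spearmanr

def ai_judge_is_trusted(ai_ratings, expert_ratings, min_rho=0.7):
    """Validate the AI judge against a small expert-labeled set before relying on it."""
    rho, _ = spearmanr(ai_ratings, expert_ratings)
    return rho >= min_rho

def evaluate_design(design, trusted, vlm_rate, request_human_review):
    """Use the AI rating as a first pass only if validation succeeded."""
    if trusted:
        return vlm_rate(design)           # AI provides the rating
    return request_human_review(design)   # otherwise defer to an expert
```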

Data scarcity is a recurring challenge in this space. How do you approach curating or structuring small expert datasets in a way that still enables meaningful model training and evaluation?
The progress in AI has been driven by advances in algorithms, computational power, and data, but in engineering and design, data scarcity is often the limiting factor.
Unlike web-scale datasets, expert design data is typically small, expensive to collect, and domain-specific. In the engineering domain there is an important call for curating larger labeled datasets, but in the meantime, rather than relying on large-scale training, I am exploring how to use small datasets more effectively.
Two strong approaches are:
1. In-Context Learning (ICL): ICL uses a few carefully chosen examples to prompt large pretrained models without retraining. This is especially useful when expert ratings or qualitative metrics (like creativity or manufacturability) are involved. I have found that well-structured prompts with intentional context examples can allow an AI judge to produce ratings that match an expert’s as closely as, or more closely than, other experts do (a minimal prompt sketch follows this list).
2. Retrieval-Augmented Generation (RAG): RAG pairs a model with a retrieval system to surface relevant prior data at inference time. This is helpful when drawing on design precedents, technical specs, or domain knowledge, since grounding outputs in real-world context reduces hallucinations.
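To make the in-context learning idea in item 1 concrete, here is a minimal sketch of packing a few expert-rated examples into a single VLM prompt. The message structure follows the OpenAI Python SDK’s chat/vision interface, but the metric, rating scale, and image URLs are placeholders, and any VLM that accepts interleaved image and text input could be swapped in.

```python
# Sketch of an in-context-learning (few-shot) prompt for a VLM design judge.
# The metric, 1-5 scale, and image URLs are placeholders.
from openai import OpenAI  # assumes openai>=1.0 and OPENAI_API_KEY set

def build_icl_messages(examples, query_image_url, metric="novelty"):
    """examples: list of (image_url, expert_rating) pairs used as context."""
    content = [{"type": "text", "text":
                f"You rate design sketches for {metric} on a 1-5 scale. "
                "Here are expert-rated examples, followed by a new sketch to rate."}]
    for url, rating in examples:
        content.append({"type": "image_url", "image_url": {"url": url}})
        content.append({"type": "text", "text": f"Expert {metric} rating: {rating}"})
    content.append({"type": "image_url", "image_url": {"url": query_image_url}})
    content.append({"type": "text",
                    "text": f"Rate this sketch's {metric} from 1 to 5. Reply with a single number."})
    return [{"role": "user", "content": content}]

client = OpenAI()
response = client.chat.completions.create(
    model="gpt-4o",
    messages=build_icl_messages(
        examples=[("https://example.com/sketch_a.png", 4),
                  ("https://example.com/sketch_b.png", 2)],
        query_image_url="https://example.com/sketch_new.png",
    ),
)
print(response.choices[0].message.content)  # e.g. "3"
```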
A compelling example of data-efficient modeling is TabPFN (V2 was recently published in Nature), which learns a prior over many tasks and performs well on new tabular datasets with very little data, a promising direction for engineering applications.
Human-aligned AI evaluation is possible even with small datasets, provided they are well structured and paired with the right prompting or retrieval techniques.
Using AI for evaluation of large-scale datasets holds promise if the evaluations are reliable. How do we measure reliability in this case, and how can users ensure that an AI-evaluator is reliable or aligned with a “ground truth” evaluator?
There are a number of metrics that you can use to determine if an AI-judge, or any judge for that matter, matches a “ground truth” judge.
Traditionally, this sort of problem has been looked at in terms of interrater agreement among humans and, more recently, in LLM-as-a-judge research. For interrater agreement, common metrics are the intraclass correlation coefficient (ICC), weighted Cohen’s Kappa, and Pearson’s correlation coefficient; all of these measure consistency between raters, though only some account for chance agreement. In LLM-as-a-judge work, the most common metrics are pure agreement and mean absolute error, which measure how close the actual rating values are.
Ensuring the reliability of an AI-judge requires looking beyond simple error or agreement with ground-truth raters, and instead evaluating multiple facets of rating similarity, including ranking consistency, distributional similarity, and interrater reliability.
In general, I have found that metrics for judging how similar ratings are fall into the following categories:
- Interrater Reliability: Measures consistency while accounting for chance agreement (e.g., ICC, Cohen’s Kappa).
- Agreement: Tests whether the actual rating values are close (e.g., Bland–Altman plots, Concordance Correlation).
- Correlation & Ranking Consistency: Evaluates whether raters assign higher scores to the same items, focusing on trends or relative orderings (e.g., Pearson’s r, Spearman’s ρ, Jaccard similarity of top-k).
- Error Metrics: Quantifies how far apart the ratings are on average (e.g., MAE, RMSE).
- Equivalence Testing: Determines whether differences fall within a predefined margin of practical equivalence (e.g., TOST).
- Distributional Similarity: Compares the overall shape of the rating distributions (e.g., Kolmogorov–Smirnov test, Wilcoxon signed-rank).
The exact tests that apply will depend on features of the data, including whether the ratings are ordinal or categorical, continuous or discrete, and whether we expect them to be normally distributed.
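To make a few of these categories concrete, here is a minimal sketch using SciPy and scikit-learn, comparing one AI judge against one expert on made-up 1–5 ratings with an illustrative ±0.5 equivalence margin; ICC would typically be computed separately with a dedicated package such as pingouin on long-format data.

```python
# Sketch: comparing one AI judge's ratings to one expert's ratings using
# several of the metric families above. All numbers are made up.
import numpy as np
from scipy import stats
from sklearn.metrics import cohen_kappa_score, mean_absolute_error

expert = np.array([4, 2, 5, 3, 3, 1, 4, 2, 5, 3])
ai     = np.array([4, 3, 5, 3, 2, 1, 4, 2, 4, 3])

# Correlation & ranking consistency
print(f"Pearson r:       {stats.pearsonr(expert, ai)[0]:.2f}")
print(f"Spearman rho:    {stats.spearmanr(expert, ai)[0]:.2f}")

# Interrater reliability (chance-corrected, ordinal-aware)
print(f"Weighted kappa:  {cohen_kappa_score(expert, ai, weights='quadratic'):.2f}")

# Error metrics
print(f"MAE:             {mean_absolute_error(expert, ai):.2f}")

# Distributional similarity
print(f"KS test p-value: {stats.ks_2samp(expert, ai).pvalue:.3f}")

# Equivalence testing (TOST): is the mean difference within +/-0.5 points?
margin = 0.5
diff = ai - expert
p_lower = stats.ttest_1samp(diff, -margin, alternative="greater").pvalue
p_upper = stats.ttest_1samp(diff,  margin, alternative="less").pvalue
print(f"TOST p-value:    {max(p_lower, p_upper):.3f}")  # small => practically equivalent
```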
How do you see these AI-based evaluation techniques fitting into real-world design workflows? Are there examples where this approach could reduce reliance on time-intensive expert review?
The area that I am closest to is engineering design and development, so that shapes my perspective.
Within that scope, I see AI-based evaluation techniques fitting especially well into the early- to mid-stage design workflow, particularly during concept selection, a phase that often requires expert judgment to downselect from a wide range of ideas.
This filtering or downselection step can be a bottleneck: designers or engineers generate dozens (sometimes hundreds) of concept sketches or models, but only a handful of experts are available to review them for qualities like feasibility, novelty, or alignment with design intent.
AI evaluators (particularly multimodal models that can interpret sketches, text descriptions, or CAD models) could act as a first-pass filter, helping prioritize the most promising candidates for deeper human review.
Some practical examples of where this could reduce reliance on time-intensive expert evaluation:
- Performing first-pass concept-selection during the conceptual design stage.
- Pre-screening generated designs from a generative AI pipeline before simulation or prototyping.
- Identifying outliers or low-quality submissions early on, freeing experts to focus on more meaningful comparisons.
- Design competitions or ideation workshops where hundreds of submissions need to be ranked or clustered. Google’s Project 10 to the 100th is a great example of the challenges of using experts to perform idea/design evaluation at scale.
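As a rough sketch of the first-pass filter idea described above, downselection can be as simple as scoring every concept with a previously validated AI judge and forwarding only the top slice to expert review; `vlm_score` here is a hypothetical stand-in for the actual model call.

```python
# Sketch of AI-assisted downselection: score every concept with a
# previously validated AI judge and keep only the top slice for experts.
# `vlm_score` is a hypothetical stand-in for the actual model call.

def first_pass_filter(concepts, vlm_score, keep_fraction=0.2):
    """Return the highest-scoring concepts for deeper human review."""
    ranked = sorted(concepts, key=vlm_score, reverse=True)
    n_keep = max(1, int(len(ranked) * keep_fraction))
    return ranked[:n_keep]

# e.g. shortlist = first_pass_filter(all_sketches, vlm_score=my_judge.rate)
```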
What are you looking forward to sharing at CDFAM Amsterdam this year, and what kinds of feedback or collaboration are you hoping to find within the computational design community?
At CDFAM Amsterdam, I’ll be sharing my latest work on AI-based design evaluation: specifically, a statistical perspective for evaluating how well an AI judge matches expert judgments in engineering design evaluations.
As generative tools accelerate design creation, design evaluation becomes a real challenge. I am interested in whether multimodal machine learning models can effectively assist in design evaluation – and how to assess that.
I’m particularly interested in engaging with researchers and practitioners across industry, government, and academia to understand where computational methods are meaningfully improving workflows, and just as importantly, where they fall short. Those cross-sector insights are crucial for identifying both high-impact applications and overlooked edge cases.
We’re already seeing LLMs deployed for evaluative tasks, from grading assignments and reviewing proposals to filtering candidates. But if we are putting AI in the position of a judge, we need rigorous methods to verify its outputs: ensuring that evaluations are reliable, consistent, and aligned with ground-truth or expert consensus.
I’m excited to hear from the computational design community about whether and where AI evaluations would be useful, and what reliability thresholds would need to be met to ensure confidence in employing them.