
Most AI models can write working code, but can they design something that’s actually delightful to use?
If you’ve ever asked an AI to generate a website, dashboard, or game, you’ve probably run into the same problem: yes, the code might run—but the buttons are clunky, the colors don’t match, and the overall experience feels off. Functional? Sure. Visually appealing or user-friendly? Not really.
That’s the gap Tencent is trying to close with its new AI benchmark, ArtifactsBench.
Unlike traditional tests that only check whether AI-written code runs, ArtifactsBench goes further: it judges how the resulting artifact looks, feels, and responds. Think of it as a design-savvy art critic built into an automated testing pipeline.
Here’s how it works: the AI is given one of 1,825 creative tasks, ranging from web apps and data visualizations to mini-games. Once the model produces its code, ArtifactsBench runs that code in a safe sandbox, captures screenshots of how it behaves, and hands everything to a multimodal large language model (MLLM) that acts as the judge. The judge doesn’t just eyeball the result; it works through a structured 10-point checklist to score functionality, user experience, and aesthetics.
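In rough outline, that pipeline looks something like the sketch below. Everything here is illustrative rather than Tencent’s actual code: the function names (run_in_sandbox, capture_screenshots, query_mllm_judge), the checklist wording, and the 0–10 scoring scale are assumptions standing in for internals the article doesn’t spell out.

```python
# Minimal sketch of an ArtifactsBench-style evaluation loop (illustrative only).
# The sandbox, screenshot, and judge calls are placeholder stubs that a real
# harness would back with a headless browser and an MLLM API.
from dataclasses import dataclass
from typing import List, Tuple

# Ten criteria spanning functionality, UX, and aesthetics (wording is assumed,
# not taken from the real checklist).
CHECKLIST: List[str] = [
    "Runs without errors",
    "Implements every requirement in the task prompt",
    "Interactive elements respond as expected",
    "Layout is consistent and readable",
    "Color palette and typography are coherent",
    "Animations and transitions feel smooth",
    "Handles obvious edge cases gracefully",
    "Information hierarchy is clear",
    "Overall visual polish",
    "Overall user experience",
]

@dataclass
class JudgeVerdict:
    scores: List[int]   # one 0-10 score per checklist item
    rationale: str      # the judge's brief written justification

def run_in_sandbox(code: str) -> str:
    """Placeholder: execute the generated code in an isolated environment."""
    return "sandbox://artifact"

def capture_screenshots(artifact_url: str) -> List[bytes]:
    """Placeholder: render the artifact and grab screenshots of a few states."""
    return [b"<png bytes>"]

def query_mllm_judge(prompt: str, screenshots: List[bytes]) -> JudgeVerdict:
    """Placeholder: send the prompt plus screenshots to a multimodal LLM judge."""
    return JudgeVerdict(scores=[7] * len(CHECKLIST), rationale="stub verdict")

def evaluate(task_prompt: str, generated_code: str) -> Tuple[float, str]:
    artifact = run_in_sandbox(generated_code)       # 1. run the code safely
    shots = capture_screenshots(artifact)           # 2. record how it looks and behaves
    judge_prompt = (                                # 3. structured checklist prompt
        f"Task: {task_prompt}\nScore each criterion from 0-10:\n"
        + "\n".join(f"- {c}" for c in CHECKLIST)
    )
    verdict = query_mllm_judge(judge_prompt, shots)
    return sum(verdict.scores) / len(verdict.scores), verdict.rationale

if __name__ == "__main__":
    score, why = evaluate("Build a small to-do list web app", "<model's HTML/JS>")
    print(f"Average score: {score:.1f}/10. Rationale: {why}")
```

The key design choice is that the judge never scores the raw code in isolation; it scores what the code actually produces on screen.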
The result? A scoring system whose verdicts line up with human taste 94.4% of the time, a much closer match than previous automated methods achieved.
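What does an “alignment rate” like that mean in practice? One common way to quantify it is pairwise ranking agreement: how often the benchmark and human raters put the same pair of models in the same order. The article doesn’t say exactly which metric Tencent used, so treat the sketch below (and its made-up scores) purely as an illustration of the idea.

```python
# Illustrative only: pairwise ranking agreement between benchmark scores and
# human preference scores. The metric choice and all numbers are assumptions.
from itertools import combinations
from typing import Dict

def pairwise_agreement(bench: Dict[str, float], human: Dict[str, float]) -> float:
    """Fraction of model pairs that both score tables rank in the same order."""
    agree, total = 0, 0
    for a, b in combinations(bench, 2):
        total += 1
        # Same sign means both sources prefer the same model in this pair.
        if (bench[a] - bench[b]) * (human[a] - human[b]) > 0:
            agree += 1
    return agree / total

bench_scores = {"model_a": 8.1, "model_b": 7.4, "model_c": 6.9}     # hypothetical
human_scores = {"model_a": 0.71, "model_b": 0.65, "model_c": 0.66}  # hypothetical

print(f"Agreement: {pairwise_agreement(bench_scores, human_scores):.1%}")
```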
Tencent ran more than 30 leading AI models through ArtifactsBench. Surprisingly, general-purpose models outperformed specialized coding AIs: Qwen-2.5-Instruct, a generalist, beat Qwen’s own coding- and vision-specific variants. Why? Because good design takes more than generating code; it demands reasoning, instruction-following, and a measure of design intuition.
In other words, “good taste” in AI is a complex mix of logic and style—and ArtifactsBench might be our best tool yet to measure it.
With this, Tencent isn’t just asking, “Can your AI code?” It’s asking, “Can your AI create something people actually want to use?” And that’s a shift worth watching.