UEval: A Benchmark for Unified Multimodal Generation

Bo Li, Yida Yin, Wenhao Chai, Xingyu Fu*, Zhuang Liu*

Princeton University

(* indicates co-advising)

What is UEval?

UEval comprises 1,000 expert-curated prompts, sourced from 8 diverse real-world domains, each requiring the model to produce both images and text in its output.

Leaderboard

Frontier models consistently outperform open-source ones. GPT-5-Thinking achieves the highest score of 66.4, while the best open-source model, Emu3.5, obtains 49.1 — both well below the human reference score of 92.2.

Model             Type    Space  Textbook  Diagram  Paper  Art   Life  Tech  Exercise  Avg
Reference         -       96.2   94.4      93.1     96.2   90.6  87.7  90.6  89.2      92.2
GPT-5-Thinking    Closed  84.0   78.0      67.8     51.9   67.8  63.8  57.0  61.4      66.4
Gemini-2.5-Flash  Closed  78.0   74.0      66.4     71.6   66.6  63.0  58.2  50.0      66.0
GPT-5-Instant     Closed  77.3   77.9      62.3     55.1   71.2  69.7  50.7  57.6      65.2
Gemini-2.0-Flash  Closed  65.2   55.2      47.6     45.8   70.4  58.0  50.2  48.0      55.1
Emu3.5            Open    59.1   57.4      41.1     31.6   59.3  62.0  37.0  45.4      49.1
BAGEL             Open    29.8   42.5      37.2     20.0   39.0  33.6  24.8  21.4      31.0
Janus-Pro         Open    21.0   31.0      37.4     15.2   26.4  23.0  17.6  11.5      22.9
Show-o2           Open    25.4   33.1      33.2     17.4   25.6  15.6  17.4  13.1      22.6
MMaDA             Open    10.8   20.0      14.2     13.3   15.7  15.8  12.4  12.6      14.4

Getting Started

pip install google-genai datasets pillow

export GEMINI_API_KEY="your-api-key"

python ueval_eval.py \
  --model_output_path your_outputs.json \
  --output_path results.json
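Once `ueval_eval.py` has produced a results file, you will typically want per-domain and overall averages. The record schema below is hypothetical (the actual format is documented in the GitHub repo); this is just a minimal sketch of the aggregation step:

```python
from collections import defaultdict

# Hypothetical results format: one record per prompt, carrying the domain
# it came from and the judge's score. The real schema may differ.
records = [
    {"id": "space_001", "domain": "Space", "score": 84.0},
    {"id": "space_002", "domain": "Space", "score": 76.0},
    {"id": "art_001",   "domain": "Art",   "score": 62.0},
]

def domain_averages(records):
    """Return (per-domain mean scores, overall mean score)."""
    by_domain = defaultdict(list)
    for r in records:
        by_domain[r["domain"]].append(r["score"])
    per_domain = {d: sum(s) / len(s) for d, s in by_domain.items()}
    overall = sum(r["score"] for r in records) / len(records)
    return per_domain, overall

per_domain, overall = domain_averages(records)
print(per_domain)  # {'Space': 80.0, 'Art': 62.0}
print(overall)     # 74.0
```

Note that the official leaderboard average may weight domains by their prompt counts rather than taking a plain mean, so use the released script for reported numbers.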

See GitHub for full documentation.

Citation

@article{li2026ueval,
  title   = {UEval: A Benchmark for Unified Multimodal Generation},
  author  = {Li, Bo and Yin, Yida and Chai, Wenhao and Fu, Xingyu and Liu, Zhuang},
  journal = {arXiv preprint arXiv:2601.22155},
  year    = {2026}
}