UEval: A Benchmark for Unified Multimodal Generation
Princeton University
What is UEval?
UEval comprises 1,000 expert-curated prompts that require models to generate both images and text in their outputs, sourced from 8 diverse real-world domains.
Leaderboard
Frontier closed models consistently outperform open-source ones: GPT-5-Thinking achieves the highest score of 66.4, while the best open-source model, Emu3.5, obtains 49.1.
| Model | Space | Textbook | Diagram | Paper | Art | Life | Tech | Exercise | Avg |
|---|---|---|---|---|---|---|---|---|---|
| Reference | 96.2 | 94.4 | 93.1 | 96.2 | 90.6 | 87.7 | 90.6 | 89.2 | 92.2 |
| GPT-5-Thinking (closed) | 84.0 | 78.0 | 67.8 | 51.9 | 67.8 | 63.8 | 57.0 | 61.4 | 66.4 |
| Gemini-2.5-Flash (closed) | 78.0 | 74.0 | 66.4 | 71.6 | 66.6 | 63.0 | 58.2 | 50.0 | 66.0 |
| GPT-5-Instant (closed) | 77.3 | 77.9 | 62.3 | 55.1 | 71.2 | 69.7 | 50.7 | 57.6 | 65.2 |
| Gemini-2.0-Flash (closed) | 65.2 | 55.2 | 47.6 | 45.8 | 70.4 | 58.0 | 50.2 | 48.0 | 55.1 |
| Emu3.5 (open) | 59.1 | 57.4 | 41.1 | 31.6 | 59.3 | 62.0 | 37.0 | 45.4 | 49.1 |
| BAGEL (open) | 29.8 | 42.5 | 37.2 | 20.0 | 39.0 | 33.6 | 24.8 | 21.4 | 31.0 |
| Janus-Pro (open) | 21.0 | 31.0 | 37.4 | 15.2 | 26.4 | 23.0 | 17.6 | 11.5 | 22.9 |
| Show-o2 (open) | 25.4 | 33.1 | 33.2 | 17.4 | 25.6 | 15.6 | 17.4 | 13.1 | 22.6 |
| MMaDA (open) | 10.8 | 20.0 | 14.2 | 13.3 | 15.7 | 15.8 | 12.4 | 12.6 | 14.4 |
Getting Started
```shell
pip install google-genai datasets pillow
export GEMINI_API_KEY="your-api-key"
python ueval_eval.py \
  --model_output_path your_outputs.json \
  --output_path results.json
```
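The evaluator reads a model-output JSON file; the authoritative schema is defined by `ueval_eval.py` on GitHub. As a minimal sketch only, one plausible interleaved text-and-image layout (all field names here are hypothetical, not confirmed by the benchmark) might look like:

```python
import json

# Hypothetical sketch of your_outputs.json: each entry pairs a prompt ID
# with the model's interleaved text/image response. Field names ("prompt_id",
# "response", "type", "content", "path") are illustrative assumptions.
outputs = [
    {
        "prompt_id": "space-0001",
        "response": [
            {"type": "text", "content": "The orbit diagram below shows ..."},
            {"type": "image", "path": "images/space-0001-0.png"},
            {"type": "text", "content": "As labeled, the apoapsis is ..."},
        ],
    }
]

# Write the file that would be passed via --model_output_path.
with open("your_outputs.json", "w") as f:
    json.dump(outputs, f, indent=2)
```

Generated images would be saved alongside the JSON and referenced by path, so the judge can score text and images jointly per prompt.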
See GitHub for full documentation.
Citation
```
@article{li2026ueval,
  title   = {UEval: A Benchmark for Unified Multimodal Generation},
  author  = {Li, Bo and Yin, Yida and Chai, Wenhao and Fu, Xingyu and Liu, Zhuang},
  journal = {arXiv preprint arXiv:2601.22155},
  year    = {2026}
}
```