UEval: A Benchmark for Unified Multimodal Generation
Princeton University
What is UEval?
UEval comprises 1,000 expert-curated prompts that require models to generate both images and text in their outputs, sourced from 8 diverse real-world domains.
Leaderboard
Frontier closed models consistently outperform open-source ones: GPT-5-Thinking achieves the highest score of 66.4, while the best open-source model, Emu3.5, obtains 49.1.
| Model | Space | Textbook | Diagram | Paper | Art | Life | Tech | Exercise | Avg |
|---|---|---|---|---|---|---|---|---|---|
| Reference | 96.2 | 94.4 | 93.1 | 96.2 | 90.6 | 87.7 | 90.6 | 89.2 | 92.2 |
| GPT-5-Thinking (closed) | 84.0 | 78.0 | 67.8 | 51.9 | 67.8 | 63.8 | 57.0 | 61.4 | 66.4 |
| Gemini-2.5-Flash (closed) | 78.0 | 74.0 | 66.4 | 71.6 | 66.6 | 63.0 | 58.2 | 50.0 | 66.0 |
| GPT-5-Instant (closed) | 77.3 | 77.9 | 62.3 | 55.1 | 71.2 | 69.7 | 50.7 | 57.6 | 65.2 |
| Gemini-2.0-Flash (closed) | 65.2 | 55.2 | 47.6 | 45.8 | 70.4 | 58.0 | 50.2 | 48.0 | 55.1 |
| Emu3.5 (open) | 59.1 | 57.4 | 41.1 | 31.6 | 59.3 | 62.0 | 37.0 | 45.4 | 49.1 |
| BAGEL (open) | 29.8 | 42.5 | 37.2 | 20.0 | 39.0 | 33.6 | 24.8 | 21.4 | 31.0 |
| Janus-Pro (open) | 21.0 | 31.0 | 37.4 | 15.2 | 26.4 | 23.0 | 17.6 | 11.5 | 22.9 |
| Show-o2 (open) | 25.4 | 33.1 | 33.2 | 17.4 | 25.6 | 15.6 | 17.4 | 13.1 | 22.6 |
| MMaDA (open) | 10.8 | 20.0 | 14.2 | 13.3 | 15.7 | 15.8 | 12.4 | 12.6 | 14.4 |
Getting Started
```shell
pip install google-genai datasets pillow
export GEMINI_API_KEY="your-api-key"
python ueval_eval.py \
  --model_output_path your_outputs.json \
  --output_path results.json
```
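The evaluator reads a model-output JSON file; the authoritative schema is defined by `ueval_eval.py` on GitHub. As a minimal sketch only, one plausible interleaved text-and-image layout (all field names here are hypothetical, not confirmed by the benchmark) might look like:

```python
import json

# Hypothetical sketch of your_outputs.json: each entry pairs a prompt ID
# with the model's interleaved text/image response. Field names ("prompt_id",
# "response", "type", "content", "path") are illustrative assumptions.
outputs = [
    {
        "prompt_id": "space-0001",
        "response": [
            {"type": "text", "content": "The orbit diagram below shows ..."},
            {"type": "image", "path": "images/space-0001-0.png"},
            {"type": "text", "content": "As labeled, the apoapsis is ..."},
        ],
    }
]

# Write the file that would be passed via --model_output_path.
with open("your_outputs.json", "w") as f:
    json.dump(outputs, f, indent=2)
```

Generated images would be saved alongside the JSON and referenced by path, so the judge can score text and images jointly per prompt.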
See GitHub for full documentation.
Citation
```
@article{li2026ueval,
  title   = {UEval: A Benchmark for Unified Multimodal Generation},
  author  = {Li, Bo and Yin, Yida and Chai, Wenhao and Fu, Xingyu and Liu, Zhuang},
  journal = {arXiv preprint arXiv:2601.22155},
  year    = {2026}
}
```