VisionFoundry

Teaching VLMs Visual Perception with Synthetic Images

Vision-language models (VLMs or MLLMs) still struggle with visual perception tasks such as spatial understanding and viewpoint reasoning. We hypothesize that this perception bottleneck stems from insufficient supervision in natural image datasets for visual perception skills.

We introduce VisionFoundry, a task-aware synthetic data generation pipeline that uses LLMs to generate questions, answers, and T2I prompts, then synthesizes images and verifies consistency with a VLM, requiring no reference images or human annotation. Using VisionFoundry, we construct VisionFoundry-10K, a 10k synthetic VQA dataset with 10 visual perception tasks.

Models trained on VisionFoundry-10K achieve substantial improvements on visual perception benchmarks: +7% on MMVP and +10% on CV-Bench-3D, while preserving broader capabilities and showing favorable scaling behavior as data size increases.

Guanyu Zhou1, Yida Yin1, Wenhao Chai1, Shengbang Tong2, Xingyu Fu1, Zhuang Liu1
1Princeton University    2New York University
VisionFoundry overview

Visual Perception: A Data Problem?

VLMs still struggle with visual perception tasks. We hypothesize that this perception bottleneck stems from insufficient supervision in natural image datasets for low-level visual skills. Natural image-text corpora may not systematically cover the full combinatorial space of spatial relations, viewpoint variations, and depth orderings.

Can we synthesize targeted supervision from scratch to address these weaknesses, without relying on reference images or expensive human annotation?

Main Results

+7% on MMVP (visual perception)
+5% on CV-Bench-2D (2D spatial reasoning)
+10% on CV-Bench-3D (3D spatial reasoning)
10K samples (1k per task, 10 tasks)

Finetuning on VisionFoundry-10K consistently improves visual perception benchmarks across three VLMs; changes on general-purpose benchmarks are mixed and benchmark-dependent.

VisionFoundry Pipeline

VisionFoundry pipeline overview
1. Text Generation: GPT-5.2 generates task-aware Q&A pairs and detailed T2I prompts from an entity pool.

2. Image Synthesis: Gemini-2.5-Flash-Image (Nano Banana) synthesizes images conditioned on the T2I prompts.

3. Verification & Filtering: Gemini-3-Pro verifies that each image is consistent with the visual statement that determines the answer; inconsistent samples are filtered out.
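The three stages above can be sketched as a simple generate-synthesize-verify loop. This is a minimal illustration, not the authors' implementation: the function names, the `Sample` record, and the stubbed model calls (which stand in for the real GPT-5.2, Nano Banana, and Gemini-3-Pro API calls) are all assumptions made for clarity.

```python
from dataclasses import dataclass
from typing import List, Optional

@dataclass
class Sample:
    """One VQA sample flowing through the pipeline."""
    task: str
    question: str
    answer: str
    t2i_prompt: str
    image: Optional[bytes] = None
    verified: bool = False

def generate_text(task: str, entity: str) -> Sample:
    # Stub for stage 1 (GPT-5.2): produce a task-aware Q&A pair
    # plus a detailed T2I prompt from an entity-pool item.
    return Sample(
        task=task,
        question=f"Is the {entity} left or right of the table?",
        answer=f"The {entity} is to the left of the table.",
        t2i_prompt=f"A photo of a {entity} placed to the left of a wooden table.",
    )

def synthesize_image(t2i_prompt: str) -> bytes:
    # Stub for stage 2 (Gemini-2.5-Flash-Image / Nano Banana).
    return b"\x89PNG-stub"

def verify(image: bytes, statement: str) -> bool:
    # Stub for stage 3 (Gemini-3-Pro): check that the image is
    # consistent with the answer-determining visual statement.
    return bool(image) and bool(statement)

def build_task_subset(task: str, entities: List[str], target: int = 1000) -> List[Sample]:
    """Run the loop until `target` verified samples are collected."""
    kept: List[Sample] = []
    for entity in entities:
        if len(kept) >= target:
            break
        sample = generate_text(task, entity)
        sample.image = synthesize_image(sample.t2i_prompt)
        sample.verified = verify(sample.image, sample.answer)
        if sample.verified:  # filter out misaligned samples
            kept.append(sample)
    return kept
```

With `target=1000` for each of the 10 tasks, this loop would yield the 10,000 verified QA pairs that make up VisionFoundry-10K.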

VisionFoundry-10K

Each task targets a distinct low-level visual perception skill identified as a persistent weakness in contemporary VLMs, and each task contributes 1,000 verified samples, yielding 10,000 QA pairs in total.

Orientation and Direction
Viewpoint and Perspective
Positional and Relational Context
Spatial Relationship
State and Condition
Structural and Physical Characteristics
Color and Appearance
Depth Order
Relative Distance
Real-World Spatial Understanding

Dataset Examples

Orientation and Direction
Q: Is the motorcycle facing toward the camera or away from the camera?
A: Away from the camera.

Viewpoint and Perspective
Q: From what perspective was the photo taken relative to the screen?
A: From a very low, worm's-eye viewpoint near the front row.

Positional and Relational Context
Q: Which object is directly above the suitcase?
A: The backpack.

Spatial Relationship
Q: In the blueprint image, is the tripod located west of the sink?
A: Yes, the tripod is west (left) of the sink.

State and Condition
Q: Is the toy ship's cabin door closed or ajar?
A: Ajar.

Structural and Physical Characteristics
Q: What is the silhouette profile of the barometer in the image?
A: Circular with a small hanging loop at the top.

Color and Appearance
Q: What two colors are the stripes on the sofa?
A: Red and white.

Depth Order
Q: Which object is closer to the camera, the submarine or the knife?
A: The submarine is closer to the camera.

Relative Distance
Q: Which object is closest to the pallet stack with the barcode label?
A: Cup.

Real-World Spatial Understanding
Q: Is the window's bottom sill higher than the mailbox?
A: Yes, the window sill is higher than the top of the mailbox.

Data-Size Effects of Synthetic Supervision

Data-Size Effects of Synthetic Supervision

Performance improves predictably with more synthetic data: visual perception benchmarks trend upward as data size increases, with no sign of saturation, indicating that VisionFoundry provides reliable, high-quality supervision.

Equal-Sized Mixture vs. Pure Natural Data

Equal-sized mixture vs pure natural data

Equal-sized synthetic-natural mixture outperforms pure natural data on visual perception benchmarks while maintaining comparable general-purpose performance. This confirms that verifier-filtered synthetic images provide complementary signals that are hard to obtain from natural data alone.

Epoch Trade-off

Training epoch effects

With a single-task 1k subset, performance converges after ~8 epochs. With the full 10-task set, convergence is reached sooner. Larger and more diverse datasets enable more efficient training.

Key Findings

The visual perception bottleneck in VLMs is, to a large extent, a data problem.

Even at small data scales, synthetic images already exhibit preliminary task-aware scaling behavior.

Synthetic images paired with synthetic QA can outperform natural VQA data constructed with traditional pipelines.

Citation

@article{zhou2025visionfoundry,
title={VisionFoundry: Teaching VLMs Visual Perception with Synthetic Images},
author={Guanyu Zhou and Yida Yin and Wenhao Chai and Shengbang Tong and Xingyu Fu and Zhuang Liu},
journal={arXiv preprint arXiv:2604.09531},
year={2025}
}