VisionFoundry

Teaching VLMs Visual Perception with Synthetic Images

Vision-language models (VLMs or MLLMs) still struggle with visual perception tasks such as spatial understanding and viewpoint reasoning. We hypothesize that this perception bottleneck stems from insufficient supervision in natural image datasets for visual perception skills.

We introduce VisionFoundry, a task-aware synthetic data generation pipeline that uses LLMs to generate questions, answers, and T2I prompts, then synthesizes images and verifies consistency with a VLM, requiring no reference images or human annotation. Using VisionFoundry, we construct VisionFoundry-10K, a 10k synthetic VQA dataset with 10 visual perception tasks.

Models trained on VisionFoundry-10K achieve substantial improvements on visual perception benchmarks: +7% on MMVP and +10% on CV-Bench-3D, while preserving broader capabilities and showing favorable scaling behavior as data size increases.

Guanyu Zhou1, Yida Yin1, Wenhao Chai1, Shengbang Tong2, Xingyu Fu1, Zhuang Liu1
1Princeton University    2New York University
VisionFoundry overview

Visual Perception: A Data Problem?

VLMs still struggle with visual perception tasks. We hypothesize that this perception bottleneck stems from insufficient supervision in natural image datasets for low-level visual skills. Natural image-text corpora may not systematically cover the full combinatorial space of spatial relations, viewpoint variations, and depth orderings.

Can we synthesize targeted supervision from scratch to address these weaknesses, without relying on reference images or expensive human annotation?

Main Results

+7% on MMVP (visual perception)
+5% on CV-Bench-2D (2D spatial reasoning)
+10% on CV-Bench-3D (3D spatial reasoning)
10K samples (1k per task, 10 tasks)

Finetuning on VisionFoundry-10K consistently improves visual perception benchmarks across three VLMs; changes on general-purpose benchmarks are mixed and benchmark-dependent.

VisionFoundry Pipeline

VisionFoundry pipeline overview
1. Text Generation: GPT-5.2 generates task-aware Q&A pairs and detailed T2I prompts from an entity pool.

2. Image Synthesis: Gemini-2.5-Flash-Image (Nano Banana) synthesizes images conditioned on the T2I prompts.

3. Verification & Filtering: Gemini-3-Pro verifies that each image is consistent with the visual statement that determines the answer; inconsistent samples are filtered out.
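The three stages above can be sketched as a simple generate-synthesize-verify loop. This is a minimal illustration, not the authors' implementation: the function names, the `Sample` record, and the stubbed model calls (which stand in for the real GPT-5.2, Nano Banana, and Gemini-3-Pro API calls) are all assumptions made for clarity.

```python
from dataclasses import dataclass
from typing import List, Optional

@dataclass
class Sample:
    """One VQA sample flowing through the pipeline."""
    task: str
    question: str
    answer: str
    t2i_prompt: str
    image: Optional[bytes] = None
    verified: bool = False

def generate_text(task: str, entity: str) -> Sample:
    # Stub for stage 1 (GPT-5.2): produce a task-aware Q&A pair
    # plus a detailed T2I prompt from an entity-pool item.
    return Sample(
        task=task,
        question=f"Is the {entity} left or right of the table?",
        answer=f"The {entity} is to the left of the table.",
        t2i_prompt=f"A photo of a {entity} placed to the left of a wooden table.",
    )

def synthesize_image(t2i_prompt: str) -> bytes:
    # Stub for stage 2 (Gemini-2.5-Flash-Image / Nano Banana).
    return b"\x89PNG-stub"

def verify(image: bytes, statement: str) -> bool:
    # Stub for stage 3 (Gemini-3-Pro): check that the image is
    # consistent with the answer-determining visual statement.
    return bool(image) and bool(statement)

def build_task_subset(task: str, entities: List[str], target: int = 1000) -> List[Sample]:
    """Run the loop until `target` verified samples are collected."""
    kept: List[Sample] = []
    for entity in entities:
        if len(kept) >= target:
            break
        sample = generate_text(task, entity)
        sample.image = synthesize_image(sample.t2i_prompt)
        sample.verified = verify(sample.image, sample.answer)
        if sample.verified:  # filter out misaligned samples
            kept.append(sample)
    return kept
```

With `target=1000` for each of the 10 tasks, this loop would yield the 10,000 verified QA pairs that make up VisionFoundry-10K.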

VisionFoundry-10K

Each task targets a distinct low-level visual perception skill identified as a persistent weakness in contemporary VLMs, and each task contributes 1,000 verified samples, yielding 10,000 QA pairs in total.

Orientation and Direction
Viewpoint and Perspective
Positional and Relational Context
Spatial Relationship
State and Condition
Structural and Physical Characteristics
Color and Appearance
Depth Order
Relative Distance
Real-World Spatial Understanding

Dataset Examples

Orientation and Direction
Q: Is the motorcycle facing toward the camera or away from the camera?
A: Away from the camera.

Viewpoint and Perspective
Q: From what perspective was the photo taken relative to the screen?
A: From a very low, worm's-eye viewpoint near the front row.

Positional and Relational Context
Q: Which object is directly above the suitcase?
A: The backpack.

Spatial Relationship
Q: In the blueprint image, is the tripod located west of the sink?
A: Yes, the tripod is west (left) of the sink.

State and Condition
Q: Is the toy ship's cabin door closed or ajar?
A: Ajar.

Structural and Physical Characteristics
Q: What is the silhouette profile of the barometer in the image?
A: Circular with a small hanging loop at the top.

Color and Appearance
Q: What two colors are the stripes on the sofa?
A: Red and white.

Depth Order
Q: Which object is closer to the camera, the submarine or the knife?
A: The submarine is closer to the camera.

Relative Distance
Q: Which object is closest to the pallet stack with the barcode label?
A: Cup.

Real-World Spatial Understanding
Q: Is the window's bottom sill higher than the mailbox?
A: Yes, the window sill is higher than the top of the mailbox.

Data-Size Effects of Synthetic Supervision

Data-Size Effects of Synthetic Supervision

Performance improves predictably with more synthetic data: visual perception benchmarks trend upward as data size increases, with no sign of saturation, indicating that VisionFoundry provides reliable, high-quality supervision.

Equal-Sized Mixture vs. Pure Natural Data

Equal-sized mixture vs pure natural data

Equal-sized synthetic-natural mixture outperforms pure natural data on visual perception benchmarks while maintaining comparable general-purpose performance. This confirms that verifier-filtered synthetic images provide complementary signals that are hard to obtain from natural data alone.

Epoch Trade-off

Training epoch effects

With a single-task 1k subset, performance converges after ~8 epochs. With the full 10-task set, convergence is reached sooner. Larger and more diverse datasets enable more efficient training.

Key Findings

The visual perception bottleneck in VLMs is, to a large extent, a data problem.

Even at small data scales, synthetic images already exhibit preliminary task-aware scaling behavior.

Synthetic images paired with synthetic QA can outperform natural VQA data constructed with traditional pipelines.

Citation

@article{zhou2025visionfoundry,
title={VisionFoundry: Teaching VLMs Visual Perception with Synthetic Images},
author={Guanyu Zhou and Yida Yin and Wenhao Chai and Shengbang Tong and Xingyu Fu and Zhuang Liu},
journal={arXiv preprint arXiv:2604.09531},
year={2025}
}