VisionFoundry
Teaching VLMs Visual Perception with Synthetic Images
Vision-language models (VLMs or MLLMs) still struggle with visual perception tasks such as spatial understanding and viewpoint reasoning. We hypothesize that this perception bottleneck stems from natural image datasets providing insufficient supervision for visual perception skills.
We introduce VisionFoundry, a task-aware synthetic data generation pipeline that uses LLMs to generate questions, answers, and T2I prompts, then synthesizes images and verifies consistency with a VLM, requiring no reference images or human annotation. Using VisionFoundry, we construct VisionFoundry-10K, a 10k synthetic VQA dataset with 10 visual perception tasks.
Models trained on VisionFoundry-10K achieve substantial improvements on visual perception benchmarks: +7% on MMVP and +10% on CV-Bench-3D, while preserving broader capabilities and showing favorable scaling behavior as data size increases.
Visual Perception: A Data Problem?
VLMs still struggle with visual perception tasks. We hypothesize that this perception bottleneck stems from natural image datasets providing insufficient supervision for low-level visual skills: natural image-text corpora may not systematically cover the full combinatorial space of spatial relations, viewpoint variations, and depth orderings.
Can we synthesize targeted supervision from scratch to address these weaknesses, without relying on reference images or expensive human annotation?
Main Results
Finetuning on VisionFoundry-10K consistently improves visual perception benchmarks across three VLMs, while results on general-purpose benchmarks vary by benchmark.
VisionFoundry Pipeline
Text Generation
GPT-5.2 generates task-aware Q&A pairs and detailed T2I prompts from an entity pool.
Image Synthesis
Gemini-2.5-Flash-Image (Nano Banana) synthesizes images conditioned on T2I prompts.
Verification & Filtering
Gemini-3-Pro verifies that each synthesized image is consistent with the visual statement that determines the answer; misaligned samples are filtered out.
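The three stages above can be sketched as a generate-synthesize-verify loop. This is a minimal illustration, not the released implementation: the function bodies below are stand-in stubs for the actual GPT-5.2, Gemini-2.5-Flash-Image, and Gemini-3-Pro calls, and all names (`Sample`, `build_dataset`, the task and entity strings) are hypothetical.

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class Sample:
    question: str
    answer: str
    t2i_prompt: str
    image: Optional[bytes] = None
    verified: bool = False

def generate_qa_and_prompt(task: str, entity: str):
    # Stage 1 (stub): in the real pipeline an LLM writes a task-aware
    # question, its answer, and a detailed T2I prompt from the entity pool.
    question = f"[{task}] Where is the {entity} relative to the camera?"
    answer = "left"
    t2i_prompt = f"A photo with a {entity} on the left side of the frame."
    return question, answer, t2i_prompt

def synthesize_image(t2i_prompt: str) -> bytes:
    # Stage 2 (stub): a T2I model renders the prompt into an image.
    return f"<image for: {t2i_prompt}>".encode()

def verify(image: Optional[bytes], question: str, answer: str) -> bool:
    # Stage 3 (stub): a VLM checks that the image supports the
    # answer-determining visual statement; here we accept everything.
    return image is not None

def build_dataset(tasks, entities, per_task=2):
    data = []
    for task in tasks:
        kept = []
        for entity in entities:
            if len(kept) >= per_task:
                break
            q, a, p = generate_qa_and_prompt(task, entity)
            img = synthesize_image(p)
            if verify(img, q, a):  # only verified samples are kept
                kept.append(Sample(q, a, p, img, True))
        data.extend(kept)
    return data

dataset = build_dataset(["spatial-relation", "viewpoint"],
                        ["mug", "chair", "bicycle"])
print(len(dataset))  # 2 tasks x 2 verified samples each -> 4
```

Because no stage consumes a reference image or a human label, the loop can be scaled simply by enlarging the entity pool and the per-task quota.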
VisionFoundry-10K
Each of the 10 tasks targets a distinct low-level visual perception skill identified as a persistent weakness in contemporary VLMs, and each contributes 1,000 verified samples, yielding 10,000 QA pairs in total.
Dataset Examples
Data-Size Effects of Synthetic Supervision
Performance improves predictably with more synthetic data: visual perception benchmarks trend upward as the data size increases, with no sign of saturation, indicating that VisionFoundry-10K provides reliable, high-quality supervision.
Equal-Sized Mixture vs. Pure Natural Data
Equal-sized synthetic-natural mixture outperforms pure natural data on visual perception benchmarks while maintaining comparable general-purpose performance. This confirms that verifier-filtered synthetic images provide complementary signals that are hard to obtain from natural data alone.
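The comparison above holds the training budget fixed: the mixture run sees the same total number of samples as the pure-natural run, only with half of them swapped for verifier-filtered synthetic data. A minimal sketch of building such a controlled mixture, with the helper name `equal_mixture` and the toy item lists being our own illustration, not the paper's code:

```python
import random

def equal_mixture(synthetic, natural, total, seed=0):
    """Draw total//2 synthetic and the rest natural samples, shuffled.

    Hypothetical helper: fixing `total` keeps the training budget
    identical to a pure-natural baseline of the same size.
    """
    rng = random.Random(seed)  # seeded for a reproducible comparison
    half = total // 2
    mix = rng.sample(synthetic, half) + rng.sample(natural, total - half)
    rng.shuffle(mix)
    return mix

syn = [f"syn-{i}" for i in range(100)]  # toy stand-ins for real samples
nat = [f"nat-{i}" for i in range(100)]
mix = equal_mixture(syn, nat, 20)
print(len(mix))  # 20: same budget as a 20-sample pure-natural run
```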
Epoch Trade-off
With a single-task 1k subset, performance converges only after ~8 epochs; with the full 10-task set, convergence is reached in fewer epochs. Larger and more diverse datasets therefore make training more efficient.
Key Findings
The visual perception bottleneck in VLMs is, to a large extent, a data problem.
On small-scale datasets, synthetic images can demonstrate preliminary task-aware scaling behavior.
Fully synthetic images paired with synthetic QA can outperform natural VQA data constructed with traditional annotation pipelines.
Dataset