i1: A Simple and Fully Open Recipe for Strong Text-to-Image Models

Boya Zeng, Tianze Luo, Shu Pu, Jucheng Shen, Taiming Lu, Gabriel Sarch, Zhuang Liu

† Corresponding author

Princeton University

Benchmark comparison showing i1 among leading text-to-image models.

We investigate the design space of text-to-image diffusion models to understand how modeling and data choices affect model capabilities. This exploration culminates in i1, a 3B-parameter model that performs competitively with leading open-weight models at 1024-resolution, as measured by the average percentage score across GenEval, DPG-Bench, PRISM, CVTG-2K, and LongText-Bench. We open-source our model, code, and data to support future research.

Abstract

Diffusion models have consistently driven progress in text-to-image generation. However, it is challenging to attribute recent progress to specific modeling and data choices: state-of-the-art open-weight models provide limited ablations, and do not disclose their training data and full training details. The research community needs fully open (weights, data, and code) models as a foundation for further research; yet existing fully open models still fall significantly short of leading models in performance. In this project, we conduct a systematic investigation of the modeling and data design choices in text-to-image diffusion training and inference with 300+ controlled experiments totaling 700K+ TPU v6e hours. Our experiments highlight several empirical findings (e.g., equal weighting is a strong default for mixing curated datasets) and simple design decisions (e.g., larger text encoder adapters improve performance with minimal added parameters) for training strong models. Guided by these insights, we train i1, a 3B-parameter text-to-image diffusion model using only publicly available datasets. i1 is competitive with leading models on five representative benchmarks (GenEval, DPG, PRISM, CVTG-2K, and LongText), and outperforms the best existing fully open model by 29.5 absolute percentage points on average. We provide the i1 checkpoints, training and inference code, and the data processing pipeline. Together, our findings and the i1 recipe establish a practical foundation for future open research in text-to-image diffusion models.

Curated general image generation examples from i1.

Curated showcase of i1 in general image generation.

Curated text-rendering examples from i1.

Curated showcase of i1 in text-rendering.

Motivation

Recent text-to-image models are strong, but their progress is hard to attribute. Many leading open-weight systems release checkpoints without the training data or full recipe, and most technical reports bundle architecture, training, and data decisions into a single model with limited ablation.

We study the design space through controlled experiments, then combine the designs that work into i1. The result is a fully open model that uses public datasets, simple components, and a recipe that can be studied and built upon.

300+ controlled modeling and data experiments
700K+ TPU v6e hours in the investigation
3B parameters in the final i1 model
29.5 percentage points average gain over the best prior fully open model

High-level illustration of the final i1 model recipe.

High-level illustration of our final i1 model. Rather than introducing major new network modules, i1 combines carefully selected modeling and data design choices into a simple and strong text-to-image model.

Controlled Experiment Setup

All controlled experiments start from the same 256-resolution pre-training baseline and vary one design choice at a time. This makes the ablations interpretable: modeling changes are not accumulated across experiments, and data experiments use the same evaluation protocol.

The baseline uses a LightningDiT-style cross-attention backbone with QK-norm and long skip connections, T5Gemma-2B as the text encoder, and the FLUX.2 VAE. We evaluate with DPG-Bench, PRISM-Bench, and LongText-Bench to cover general prompt following, aesthetics, and text rendering.

High-level illustration of the baseline controlled-experiment model.

High-level illustration of our baseline for controlled experiments. We build a standard cross-attention architecture on top of LightningDiT and add QK-norm for training stability. We also include long skip connections, an underused design choice that we revisit and find helpful for performance.
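
As a rough illustration of why QK-norm helps training stability, the sketch below normalizes each query and key vector before the dot product, which bounds the attention logit regardless of activation scale. It assumes RMS normalization without learned gains; the exact normalization used in the baseline is not stated here, and the function names are illustrative only.

```python
import math

def rms_norm(vec, eps=1e-6):
    """Scale a vector to (roughly) unit root-mean-square; no learned gain (an assumption)."""
    rms = math.sqrt(sum(v * v for v in vec) / len(vec) + eps)
    return [v / rms for v in vec]

def attention_logit(q, k, use_qk_norm=True):
    """Scaled dot-product logit for a single query/key pair."""
    if use_qk_norm:
        q, k = rms_norm(q), rms_norm(k)
    return sum(a * b for a, b in zip(q, k)) / math.sqrt(len(q))

# Deliberately large activations, as can occur late in training.
q = [40.0, -35.0, 20.0, 5.0]
k = [38.0, -30.0, 25.0, 2.0]
raw_logit = attention_logit(q, k, use_qk_norm=False)
normed_logit = attention_logit(q, k, use_qk_norm=True)
```

With normalization, the logit magnitude cannot exceed the square root of the head dimension, whereas the raw logit grows with the activation scale.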

| baseline variant | DPG ↑ | PRISM ↑ | LongText ↑ |
|---|---|---|---|
| cross-attention | 84.66 | 56.4 | 0.211 |
| single-stream | 85.89 | 55.6 | 0.293 |
| dual-stream | 86.82 | 58.3 | 0.439 |

Benchmark performance of baselines. The cross-attention variant is the default controlled-experiment baseline, while selected findings are validated across single-stream and dual-stream variants.

Model Designs

Text and Noise Conditioning

The text encoder study compares encoder-decoder models, decoder-only LLMs/VLMs, and CLIP-style encoders under the same image-generation training setup. The strongest options are encoder-decoder models, especially T5Gemma variants.

Text encoder benchmark performance across DPG, PRISM, and LongText.

Text encoders' performance across benchmarks. Under our modeling setup, the encoder-decoder T5Gemma models outperform representative decoder-only LLM/VLMs and CLIP-style models.

Finding 1

Both encoder-decoder models and decoder-only LLMs/VLMs can be competitive text encoders for text-to-image diffusion models.

Design: We use T5Gemma-2B, an encoder-decoder model, as the text encoder in i1 since it is one of the strongest models in our comparison.

Combining multiple text encoders helps, but further ablations show the gain can largely be reproduced by increasing adapter capacity for a single encoder. This gives a simpler and cheaper route because the text sequence length does not grow.

| text encoder | DPG ↑ | PRISM ↑ | LongText ↑ |
|---|---|---|---|
| T5Gemma-2B | 84.66 | 56.4 | 0.211 |
| repeat w/ 1 MLP | 84.93 | 55.8 | 0.225 |
| repeat w/ 2 MLP | 85.09 | 56.5 | 0.309 |

Concatenating two copies of the T5Gemma-2B feature sequence and passing each copy through its own MLP adapter (equivalent to combining two T5Gemma-2B text encoders) yields an improvement similar to that from combining different text encoders, whereas using one shared MLP adapter for both copies does not. This suggests the improvement may come from the additional adapter parameters, not from having separate text encoders.
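
The two "repeat" variants can be sketched as follows. This is a toy illustration under stated assumptions: a single random linear map stands in for the MLP adapter, the feature dimensions are made up, and all function names are hypothetical. It only shows that both variants double the text sequence length, while only the separate-adapter variant adds parameters.

```python
import random

def make_adapter(dim_in, dim_out, rng):
    """Stand-in adapter: one random linear map (the source uses MLP adapters)."""
    weight = [[rng.gauss(0.0, 0.02) for _ in range(dim_in)] for _ in range(dim_out)]

    def apply(tokens):
        return [[sum(w * x for w, x in zip(row, tok)) for row in weight] for tok in tokens]

    apply.num_params = dim_in * dim_out
    return apply

def repeat_with_shared_adapter(tokens, adapter):
    """'repeat w/ 1 MLP': both copies pass through the same adapter."""
    projected = adapter(tokens)
    return projected + projected, adapter.num_params

def repeat_with_separate_adapters(tokens, adapter_a, adapter_b):
    """'repeat w/ 2 MLP': each copy gets its own adapter."""
    sequence = adapter_a(tokens) + adapter_b(tokens)
    return sequence, adapter_a.num_params + adapter_b.num_params

rng = random.Random(0)
text_features = [[rng.gauss(0.0, 1.0) for _ in range(8)] for _ in range(5)]  # 5 tokens, dim 8
shared_seq, shared_params = repeat_with_shared_adapter(text_features, make_adapter(8, 16, rng))
separate_seq, separate_params = repeat_with_separate_adapters(
    text_features, make_adapter(8, 16, rng), make_adapter(8, 16, rng)
)
```

In the shared case the second half of the sequence is an exact copy of the first, so the backbone receives no new information; in the separate case the two halves differ because the adapters differ.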

Larger text encoder adapter performance across backbone architectures.

Using larger adapters for the text encoder consistently improves performance across backbone architectures. Beyond 2 transformer blocks, larger adapters bring only marginal further gains.

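Adapter columns: MLP = MLP adapter (default); Transformer = transformer adapter (1x block).

| backbone | variant | #params (MLP) | DPG (MLP) | PRISM (MLP) | LongText (MLP) | #params (Transformer) | DPG (Transformer) | PRISM (Transformer) | LongText (Transformer) |
|---|---|---|---|---|---|---|---|---|---|
| cross-attn | default | 0.89B | 84.66 | 56.4 | 0.211 | 0.91B | 86.33 | 58.7 | 0.414 |
| cross-attn | +2 encoders | 0.90B | 85.37 ↑ 0.71 | 58.8 ↑ 2.4 | 0.272 ↑ 0.061 | 0.94B | 86.47 ↑ 0.14 | 59.5 ↑ 0.8 | 0.491 ↑ 0.077 |
| cross-attn | no pooled emb | 0.89B | 85.98 ↑ 1.32 | 57.5 ↑ 1.1 | 0.391 ↑ 0.180 | 0.91B | 86.37 ↑ 0.04 | 59.4 ↑ 0.7 | 0.446 ↑ 0.032 |
| cross-attn | no timestep | 0.89B | 82.58 ↓ 2.08 | 54.7 ↓ 1.7 | 0.185 ↓ 0.026 | 0.91B | 84.71 ↓ 1.62 | 58.9 ↑ 0.2 | 0.418 ↑ 0.004 |
| cross-attn | no AdaLN | 0.66B | 84.99 ↑ 0.33 | 57.4 ↑ 1.0 | 0.351 ↑ 0.140 | 0.67B | 85.13 ↓ 1.20 | 59.7 ↑ 1.0 | 0.413 ↓ 0.001 |
| single-stream | default | 0.82B | 85.89 | 55.6 | 0.293 | 0.83B | 87.64 | 60.0 | 0.472 |
| single-stream | +2 encoders | 0.83B | 84.89 ↓ 1.00 | 56.3 ↑ 0.7 | 0.439 ↑ 0.146 | 0.87B | 87.29 ↓ 0.35 | 59.0 ↓ 1.0 | 0.428 ↓ 0.044 |
| single-stream | no AdaLN | 0.57B | 87.38 ↑ 1.49 | 59.0 ↑ 3.4 | 0.390 ↑ 0.097 | 0.58B | 87.39 ↓ 0.25 | 59.5 ↓ 0.5 | 0.410 ↓ 0.062 |
| dual-stream | default | 1.24B | 86.82 | 58.3 | 0.439 | 1.25B | 87.67 | 60.7 | 0.576 |
| dual-stream | +2 encoders | 1.25B | 87.34 ↑ 0.52 | 59.6 ↑ 1.3 | 0.514 ↑ 0.075 | 1.29B | 87.76 ↑ 0.09 | 60.8 ↑ 0.1 | 0.588 ↑ 0.012 |
| dual-stream | no AdaLN | 1.01B | 87.82 ↑ 1.00 | 60.3 ↑ 2.0 | 0.508 ↑ 0.069 | 1.02B | 87.38 ↓ 0.29 | 60.7 0.0 | 0.554 ↓ 0.022 |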

Impact of text and noise conditioning when using an MLP vs transformer adapter. A larger transformer adapter improves performance with minimal added parameters, multiple encoders provide much smaller benefit with the larger adapter, and removing AdaLN conditioning barely degrades performance.

Finding 2

The gains from using multiple text encoders can be similarly captured by increasing the adapter capacity for a single text encoder. Compared to using multiple encoders, increasing adapter capacity has lower memory and compute cost because it does not increase the text sequence length.

Design: We use a single, strong text encoder with an expressive text encoder adapter in i1.

Finding 3

Despite the large number of parameters introduced by AdaLN, conditioning on pooled text embeddings, timestep embeddings, or both through AdaLN provides only marginal benefit.

Design: We do not use AdaLN in i1.
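
To put "large number of parameters" in concrete terms, the short calculation below takes the MLP-adapter parameter counts from the conditioning table and computes the share of each baseline that disappears in the "no AdaLN" row. It assumes the parameter difference between the "default" and "no AdaLN" rows is entirely due to AdaLN; the variable and function names are illustrative.

```python
# (#params with AdaLN, #params without AdaLN), in billions, MLP-adapter columns.
PARAMS_B = {
    "cross-attn": (0.89, 0.66),
    "single-stream": (0.82, 0.57),
    "dual-stream": (1.24, 1.01),
}

def adaln_share(with_adaln, without_adaln):
    """Fraction of the model's parameters removed when AdaLN is dropped."""
    return (with_adaln - without_adaln) / with_adaln

shares = {name: round(100 * adaln_share(*pair), 1) for name, pair in PARAMS_B.items()}
# shares == {"cross-attn": 25.8, "single-stream": 30.5, "dual-stream": 18.5}
```

Under this assumption, AdaLN accounts for roughly a fifth to a third of each baseline's parameters.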

Backbone Architecture

For the backbone, the controlled experiments revisit long skip connections and compare cross-attention, single-stream, and dual-stream families across model sizes. The strongest recipe uses a dual-stream MMDiT backbone with long skip connections.

Long skip connection scaling results.

Long skip connections can improve the performance-parameter trade-off for dual-stream models.

Finding 4

Long skip connections can improve the performance-parameter trade-off.

Design: We use long skip connections in i1.
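
A minimal sketch of the wiring pattern, assuming U-ViT-style long skips in which the output of an early block is merged into the input of its mirrored late block. How i1 merges the skipped features is not described here, so `merge` is left as a caller-supplied function, and the toy blocks below are placeholders rather than transformer layers.

```python
def forward_with_long_skips(x, blocks, merge):
    """Run `blocks` in order, feeding each early output to its mirrored late block."""
    num_blocks = len(blocks)
    half = num_blocks // 2
    saved = []  # outputs of the first `half` blocks, consumed last-in-first-out
    for index, block in enumerate(blocks):
        if index >= num_blocks - half:
            x = merge(x, saved.pop())
        x = block(x)
        if index < half:
            saved.append(x)
    return x

# Toy stand-ins: each "block" adds 1, and merging is plain addition.
blocks = [lambda h: h + 1 for _ in range(4)]
with_skips = forward_with_long_skips(0, blocks, merge=lambda h, skip: h + skip)
without_skips = forward_with_long_skips(0, blocks, merge=lambda h, skip: h)
```

With four blocks, block 3 receives the output of block 2 and block 4 receives the output of block 1; with an odd number of blocks the middle block has no skip partner.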

Backbone family scaling comparison.

Backbone family. We compare cross-attention, single-stream, and dual-stream backbones across model sizes and find that the dual-stream backbone achieves the best overall performance.

Finding 5

The dual-stream backbone has the best trade-off between performance and parameter count among the cross-attention, single-stream, and dual-stream backbone families.

Design: We use the dual-stream backbone in i1.

Data Designs

Synthetic Captions and Prompt Rewrite

The data experiments first isolate caption generation on ImageNet-22K. Stronger VLM captioners lead to better downstream text-to-image models, so the final recipe uses long captions generated by Qwen3-VL-30B-A3B.

Synthetic captioner benchmark comparison.

The choice of synthetic captioner is important for downstream text-to-image performance. Due to resource constraints, we caption and train only on ImageNet-22K images rather than the full image dataset.

Finding 6

The choice of VLM used to synthesize captions for training images has a substantial impact on downstream text-to-image performance.

Design: We use Qwen3-VL-30B-A3B to generate synthetic captions for training i1, since captions from this model lead to strong downstream performance.

Training on long captions produces stronger overall models, but short GenEval prompts expose a train-test length mismatch. Repeating short prompts helps, but LLM-based prompt rewriting is more natural and gives the strongest GenEval score when paired with long-caption training.

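| % of long captions in training captions | original prompts (short) | repeated prompts (?×) | repeated prompts (12×) | repeated prompts (20×) | rewritten prompts (long) |
|---|---|---|---|---|---|
| 0% | 0.47 | 0.55 | 0.34 | 0.24 | 0.60 |
| 20% | 0.47 | 0.54 | 0.53 | 0.50 | 0.67 |
| 40% | 0.35 | 0.59 | 0.55 | 0.54 | 0.70 |
| 60% | 0.37 | 0.60 | 0.57 | 0.54 | 0.73 |
| 80% | 0.26 | 0.57 | 0.54 | 0.47 | 0.73 |
| 100% | 0.17 | 0.48 | 0.49 | 0.46 | 0.73 |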

Training captions and inference prompts should have aligned lengths (each number is a GenEval score). Training only on long captions and using LLM-based prompt rewriting to increase inference prompt length leads to the strongest performance.
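
Prompt repetition is the simpler of the two length-matching strategies and can be sketched in a few lines. The separator between copies is an assumption, since how the repeated copies are joined is not specified here; LLM-based rewriting is not shown because it depends on the rewriting model and meta-prompt.

```python
def repeat_prompt(prompt, times, separator=" "):
    """Lengthen a short prompt by repeating it; the separator is an assumption."""
    return separator.join([prompt.strip()] * times)

short_prompt = "a photo of a wine glass and a bear"
long_prompt = repeat_prompt(short_prompt, 12)
```

Repetition adds length but no new detail, which may be why rewriting, which expands the prompt into a caption-like description, scores higher in the table.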

Prompt: a photo of a wine glass and a bear
Prompt: a photo of a zebra right of a parking meter
Panels, left to right: train short, test original (short); train long, test original (short); train long, test repeated 12× (long); train long, test rewritten (long).

Examples from models trained on ImageNet-22K caption variants and tested on GenEval prompt variants. Training on long captions leads to weaker performance on short prompts, but prompt repetition and rewrite mitigate this.

Finding 7

Training on short captions weakens overall performance, whereas training on long captions yields stronger models that nevertheless perform poorly on short test prompts. Prompt rewriting addresses this weakness by expanding short prompts, making training on long captions preferable overall.

Design: We only use long captions to train i1, and apply inference-time prompt rewriting.

Dataset Mixing

Single-dataset experiments show that ImageNet-22K and YFCC are strong real-image sources, iNaturalist is weak in this setup, and text rendering depends heavily on specialized text-rich datasets such as TextAtlas. Subsampling each dataset to 1M images leaves the broad trends intact.

Single-dataset benchmark results. Panels: full dataset; 1M subset.

Benchmark performance for single-dataset training. Among real datasets, ImageNet-22K and YFCC perform best, while iNaturalist performs worst. Text rendering capability relies strongly on specialized text-rich image datasets. Across most datasets, performance changes only marginally after subsampling each dataset to 1M.

Ablation of real, synthetic, and text-rendering image groups.

Real, synthetic, and text-rendering images are all important for model performance. Removing any of them leads to inferior performance on at least one benchmark.

Threshold-based dataset weighting results.

Threshold-based weighting. By default, the sampling weight of a dataset is its number of images. We explore dataset-level balancing by capping the sampling weights for all datasets at four hand-picked thresholds and find that lower thresholds, meaning more even weights, generally lead to stronger performance.
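
The capping scheme described above can be sketched as follows. The dataset names and sizes are hypothetical and chosen only to show how lowering the threshold moves the mixture from size-proportional toward equal weighting.

```python
def capped_sampling_weights(dataset_sizes, threshold=None):
    """Sampling probability per dataset, with sizes capped at `threshold`."""
    capped = {
        name: size if threshold is None else min(size, threshold)
        for name, size in dataset_sizes.items()
    }
    total = sum(capped.values())
    return {name: value / total for name, value in capped.items()}

# Hypothetical dataset sizes in images, for illustration only.
sizes = {"large": 50_000_000, "medium": 5_000_000, "small": 1_000_000}
by_size = capped_sampling_weights(sizes)                       # default: weight = #images
capped = capped_sampling_weights(sizes, threshold=5_000_000)   # large and medium now tie
equal = capped_sampling_weights(sizes, threshold=1_000_000)    # every dataset weighted equally
```

Once the threshold is at or below the smallest dataset's size, all datasets receive the same weight, which is the equal-weighting limit.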

| datasets | DPG ↑ | PRISM ↑ | LongText ↑ |
|---|---|---|---|
| full | 85.14 | 58.2 | 0.335 |
| remove iNaturalist | 85.56 | 58.7 | 0.384 |
| remove iNaturalist + Megalith | 85.13 | 59.0 | 0.438 |
| remove iNaturalist + Megalith + Places | 85.18 | 57.9 | 0.453 |

Removing the weakest real-image datasets one by one under equal weighting, based on single-dataset results. Removing iNaturalist improves all benchmark scores, while further removing Megalith and Places does not.

Finding 8

Training the model on equal numbers of images from each dataset, counting repetitions (i.e., equal weighting across datasets), is a simple and effective dataset mixing strategy.

Design: We equally weight all selected datasets at each training stage of i1.

Repeating data is less damaging than one might expect once the mixture is diverse. Even when every dataset is reduced to 0.4M images, performance drops only slightly compared with the full mixture.

ImageNet-22K subsampling results.

Subsampling the ImageNet-22K dataset has little effect on performance; performance only substantially degrades at 0.1M. Using more captions per image leads to a stronger boost under limited image data.

| subset size for each dataset | unique #imgs seen | DPG ↑ | PRISM ↑ | LongText ↑ |
|---|---|---|---|---|
| full | 88.1M* | 85.56 | 58.7 | 0.384 |
| 1.0M | 11.0M | 85.34 | 57.7 | 0.384 |
| 0.4M | 4.4M | 84.67 | 57.7 | 0.382 |
| 0.1M | 1.1M | 84.71 | 57.4 | 0.349 |

Subsampling mixtures of datasets. Starting from the final data recipe, subsampling each dataset to 0.4M images gives 4.4M instead of 88.1M unique images seen and only slightly degrades model performance. *The datasets contain 162.9M images in total, but 88.1M better estimates the number of unique images seen during training.
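
One plausible way to read the "unique #imgs seen" column is sketched below: under equal weighting, every dataset receives the same share of the training samples, so a dataset larger than its share is only partly visited while a smaller one is repeated. This is an illustrative assumption, not the authors' stated estimator. The sample budget is hypothetical, and the count of 11 datasets is inferred from the table (1.0M images per dataset giving 11.0M unique images).

```python
def unique_images_seen(dataset_sizes, samples_drawn):
    """Estimate unique images seen when each dataset gets an equal share of samples."""
    share = samples_drawn / len(dataset_sizes)
    return sum(min(size, share) for size in dataset_sizes)

NUM_DATASETS = 11            # inferred from the table, not stated directly
SAMPLE_BUDGET = 100e6        # hypothetical number of training samples drawn

subsampled = {
    subset: unique_images_seen([subset] * NUM_DATASETS, SAMPLE_BUDGET)
    for subset in (1.0e6, 0.4e6, 0.1e6)
}

# Hypothetical unequal sizes: the largest dataset is only partly visited.
partial = unique_images_seen([50e6, 2e6, 1e6], samples_drawn=30e6)
```

The subsampled rows reproduce the 11.0M, 4.4M, and 1.1M entries, and the unequal example shows how the unique count can fall well below the total number of images in the mixture.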

Finding 9

With a diverse mix of image datasets, using fewer unique images and more training epochs causes only marginal performance degradation in text-to-image diffusion training.

Design: We subsample high-resolution images to reduce storage requirements for i1 training.

i1 Training and Evaluation

The final model combines the controlled-experiment findings: a dual-stream MMDiT backbone, long skip connections, a single T5Gemma-2B text encoder with a 2-block transformer adapter, no AdaLN, both sinusoidal and RoPE positional embeddings, shared sandwich normalizations, long captions, equal dataset weighting, and prompt rewriting at inference.

Architecture of i1. Panels: overall architecture; one transformer block.

The architecture of our final i1 model. Building on an MMDiT backbone, we use a large text encoder adapter consisting of 2 transformer blocks, remove noise-conditioning (i.e., AdaLN), add long skip connections, combine both sinusoidal and RoPE positional embeddings, and share sandwich normalizations across text and image streams.

| training stage | #images | training steps | batch size | training timestep shift value | TPU v5p-128 hours |
|---|---|---|---|---|---|
| 256-resolution | 162.9M | 2.0M | 512 | N/A | 383.0 |
| 512-resolution | 9.7M | 0.5M | 512 | N/A | 174.4 |
| 1024-resolution | 4.3M | 0.3M | 128 | 3.33 | 150.9 |

Training configurations and compute resources for the final i1 model at each training stage.
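
The table values can be turned into samples seen and average passes over each stage's data with simple arithmetic. The sketch below assumes one image per batch element and treats "passes" as a dataset-wide average; under equal dataset weighting, individual datasets would be repeated more or less often than this average.

```python
STAGES = [
    # (stage, #images, training steps, batch size, TPU v5p-128 hours)
    ("256-resolution", 162.9e6, 2.0e6, 512, 383.0),
    ("512-resolution", 9.7e6, 0.5e6, 512, 174.4),
    ("1024-resolution", 4.3e6, 0.3e6, 128, 150.9),
]

def stage_summary(images, steps, batch_size):
    """Return (samples seen, average passes over the stage's images)."""
    samples = steps * batch_size
    return samples, samples / images

summaries = {
    name: stage_summary(images, steps, batch)
    for name, images, steps, batch, _ in STAGES
}
total_hours = sum(hours for *_, hours in STAGES)
```

This gives about 1.02B samples (roughly 6.3 average passes) at 256-resolution, 256M samples (roughly 26.4 passes) at 512-resolution, and 38.4M samples (roughly 8.9 passes) at 1024-resolution, for about 708.3 TPU v5p-128 hours in total.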

Benchmark performance during 256-resolution pre-training.

Benchmark performance of i1 during 256-resolution pre-training. Performance stabilizes around 500K iterations and largely converges by 2M iterations.

Prompt (PRISM): Argentinian soccer star Lionel Messi in the heat of the 2022 FIFA World Cup Final against France... (240 words)
Prompt (LongText): An appealing poster designed in warm bohemian style announcing a folk music concert event... (109 words)
Panels, left to right: 100K iterations; 200K iterations; 500K iterations; 2M iterations.

Example generated images at different iterations of 256-resolution training. Overall image quality and text-rendering capability improve throughout the training run, mirroring the benchmark score improvements.

Benchmark performance at 512-resolution with different training sets.

Benchmark performance of i1 at 512-resolution with different training sets. PRISM and LongText improve substantially with 512-resolution training, even when text rendering data is not used.

Text rendering examples. Panels: 256-resolution model; 512-resolution model.

Text rendering improves substantially after 512-resolution training, as demonstrated by example images generated from our 256-resolution and 512-resolution checkpoints using the same input prompts from LongText-Bench.

Finding 10

Strong high-resolution generation does not require the high-resolution training data to match the full breadth of the low-resolution pre-training data.

Design: We do not further expand the resolution-filtered high-resolution datasets for i1 training.

At inference, i1 uses CFG scale 12, Rescale CFG with strength 1, and a single prompt-rewriting meta-prompt for all input prompts.
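
For readers unfamiliar with Rescale CFG, the sketch below shows the commonly used formulation: standard classifier-free guidance followed by rescaling the guided prediction so that its standard deviation matches the conditional prediction. It is a sketch under stated assumptions and not the i1 inference code: predictions are flat lists of floats, the guidance convention `uncond + scale * (cond - uncond)` is assumed, and the function name is illustrative.

```python
from statistics import pstdev

def rescaled_cfg(cond, uncond, scale=12.0, rescale=1.0):
    """Classifier-free guidance with std rescaling, on flat lists of floats."""
    guided = [u + scale * (c - u) for c, u in zip(cond, uncond)]
    std_cond, std_guided = pstdev(cond), pstdev(guided)
    if std_guided == 0.0:
        return guided
    # Match the guided prediction's std to the conditional prediction's std.
    matched = [g * std_cond / std_guided for g in guided]
    # `rescale` blends between plain CFG (0.0) and the fully rescaled output (1.0).
    return [rescale * m + (1.0 - rescale) * g for m, g in zip(matched, guided)]

cond = [0.5, -1.0, 2.0, 0.0]      # toy conditional prediction
uncond = [0.4, -0.8, 1.5, 0.1]    # toy unconditional prediction
out = rescaled_cfg(cond, uncond)                  # strength 1: fully rescaled
plain = rescaled_cfg(cond, uncond, rescale=0.0)   # strength 0: ordinary CFG
```

At strength 1 the output has the same standard deviation as the conditional prediction, which counteracts the over-amplification that a guidance scale as large as 12 would otherwise introduce.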

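| model | #params | GenEval | DPG-Bench | PRISM | CVTG-2K | LongText-Bench |
|---|---|---|---|---|---|---|
| API call only | | | | | | |
| GPT Image 1 [High] | - | 0.84* | 85.15* | - | 0.8569* | 0.956* |
| Seedream 3.0 | - | 0.84* | 88.27* | - | 0.5924* | 0.896* |
| Open weights only | | | | | | |
| FLUX.1 [Dev] | 12B | 0.66* | 83.84* | 65.1 | 0.4965* | 0.607* |
| SD3 Medium | 2B | 0.62* | 84.08* | 61.9 | 0.4037 | 0.322 |
| Janus-Pro-7B | 7B | 0.80* | 84.19* | 60.0 | 0.0667 | 0.019* |
| BAGEL | 14B | 0.88* | 85.44 | 61.8 | 0.3642 | 0.373* |
| HiDream-I1-Full | 17B | 0.83* | 85.89* | 66.1 | 0.7738 | 0.543* |
| Lumina-Image 2.0 | 3B | 0.73* | 87.20* | 63.5 | 0.1577 | 0.088 |
| Z-Image | 6B | 0.84* | 88.14* | 74.2 | 0.8671* | 0.935* |
| Qwen-Image | 20B | 0.87* | 88.32* | 73.9 | 0.8288* | 0.943* |
| Open weights + data + training code | | | | | | |
| BLIP3o-4B | 4B | 0.77 | 79.73 | 53.2 | 0.0353 | 0.023 |
| PixNerd | 1B | 0.73* | 80.9* | 53.3 | 0.0006 | 0.020 |
| DeCo | 1B | 0.86* | 81.4* | 53.1 | 0.0014 | 0.003 |
| BLIP3o-N-S | 3B | 0.87 | 81.98 | 56.8 | 0.2493 | 0.110 |
| BLIP3o-N-G-G | 3B | 0.90 | 81.93 | 57.5 | 0.2442 | 0.114 |
| BLIP3o-N-G-T | 3B | 0.86 | 79.77 | 56.8 | 0.3330 | 0.153 |
| i1 (Ours) | 3B | 0.84 | 86.73 | 70.1 | 0.8531 | 0.922 |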

Performance on representative text-to-image benchmarks. Among fully open models, i1 achieves state-of-the-art performance on four of the five benchmarks (all except GenEval), and it outperforms several leading open-weight models.
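
The average percentage score used to compare models can be recomputed from the fully open rows of the table. The sketch below assumes that GenEval, CVTG-2K, and LongText-Bench are reported on a 0 to 1 scale and are multiplied by 100 before averaging, while DPG-Bench and PRISM are already on a 0 to 100 scale; the names are illustrative.

```python
# (GenEval, DPG-Bench, PRISM, CVTG-2K, LongText-Bench) for the fully open models.
SCORES = {
    "BLIP3o-4B": (0.77, 79.73, 53.2, 0.0353, 0.023),
    "PixNerd": (0.73, 80.9, 53.3, 0.0006, 0.020),
    "DeCo": (0.86, 81.4, 53.1, 0.0014, 0.003),
    "BLIP3o-N-S": (0.87, 81.98, 56.8, 0.2493, 0.110),
    "BLIP3o-N-G-G": (0.90, 81.93, 57.5, 0.2442, 0.114),
    "BLIP3o-N-G-T": (0.86, 79.77, 56.8, 0.3330, 0.153),
    "i1": (0.84, 86.73, 70.1, 0.8531, 0.922),
}
ON_UNIT_SCALE = (True, False, False, True, True)  # assumption: these are 0-1 scores

def average_percentage(scores):
    """Average of the five benchmark scores after converting all to percentages."""
    as_percent = [100 * s if unit else s for s, unit in zip(scores, ON_UNIT_SCALE)]
    return sum(as_percent) / len(as_percent)

averages = {model: average_percentage(s) for model, s in SCORES.items()}
best_prior = max((m for m in averages if m != "i1"), key=averages.get)
gap = averages["i1"] - averages[best_prior]
```

With the rounded values in the table, i1 averages about 83.7 and the best prior fully open model (BLIP3o-N-G-T) about 54.2, a gap of roughly 29.4 points, which is in line with the reported 29.5.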

Citation

@article{zeng2026i1,
  title={i1: A Simple and Fully Open Recipe for Strong Text-to-Image Models},
  author={Zeng, Boya and Luo, Tianze and Pu, Shu and Shen, Jucheng and Lu, Taiming and Sarch, Gabriel and Liu, Zhuang},
  journal={arXiv preprint arXiv:2606.11289},
  year={2026}
}