Microsoft open-sources the 3.8B text-to-image model Lens with 0.84-second inference

According to ME News, on May 25 (UTC+8), Beating Monitor reported that Microsoft has open-sourced the Lens series, a 3.8B-parameter text-to-image foundation model. Lens is highly efficient to train while matching or surpassing the performance of mainstream 6B-class models. In a comparison normalized by peak BF16 TFLOPS (excluding caption regeneration costs), training consumed only about 19.3% of the compute required by Z-Image from Alibaba's Tongyi Lab.
Optimizations in both data and architecture drive this cost reduction. The training dataset, Lens-800M, contains 800 million image-text pairs. Unlike traditional short-text annotations, the captions for all samples were generated by GPT-4.1, with an average length of 109 words, giving each sample high semantic density. The model architecture uses 48 MMDiT blocks and a FLUX.2 semantic VAE. Text features come from GPT-OSS, with the representations from layers 4, 12, 18, and 24 concatenated to improve prompt adherence and multilingual generalization.
Microsoft has released three weight variants for different deployment environments. The default Lens model is fine-tuned with reinforcement learning (RL) and generates a 1024x1024 image in 3.15 seconds over 20 steps on a single NVIDIA H100 GPU. The distilled variant, Lens-Turbo, completes inference in just 4 steps, producing an image at the same resolution in 0.84 seconds. Lens-Base is the pure base model, with neither RL nor distillation applied, and defaults to 50 generation steps. The entire series natively supports arbitrary aspect ratios from 1:2 to 2:1 and resolutions up to 1440x1440.
The model weights are now available on Hugging Face in Safetensors and Diffusers formats under the MIT license. The inference code has been released simultaneously on GitHub.
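To put the reported latency figures side by side, the short Python sketch below derives the end-to-end speedup and the average time per step from the numbers quoted above. It is only arithmetic on the reported values, not a benchmark; the dictionary and function names are illustrative and not part of any Lens release.

```python
# Figures as reported in the article: one 1024x1024 image on a single NVIDIA H100.
# Names below are hypothetical helpers for illustration only.
REPORTED = {
    "Lens": {"steps": 20, "seconds": 3.15},
    "Lens-Turbo": {"steps": 4, "seconds": 0.84},
}


def seconds_per_step(variant: str) -> float:
    """Average wall-clock time per step (total time divided by step count)."""
    entry = REPORTED[variant]
    return entry["seconds"] / entry["steps"]


def speedup(baseline: str, fast: str) -> float:
    """How many times faster `fast` is than `baseline`, end to end."""
    return REPORTED[baseline]["seconds"] / REPORTED[fast]["seconds"]


def images_per_minute(variant: str) -> float:
    """Sequential throughput implied by the reported per-image latency."""
    return 60.0 / REPORTED[variant]["seconds"]


if __name__ == "__main__":
    for name in REPORTED:
        print(
            f"{name}: {seconds_per_step(name):.4f} s/step, "
            f"{images_per_minute(name):.1f} images/min"
        )
    print(f"Lens-Turbo speedup over Lens: {speedup('Lens', 'Lens-Turbo'):.2f}x")
```

On these numbers Lens-Turbo is 3.75 times faster end to end, although it uses 5 times fewer steps. Its average time per step is higher (0.21 s versus about 0.16 s), which would be consistent with fixed per-image costs such as text encoding and VAE decoding being included in the totals, though the report does not break the timings down.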
By combining high-data-density training with ultra-fast inference, Lens significantly lowers the barrier for individual developers and academic researchers to deploy and reproduce large-scale Diffusion Transformer models. (Source: BlockBeats)