According to ME News, on May 25 (UTC+8), Beating Monitor reported that Microsoft has open-sourced the Lens series, a 3.8B-parameter text-to-image foundation model. Lens trains far more efficiently than mainstream 6B-class models while matching or surpassing their performance. Measured in compute normalized by peak BF16 TFLOPS (excluding caption regeneration costs), training consumed only about 19.3% of the compute required by Z-Image from Alibaba's Tongyi Lab.

Optimizations in both data and architecture drive this cost reduction. The training dataset, Lens-800M, contains 800 million image-text pairs. Unlike traditional short-text annotations, the text for every sample was generated by GPT-4.1, with an average prompt length of 109 words, giving it very high semantic density. The model architecture uses 48 MMDiT blocks and the FLUX.2 semantic VAE. Text features come from GPT-OSS, with representations from layers 4, 12, 18, and 24 concatenated to improve prompt adherence and multilingual generalization.

Microsoft has released three weight variants for different deployment environments. The default Lens model is fine-tuned with reinforcement learning (RL) and generates a 1024x1024 image in 3.15 seconds over 20 steps on a single NVIDIA H100 GPU. The distilled variant, Lens-Turbo, completes inference in just 4 steps, producing an image at the same resolution in 0.84 seconds. Lens-Base is the pure base model without RL or distillation and defaults to 50 generation steps. The entire series natively supports arbitrary aspect ratios from 1:2 to 2:1 and resolutions up to 1440x1440. The model weights are available on Hugging Face in Safetensors and Diffusers formats under the MIT license, and the inference code has been published on GitHub at the same time.
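The multi-layer text conditioning mentioned above can be pictured with a toy sketch. This is not Microsoft's implementation: the report only says that features from layers 4, 12, 18, and 24 are concatenated, so the choice of the feature axis, the 0-based layer indexing, and the toy sizes (25 layers, 3 tokens, hidden size 2) are all assumptions made here for illustration, and `concat_layer_features` is a hypothetical helper name.

```python
from typing import List, Sequence

Vector = List[float]

# Layer indices named in the report; treating them as 0-based is an assumption.
SELECTED_LAYERS = (4, 12, 18, 24)


def concat_layer_features(
    hidden_states: List[List[Vector]],
    layer_ids: Sequence[int] = SELECTED_LAYERS,
) -> List[Vector]:
    """Concatenate per-token features from the chosen layers along the feature axis.

    hidden_states[l][t] is the feature vector of token t at layer l.
    """
    selected = [hidden_states[l] for l in layer_ids]
    num_tokens = len(selected[0])
    return [
        [value for layer in selected for value in layer[t]]
        for t in range(num_tokens)
    ]


# Toy encoder output: 25 layers (indices 0..24), 3 tokens, hidden size 2.
# Every vector is filled with its layer index so the result is easy to inspect.
toy_states = [[[float(l)] * 2 for _ in range(3)] for l in range(25)]
features = concat_layer_features(toy_states)

print(len(features), len(features[0]))  # 3 tokens, each with 4 layers * 2 dims = 8 values
print(features[0])
```

Under these assumptions each token ends up with a feature vector four times the encoder's hidden size, so the image model sees shallow and deep representations of the prompt side by side.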
By combining high-data-density training with ultra-fast inference, Lens significantly lowers the barrier for individual developers and academic researchers to deploy and reproduce large-scale Diffusion Transformer models. (Source: BlockBeats)
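To put the reported inference figures in perspective, the short calculation below derives average time per step, throughput, and the end-to-end speedup of Lens-Turbo over the default Lens model. It uses only the numbers quoted in the report (3.15 seconds over 20 steps and 0.84 seconds over 4 steps, for a 1024x1024 image on a single H100). The helper names are illustrative, and dividing total time evenly across steps is a simplifying assumption, since fixed costs such as text encoding and VAE decoding are probably included in the totals.

```python
# Figures as reported: (total seconds, denoising steps) per 1024x1024 image on one H100.
REPORTED = {
    "Lens": (3.15, 20),
    "Lens-Turbo": (0.84, 4),
}


def per_step_seconds(total_seconds: float, steps: int) -> float:
    """Average wall-clock time per step, with any fixed overhead spread evenly."""
    return total_seconds / steps


def images_per_minute(total_seconds: float) -> float:
    """Sequential throughput for one GPU generating one image at a time."""
    return 60.0 / total_seconds


for name, (seconds, steps) in REPORTED.items():
    print(
        f"{name}: {per_step_seconds(seconds, steps):.4f} s/step, "
        f"{images_per_minute(seconds):.1f} images/min"
    )

speedup = REPORTED["Lens"][0] / REPORTED["Lens-Turbo"][0]
print(f"Lens-Turbo end-to-end speedup over Lens: {speedup:.2f}x")
```

The speedup comes out at 3.75x even though the step count drops 5x, which is consistent with some per-image cost that does not shrink with the number of steps.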
Microsoft open-sources the 3.8B text-to-image model Lens with 0.84-second inference

On May 25, Microsoft open-sourced Lens, its 3.8-billion-parameter text-to-image model. The model matches or surpasses mainstream 6B-class models while significantly reducing training costs. Its training dataset, Lens-800M, pairs images with GPT-4.1-generated prompts averaging 109 words. The series supports aspect ratios from 1:2 to 2:1 at resolutions up to 1440x1440, and Lens-Turbo generates 1024x1024 images in just 0.84 seconds. Weights are available on Hugging Face under the MIT license.