Nucleus-Image Open-Sourced with 17B Parameters, 2B Activated per Inference

Summary

On April 16 (UTC+8), Nucleus AI open-sourced the Nucleus-Image text-to-image model under the Apache 2.0 license. The model is a sparse MoE diffusion transformer with 17 billion total parameters, of which only about 2 billion are active during inference, reducing costs. Without any post-training, it matched or outperformed leading closed-source models on three benchmarks.

ME News reports that on April 16 (UTC+8), according to BlockBeats monitoring, the Nucleus AI team released the text-to-image model Nucleus-Image and simultaneously open-sourced the model weights, training code, and training dataset under the Apache 2.0 license, which permits commercial use.

The model uses a sparse Mixture-of-Experts (MoE) diffusion transformer architecture with 17B total parameters distributed across 64 routing experts per layer. Only about 2B parameters are activated during inference, which significantly reduces inference cost compared with dense models of similar scale.

On three standard benchmarks, Nucleus-Image matches or exceeds leading proprietary models. It achieves a GenEval score of 0.87, tying with Qwen Image, and ranks first among all compared models in the spatial positioning subtask (0.85). It scores 88.79 on DPG-Bench, placing first overall. On OneIG-Bench it reaches 0.522, ahead of Google's Imagen4 (0.515) and Recraft V3 (0.502). All of these results come from pure pre-training, without DPO, reinforcement learning, or human preference tuning. Nucleus AI describes the release as "the first fully open-source MoE diffusion model at this quality level."

The training data was scraped from the web at large scale, then filtered, deduplicated, and scored for aesthetics, retaining 700 million images and yielding 1.5 billion text-image pairs. Training proceeded in three stages that progressively raised the resolution from 256 to 1024, for a total of 1.7 million steps.

The text encoder is Qwen3-VL-8B-Instruct, invoked via the diffusers library, and the model caches text key-value (KV) states across denoising steps to further cut inference overhead. For developers who want to run image generation locally, the combination of 17B total parameters with only about 2B activated makes deployment on consumer-grade GPUs feasible; illustrative sketches of the routing and caching ideas follow below.

Fully open-sourcing weights, training code, and dataset together is rare. Most open-source image models release only the weights and keep datasets and training details proprietary, which has been a major bottleneck for reproducible research in text-to-image generation. (Source: BlockBeats)
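
To make the parameter figures concrete, below is a minimal PyTorch sketch of top-k routing in a sparse MoE layer, showing how only the experts selected for each token do any compute. This is an illustrative sketch of the general technique, not Nucleus-Image's actual architecture: the dimensions, the choice of top_k=2, and the class name SparseMoELayer are all hypothetical assumptions.

```python
# Hypothetical sketch of sparse MoE top-k routing; NOT the official Nucleus-Image code.
# With 64 experts per layer and only top_k of them selected per token, most expert
# weights stay idle on any given forward pass, which is why a 17B-parameter model
# can activate only ~2B parameters at inference time.
import torch
import torch.nn as nn
import torch.nn.functional as F

class SparseMoELayer(nn.Module):
    def __init__(self, dim=1024, num_experts=64, top_k=2, hidden=4096):
        super().__init__()
        self.top_k = top_k
        self.router = nn.Linear(dim, num_experts)  # per-token routing scores
        self.experts = nn.ModuleList([
            nn.Sequential(nn.Linear(dim, hidden), nn.GELU(), nn.Linear(hidden, dim))
            for _ in range(num_experts)
        ])

    def forward(self, x):                               # x: (batch, tokens, dim)
        scores = self.router(x)                         # (B, T, num_experts)
        weights, idx = scores.topk(self.top_k, dim=-1)  # keep only the top-k experts per token
        weights = F.softmax(weights, dim=-1)
        out = torch.zeros_like(x)
        for e, expert in enumerate(self.experts):
            mask = (idx == e)                           # (B, T, top_k): tokens routed to expert e
            if not mask.any():
                continue                                # expert not selected: no compute at all
            tok_mask = mask.any(dim=-1)                 # (B, T) tokens that use this expert
            gate = (weights * mask).sum(dim=-1)[tok_mask].unsqueeze(-1)
            out[tok_mask] += gate * expert(x[tok_mask])
        return out

layer = SparseMoELayer()
y = layer(torch.randn(1, 16, 1024))  # only the routed experts run for these 16 tokens
print(y.shape)                       # torch.Size([1, 16, 1024])
```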

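Similarly, the cross-denoising-step text KV caching mentioned above can be sketched as follows. Because the prompt does not change between denoising steps, its cross-attention keys and values only need to be computed once and can then be reused at every step. This is a hypothetical toy implementation under assumed shapes and names, not the official Nucleus-Image code.

```python
# Toy sketch of cross-denoising-step text KV caching; shapes and names are assumptions.
# The text hidden states (here a stand-in for Qwen3-VL-8B-Instruct output) are projected
# to cross-attention keys/values once, then reused by every denoising step.
import torch
import torch.nn as nn

class TextKVCache:
    def __init__(self, dim_text=1024, dim_model=1024):
        self.k_proj = nn.Linear(dim_text, dim_model)  # cross-attention key projection
        self.v_proj = nn.Linear(dim_text, dim_model)  # cross-attention value projection
        self._kv = None

    @torch.no_grad()
    def get(self, text_hidden):                       # text_hidden: (1, seq, dim_text)
        if self._kv is None:                          # computed on the first step only
            self._kv = (self.k_proj(text_hidden), self.v_proj(text_hidden))
        return self._kv                               # cache hit on every later step

# Toy denoising loop: the text KV pair is built once and reused 50 times.
text_hidden = torch.randn(1, 77, 1024)                # stand-in for the text encoder output
cache = TextKVCache()
latents = torch.randn(1, 16, 1024)
for step in range(50):
    k, v = cache.get(text_hidden)                     # no recomputation after step 0
    attn = torch.softmax(latents @ k.transpose(-1, -2) / 32.0, dim=-1) @ v
    latents = latents - 0.01 * attn                   # stand-in for the real denoising update
```
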
Disclaimer: The information on this page may have been obtained from third parties and does not necessarily reflect the views or opinions of KuCoin. This content is provided for general informational purposes only, without any representation or warranty of any kind, nor shall it be construed as financial or investment advice. KuCoin shall not be liable for any errors or omissions, or for any outcomes resulting from the use of this information. Investments in digital assets can be risky. Please carefully evaluate the risks of a product and your risk tolerance based on your own financial circumstances. For more information, please refer to our Terms of Use and Risk Disclosure.