Xiaomi open-sources ControlFoley, a video sound effects generation framework

ME News report, May 29 (UTC+8): According to monitoring by Beating, Xiaomi’s Large Model Applications team has released and open-sourced ControlFoley, a framework for generating sound effects for video. Until now, AI sound generation for video has mostly relied on the model inferring sounds from the visuals alone, which leaves creators little precise control over audio style. ControlFoley emphasizes “controllability”: it generates audio from the video content, and it also accepts a text description or a reference audio clip so that the output matches the creator’s intent. For example, it can turn a knocking sound into a “metal strike” or apply a drum timbre to a tennis ball impact, keeping the audio synchronized with the picture while following the specified style. At its core, ControlFoley uses a spatiotemporal audiovisual encoder based on CAV-MAE together with a “time-timbre decoupling” strategy, in which timing is taken from the video and timbre is taken from the reference audio. In the multi-task evaluations defined in the paper, ControlFoley achieves state-of-the-art (SOTA) performance among open-source models on standard video sound generation benchmarks. Even when the text instruction strongly conflicts with the visual content, the model still balances text adherence against temporal synchronization. Against the commercial closed-source system Kling-Foley, ControlFoley is competitive on multiple metrics, including semantic alignment, synchronization, and perceptual quality; however, it still lags slightly on certain KL-divergence-based metrics on Kling-Audio-Eval and MovieGen-Audio-Bench. The project’s technical report, code, model weights, and demo are all publicly available. (Source: BlockBeats)