ME News report, May 29 (UTC+8): According to monitoring by Beating, Xiaomi's Large Model Applications team has released and open-sourced ControlFoley, a framework for generating sound effects for video. Until now, AI sound generation for video has relied mainly on models inferring sounds from the visuals, which made it hard for creators to control audio style precisely. ControlFoley emphasizes "controllability": it generates audio from the video content and can also accept a text description or a reference audio clip, so the output follows the creator's intent. For example, it can turn a knocking sound into a "metal strike" or apply a drum timbre to a tennis ball impact, while keeping the audio synchronized with the picture and faithful to the specified style. At its core, ControlFoley uses a spatiotemporal audiovisual encoder based on CAV-MAE together with a "time-timbre decoupling" strategy, in which timing is taken from the video and timbre style from the reference audio. In the multi-task evaluations defined in the paper, ControlFoley achieves state-of-the-art (SOTA) results among open-source models on standard video sound generation benchmarks. Even when the text instruction strongly conflicts with the visual content, the model still balances text adherence against temporal synchronization. Compared with the closed-source commercial system Kling-Foley, ControlFoley is competitive on several metrics, including semantic alignment, synchronization, and perceptual quality, but it still trails slightly on some KL-divergence metrics on Kling-Audio-Eval and MovieGen-Audio-Bench. The project's technical report, code, model weights, and demo are all publicly available. (Source: BlockBeats)
Xiaomi open-sources the video sound generation framework ControlFoley
On May 29, Xiaomi's Large Model Applications team open-sourced the video sound generation framework ControlFoley. The model generates audio from video and accepts text prompts or reference audio to control sound style, using a time-timbre decoupling strategy and an encoder based on CAV-MAE. It achieves state-of-the-art results among open-source models on video sound generation benchmarks, though it trails the closed-source Kling-Foley slightly on some KL-divergence metrics.
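For readers unfamiliar with the KL-divergence metrics mentioned above, the following is a minimal sketch of how such a score can be computed. It assumes the metric compares two class-probability distributions, for example the outputs of an audio classifier on reference and generated audio; the report does not describe the exact procedure, and the function name and sample values here are purely illustrative.

```python
import math


def kl_divergence(p, q, eps=1e-12):
    """Return KL(p || q) in nats for two discrete distributions.

    p and q are sequences of probabilities of equal length.
    eps guards against division by zero when q has a zero entry.
    """
    if len(p) != len(q):
        raise ValueError("distributions must have the same length")
    total = 0.0
    for pi, qi in zip(p, q):
        if pi > 0:  # terms with p_i == 0 contribute nothing
            total += pi * math.log(pi / max(qi, eps))
    return total


# Hypothetical class probabilities (not taken from any benchmark).
reference = [0.7, 0.2, 0.1]
generated = [0.6, 0.3, 0.1]
score = kl_divergence(reference, generated)
print(f"KL(reference || generated) = {score:.4f}")
```

A lower score means the generated audio's distribution is closer to the reference, so "trailing slightly" on this metric corresponds to a somewhat higher value.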