Perplexity AI open-sources pplx-garden to enable high-speed multi-GPU inference

KuCoin Flash
Summary

Perplexity AI has open-sourced pplx-garden, its high-performance inference toolkit for multi-GPU deployments. The toolkit includes fabric-lib, a Rust-based communication library that reduces dependence on NVIDIA's proprietary networking stack and reaches up to 400 Gbps on both NVIDIA ConnectX-7 and AWS EFA network adapters. It features zero-copy data transfer and communication algorithms optimized for Mixture-of-Experts (MoE) models. A companion tokenizer, pplx-unigram, reduces CPU usage during tokenization.
ME AI News: According to monitoring by Beating, AI search company Perplexity AI has officially open-sourced pplx-garden, a high-performance inference infrastructure toolkit that it uses in production. At the core of the project is fabric-lib, an in-house high-performance peer-to-peer communication library written in Rust, also known as TransferEngine. It is designed to break free from NVIDIA's proprietary communication protocol and hardware lock-in, so that developers can run trillion-parameter large models at high speed across heterogeneous multi-GPU clusters without purchasing expensive proprietary network switches.

Traditional distributed large-model inference relies heavily on NVIDIA's proprietary high-speed networking, which leads to very high hardware deployment costs and supply chain lock-in. fabric-lib decouples the software from the hardware: it fully supports NVIDIA ConnectX-7 network cards and natively supports the more cost-effective AWS EFA (Elastic Fabric Adapter) network interfaces, reaching up to 400 Gbps of inter-GPU network bandwidth. Because AWS EFA can deliver data out of order, Perplexity introduced the ImmCounter synchronization mechanism, which enables efficient "zero-copy" data transfer without strict assumptions about packet ordering.

The communication library also includes a data distribution algorithm designed specifically for Mixture-of-Experts (MoE) models. It overlaps GPU data reception with matrix computation to improve compute utilization during the decoding phase.

In production, pplx-garden is reported to deliver two notable engineering benefits. In a disaggregated inference architecture, the network library enables very fast transfer of key-value (KV) caches between prefill and decode nodes. In asynchronous reinforcement learning training, weight synchronization and distribution for trillion-parameter models complete in just 1.3 seconds.
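The order-independent completion idea attributed to ImmCounter above can be illustrated with a small simulation. This is only a sketch under stated assumptions: it assumes that each write of a transfer carries a shared tag (called `imm` here) and that the receiver declares a transfer complete once it has counted the expected number of arrivals for that tag. The class `ImmCounterSketch` and the function `simulate_transfer` are hypothetical names for illustration, not the fabric-lib API, and plain Python buffers stand in for registered GPU memory.

```python
import random
from collections import defaultdict


class ImmCounterSketch:
    """Hypothetical counter-based completion tracker (not the fabric-lib API)."""

    def __init__(self):
        self._expected = {}                # tag -> number of writes in the transfer
        self._received = defaultdict(int)  # tag -> number of writes seen so far

    def expect(self, imm, num_writes):
        """Register how many writes tagged `imm` make up one transfer."""
        self._expected[imm] = num_writes

    def on_write_arrived(self, imm):
        """Count one arrival; return True when the full transfer has landed."""
        self._received[imm] += 1
        return self._received[imm] == self._expected.get(imm)


def simulate_transfer(payload, chunk_size, seed):
    """Deliver `payload` in shuffled chunks and report when completion fires."""
    # Each chunk knows its destination offset, so it can be placed directly
    # into the final buffer without a staging copy or a reordering step.
    chunks = [
        (offset, payload[offset:offset + chunk_size])
        for offset in range(0, len(payload), chunk_size)
    ]
    random.Random(seed).shuffle(chunks)    # model out-of-order delivery

    dest = bytearray(len(payload))
    counter = ImmCounterSketch()
    imm = 7                                # arbitrary tag chosen for this transfer
    counter.expect(imm, len(chunks))

    completed_at = None
    for arrival_index, (offset, data) in enumerate(chunks, start=1):
        dest[offset:offset + len(data)] = data   # write lands in place
        if counter.on_write_arrived(imm):
            completed_at = arrival_index         # fires on the last arrival only

    return bytes(dest), completed_at, len(chunks)


if __name__ == "__main__":
    payload = bytes(range(256)) * 4
    dest, completed_at, total = simulate_transfer(payload, chunk_size=100, seed=1)
    print(dest == payload, completed_at, total)
```

The point of the design is that the receiver never inspects arrival order: correctness depends only on each chunk knowing its own destination offset and on the count reaching the expected total.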
To reduce tokenization latency, pplx-garden also includes pplx-unigram, a tokenizer reimplemented in Rust that cuts CPU consumption by a factor of 5 to 6, removing the tokenization bottleneck for reranking and embedding models. (Source: BlockBeats)