Perplexity AI open-sources pplx-garden to enable high-speed multi-GPU inference

ME AI News: According to monitoring by Beating, AI search engine company Perplexity AI has open-sourced pplx-garden, the high-performance inference infrastructure toolkit it runs in production. At the core of the project is fabric-lib (also known as TransferEngine), an in-house, high-performance peer-to-peer communication library written in Rust. It is designed to reduce dependence on NVIDIA's proprietary communication stack and the hardware lock-in that comes with it, so that developers can run trillion-parameter models at high speed across heterogeneous multi-GPU clusters without buying expensive proprietary network switches. Traditional distributed inference for large models relies heavily on NVIDIA's proprietary high-speed networking, which drives up hardware deployment costs and ties operators to a single supply chain. fabric-lib decouples the software from that hardware: it supports NVIDIA ConnectX-7 NICs and also natively supports the lower-cost, Ethernet-based AWS EFA (Elastic Fabric Adapter) NICs, reaching inter-GPU network bandwidth of up to 400 Gbps.
AWS EFA does not guarantee in-order delivery. To work within that constraint, Perplexity introduced the ImmCounter synchronization mechanism, which enables efficient zero-copy data transfer without assuming any particular packet ordering. The library also includes a data dispatch algorithm designed for Mixture-of-Experts (MoE) models that overlaps GPU data reception with matrix computation, improving compute utilization during the decoding phase. In production, pplx-garden delivers two notable benefits: in a disaggregated inference architecture, the network library enables fast transfer of key-value (KV) caches between prefill and decode nodes; in asynchronous reinforcement learning training, weight synchronization and distribution for trillion-parameter models completes in just 1.3 seconds.
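The idea behind counter-based synchronization can be illustrated with a toy model. The sketch below is not fabric-lib's API; the class name `ImmCounterSketch`, its methods, and the immediate value `7` are all hypothetical. It assumes that each write carries a small immediate value and that the receiver only counts how many writes with that value have completed, so the order in which they land does not matter.

```python
import random
from collections import defaultdict


class ImmCounterSketch:
    """Toy model (hypothetical): count completed writes per immediate value."""

    def __init__(self):
        self._counts = defaultdict(int)   # immediate value -> completions seen
        self._expected = {}               # immediate value -> completions required
        self._callbacks = {}              # immediate value -> function to run

    def expect(self, imm, count, callback):
        """Register interest: run `callback` once `count` writes tagged `imm` landed."""
        self._expected[imm] = count
        self._callbacks[imm] = callback
        self._maybe_fire(imm)

    def on_write_complete(self, imm):
        """Called once per completed write, in whatever order writes finish."""
        self._counts[imm] += 1
        self._maybe_fire(imm)

    def _maybe_fire(self, imm):
        expected = self._expected.get(imm)
        if expected is not None and self._counts[imm] >= expected:
            self._counts[imm] -= expected
            del self._expected[imm]
            self._callbacks.pop(imm)()


def simulate(num_chunks=8, seed=0):
    """Deliver chunks in shuffled order; the callback fires only after the last one."""
    buffer = [None] * num_chunks          # stands in for the destination memory
    done = []
    counter = ImmCounterSketch()
    counter.expect(imm=7, count=num_chunks,
                   callback=lambda: done.append(list(buffer)))

    order = list(range(num_chunks))
    random.Random(seed).shuffle(order)    # simulate out-of-order arrival
    for idx in order:
        buffer[idx] = f"chunk-{idx}"      # each write lands at its final offset
        counter.on_write_complete(imm=7)
    return order, done
```

Because each chunk is written straight to its final offset and completion is detected by count alone, the receiver needs neither a reordering buffer nor an extra copy, which is the property the ImmCounter description relies on.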
To reduce CPU overhead during tokenization, pplx-garden also open-sources pplx-unigram, a tokenizer reimplemented in Rust that cuts CPU usage by a factor of 5 to 6, removing the tokenization bottleneck for reranking and embedding models. (Source: BlockBeats)
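For readers unfamiliar with the kind of work a tokenizer does on the CPU, here is a minimal sketch of Unigram-style segmentation. It assumes, from the name alone, that pplx-unigram implements Unigram language-model tokenization; the function `unigram_tokenize`, the toy vocabulary, and the unknown-character penalty are all hypothetical and are not taken from the project. The segmentation is found with a Viterbi search over piece log-probabilities.

```python
import math


def unigram_tokenize(text, log_probs, unk_log_prob=-20.0):
    """Return the highest-scoring split of `text` into vocabulary pieces.

    `log_probs` maps each known piece to its log-probability. Single characters
    missing from the vocabulary are allowed at a fixed penalty so that every
    input can be segmented.
    """
    n = len(text)
    max_len = max((len(piece) for piece in log_probs), default=1)
    best = [-math.inf] * (n + 1)   # best[i]: best score for text[:i]
    back = [0] * (n + 1)           # back[i]: start index of the last piece
    best[0] = 0.0

    for end in range(1, n + 1):
        for start in range(max(0, end - max_len), end):
            piece = text[start:end]
            score = log_probs.get(piece)
            if score is None:
                if end - start != 1:
                    continue       # unknown multi-character piece: skip
                score = unk_log_prob
            candidate = best[start] + score
            if candidate > best[end]:
                best[end] = candidate
                back[end] = start

    # Walk the back-pointers from the end to recover the pieces.
    pieces = []
    end = n
    while end > 0:
        start = back[end]
        pieces.append(text[start:end])
        end = start
    return pieces[::-1]


# Hypothetical toy vocabulary with made-up log-probabilities.
TOY_VOCAB = {"token": -2.0, "izer": -3.0, "tok": -4.0,
             "en": -3.5, "iz": -4.0, "er": -3.0}
```

With this toy vocabulary, `unigram_tokenize("tokenizer", TOY_VOCAB)` picks `["token", "izer"]` because that split has the highest total log-probability. The nested loop runs for every input string, which is why tokenization can become a CPU cost worth optimizing for small, fast models.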