This article covers a new paper, "CODA: Rewriting Transformer Blocks as GEMM-Epilogue Programs," which aims to make Transformer training more efficient by targeting memory-bound operations that look scattered but add up to a substantial share of runtime.

Article author and source: MachineHeart

On May 22, Tri Dao shared a tweet from Han Guo on social media, adding: “After some mathematical rewriting, it turns out that everything in the Transformer is a series of GEMM + epilogue (matrix multiplication plus epilogue). With some optimized primitives, LLMs (and newcomers) can write lightning-fast kernels for all Transformer operations!”

Tri Dao is one of the core authors of the FlashAttention series, and this tweet points to their paper, CODA, released that day.

Paper Title: CODA: Rewriting Transformer Blocks as GEMM-Epilogue Programs
Paper URL: https://arxiv.org/abs/2605.19269
Code address: https://github.com/HanGuo97/coda-kernels

The name reads as “coda” and sounds much like “CUDA.” Researchers from MIT, Princeton, Together AI, and Meta aim to use a new programming abstraction to systematically eliminate the fragmented computations in Transformer training that are rarely noticed but cost real time.

The "laziness tax" for training large models

To understand what CODA is solving, first understand where the time goes in large model training.

When training a 1B-parameter LLaMA-3-style model on a single NVIDIA H100, most people would assume that the time goes mainly to matrix multiplications and attention, since those are the “real” calculations. This intuition is largely correct: matrix multiplication (GEMM) and attention do account for most of the computational workload.

But if you open the profiler and look closely, you’ll notice a group of smaller operators quietly consuming time: normalization (RMSNorm), gated activations (SwiGLU), rotary position embeddings (RoPE), residual addition, and cross-layer reductions. Individually, their computational load is modest, but they frequently move large intermediate tensors in and out of GPU memory.

This is what’s known as a "memory bandwidth bottleneck": imagine a master chef who must fetch ingredients from a distant warehouse and return them after each dish, rather than having them readily available on the countertop. No matter how fast the chef’s hands move, the time spent waiting for transportation is real waste.

Worse still, as NVIDIA’s low-precision formats like FP8 and FP4 accelerate matrix computations, the relative cost of these “data movement” operations is increasing: matrix multiplication is speeding up, but the costs of transferring tensors in and out are not decreasing proportionally.
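A back-of-the-envelope roofline estimate makes this shift concrete. The sketch below is illustrative only: the peak figures are approximate, assumed numbers for an H100 SXM, the tensor shape is hypothetical, and none of these values comes from the paper. Real kernels reach only a fraction of peak, and all tensors are kept at 2 bytes per element for simplicity.

```python
# Assumed, approximate H100 SXM peak figures (not taken from the paper).
BF16_FLOPS = 989e12   # dense BF16 FLOP/s
FP8_FLOPS = 1979e12   # dense FP8 FLOP/s
HBM_BW = 3.35e12      # HBM bandwidth in bytes/s
ELEM_BYTES = 2        # simplification: every tensor stored as 2-byte values


def gemm_time(m, k, n, peak_flops):
    """Ideal lower bound for an (m x k) @ (k x n) GEMM: the slower of
    compute time and memory-traffic time."""
    flops = 2 * m * k * n
    traffic = ELEM_BYTES * (m * k + k * n + m * n)
    return max(flops / peak_flops, traffic / HBM_BW)


def streaming_time(m, n, reads=1, writes=1):
    """Ideal lower bound for a memory-bound operator that reads and
    writes an (m x n) tensor once each (e.g. a standalone norm pass)."""
    return ELEM_BYTES * m * n * (reads + writes) / HBM_BW


# Hypothetical shape: 16384 tokens, hidden size 2048.
tokens, hidden = 16384, 2048
t_norm = streaming_time(tokens, hidden)
ratio_bf16 = t_norm / gemm_time(tokens, hidden, hidden, BF16_FLOPS)
ratio_fp8 = t_norm / gemm_time(tokens, hidden, hidden, FP8_FLOPS)
```

Under these assumptions, one standalone normalization pass costs roughly 0.3 of the adjacent GEMM's ideal time in BF16, and roughly twice that share once the GEMM runs at the FP8 rate, because the memory traffic of the small operator does not shrink with it.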

One measurement in the paper makes this especially clear: when training a 1B-parameter model with TorchTitan on an H100, non-matrix-multiplication operations account for a significant portion of the end-to-end runtime, and that share grows further once FP8 precision is introduced.

Existing programming frameworks do little to address this. PyTorch expresses Transformer computations as a sequence of operators with clear boundaries between them. While these boundaries are ideal for automatic differentiation (autograd), they are precisely what prevents cross-operator fusion: each operator boundary often corresponds to an unnecessary write-back to memory.

CODA: A hidden treasure in the epilogue

CODA's starting point is a simple observation.

On a GPU, a high-performance matrix multiplication (GEMM) kernel is structurally divided into two parts: the mainloop, which performs the core blocked matrix multiply-accumulate operations, and the epilogue, which handles final tasks such as bias addition, type conversion, and simple scaling before writing the results back to memory.

The epilogue matters because the output of the matrix multiplication is still "alive" in on-chip registers and has not yet been written to global GPU memory. This is a brief golden window: if additional computation can be done at this moment, an entire round trip of writing to and reading from global memory is avoided.

CODA's core insight is that many memory-intensive operations in Transformers can be algebraically reparameterized and executed within this epilogue window.
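The mainloop/epilogue split can be sketched in plain Python. This is a conceptual model only, not CODA's or CUTLASS's API: the function `gemm_with_epilogue` and the `epilogue(acc, i0, j0)` callback signature are hypothetical names chosen for illustration. A local list stands in for on-chip registers and the output matrix stands in for global memory.

```python
def gemm_with_epilogue(A, B, tile_m, tile_n, epilogue):
    """Tiled C = epilogue(A @ B).

    Each output tile is accumulated in a local buffer (standing in for
    registers) and passed to `epilogue` before the single write to C
    (standing in for global memory)."""
    M, K, N = len(A), len(B), len(B[0])
    C = [[0.0] * N for _ in range(M)]
    for i0 in range(0, M, tile_m):
        for j0 in range(0, N, tile_n):
            rows = range(i0, min(i0 + tile_m, M))
            cols = range(j0, min(j0 + tile_n, N))
            # Mainloop: multiply-accumulate over the K dimension.
            acc = [[sum(A[i][k] * B[k][j] for k in range(K)) for j in cols]
                   for i in rows]
            # Epilogue: extra work while the tile is still "on chip".
            out = epilogue(acc, i0, j0)
            # One write-back per tile.
            for di, i in enumerate(rows):
                for dj, j in enumerate(cols):
                    C[i][j] = out[di][dj]
    return C


# Example epilogue: bias addition followed by ReLU, fused into the GEMM.
bias = [0.5, -10.0, 0.0]

def bias_relu(acc, i0, j0):
    return [[max(v + bias[j0 + dj], 0.0) for dj, v in enumerate(row)]
            for row in acc]

A = [[1.0, 2.0], [3.0, 4.0]]
B = [[1.0, 0.0, -1.0], [0.0, 1.0, 1.0]]
C = gemm_with_epilogue(A, B, tile_m=1, tile_n=2, epilogue=bias_relu)
# C == [[1.5, 0.0, 1.0], [3.5, 0.0, 1.0]]
```

Without the fused epilogue, the bias addition and ReLU would each read the full output back from memory and write it out again.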

This requires some mathematical finesse. Taking the most common GEMM-RMSNorm-GEMM pattern as an example: the result of one matrix multiplication undergoes residual addition, RMS normalization, and then another matrix multiplication. Traditionally, these three separate operators execute sequentially, with intermediate results written to GPU memory twice.

The CODA team found that in RMS normalization, the row scaling factor r is a single scalar per row, so it commutes with the subsequent matrix multiplication: scaling a row by r before multiplying gives the same result as scaling the corresponding output row by r afterward. The application of r can therefore be deferred from "before the second GEMM" to "the epilogue of the second GEMM." After this deferral, the epilogue of the first GEMM only needs to compute a local "partial RMS" per tile, which an extremely lightweight auxiliary reduction kernel then consolidates, eliminating the standalone RMSNorm pass entirely.
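The commutation argument can be checked numerically. The following is a minimal sketch, not the paper's kernel: it omits the learnable RMSNorm gain (which, as an assumption of this sketch, could be folded into the rows of the second weight matrix), and the function names and tile size are made up for illustration.

```python
import math


def matmul(X, W):
    return [[sum(x * W[k][j] for k, x in enumerate(row))
             for j in range(len(W[0]))] for row in X]


def rmsnorm_then_gemm(H, W, eps=1e-6):
    """Reference ordering: normalize every row of H, then multiply by W."""
    normed = []
    for row in H:
        r = 1.0 / math.sqrt(sum(v * v for v in row) / len(row) + eps)
        normed.append([v * r for v in row])
    return matmul(normed, W)


def gemm_then_deferred_scale(H, W, tile=2, eps=1e-6):
    """Reparameterized ordering: partial sums of squares per column tile,
    a tiny reduction, and the row scale applied after the second GEMM."""
    d = len(H[0])
    # What the first GEMM's epilogue would emit: one partial sum per tile.
    partial = [[sum(v * v for v in row[j0:j0 + tile])
                for j0 in range(0, d, tile)] for row in H]
    # Lightweight auxiliary reduction: combine partials into r per row.
    r = [1.0 / math.sqrt(sum(p) / d + eps) for p in partial]
    # The second GEMM consumes the un-normalized H; its epilogue applies r.
    Y = matmul(H, W)
    return [[r[i] * y for y in row] for i, row in enumerate(Y)]


H = [[1.0, -2.0, 3.0, 0.5], [0.1, 0.2, -0.3, 4.0]]
W = [[0.5, 1.0], [-1.0, 2.0], [0.25, 0.0], [1.5, -0.5]]
ref = rmsnorm_then_gemm(H, W)
fused = gemm_then_deferred_scale(H, W)
max_diff = max(abs(a - b) for ra, rb in zip(ref, fused) for a, b in zip(ra, rb))
```

The two orderings agree up to floating-point rounding, which is why the normalized tensor never needs to be materialized in memory.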

Similar reparameterizations apply to operations such as SwiGLU, RoPE (Rotary Position Embedding), and cross-entropy loss, and even hold for backpropagation. The paper includes a theorem proving that as long as the forward pass is "block-local," backpropagation automatically inherits the same structure. Please refer to the original paper for details.
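As one illustration of how an activation can move into an epilogue, SwiGLU can be written as a single GEMM followed by an element-wise step. The sketch below assumes a hypothetical layout in which the gate and up projection columns are interleaved, so each gate value sits next to its matching up value in the output tile. This layout is an assumption for illustration; the article does not say how CODA arranges the weights.

```python
import math


def silu(x):
    return x / (1.0 + math.exp(-x))


def matmul(X, W):
    return [[sum(x * W[k][j] for k, x in enumerate(row))
             for j in range(len(W[0]))] for row in X]


def swiglu_reference(X, Wg, Wu):
    """Unfused: two GEMMs, then a separate element-wise operator."""
    G, U = matmul(X, Wg), matmul(X, Wu)
    return [[silu(g) * u for g, u in zip(g_row, u_row)]
            for g_row, u_row in zip(G, U)]


def swiglu_as_epilogue(X, Wg, Wu):
    """Single GEMM over interleaved columns [g0, u0, g1, u1, ...];
    the epilogue combines each adjacent (gate, up) pair."""
    W = [[w for pair in zip(g_row, u_row) for w in pair]
         for g_row, u_row in zip(Wg, Wu)]
    Z = matmul(X, W)
    # Epilogue: element-wise on values that are still in the output tile.
    return [[silu(row[j]) * row[j + 1] for j in range(0, len(row), 2)]
            for row in Z]


X = [[1.0, -0.5], [0.25, 2.0]]
Wg = [[0.5, -1.0, 0.0], [1.0, 0.5, 2.0]]
Wu = [[1.0, 0.0, -0.5], [0.5, 1.5, 1.0]]
ref = swiglu_reference(X, Wg, Wu)
fused = swiglu_as_epilogue(X, Wg, Wu)
max_diff = max(abs(a - b) for ra, rb in zip(ref, fused) for a, b in zip(ra, rb))
```

The fused version writes only the final activation; the separate gate and up tensors are never stored.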

Five "building blocks" and a Lego-like language

CODA is not a specific fusion kernel, but a set of programming abstractions.

It freezes the expert-optimized GEMM mainloop and exposes five composable primitive operations in the epilogue:

Element-wise operations (residual addition, activation function, RoPE)
Vector loading and storage (broadcasting RMSNorm weights)
Block loading and storage of matrices (saving intermediate activations for backpropagation)
Block reduction (local RMS, block log-sum-exp)
State transitions (max and sum-exp statistics required for online normalization)

Using these five types of building blocks, nearly all operations in a standard Transformer’s forward and backward passes—aside from attention—can be covered.
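The last two building blocks, block reduction and state transition, can be illustrated with the standard online log-sum-exp recurrence used for cross-entropy over a large vocabulary. This is a generic sketch of that well-known technique rather than CODA's primitive API; the function names are hypothetical.

```python
import math


def block_state(block):
    """Block reduction: summarize one tile of logits as (max, sum_exp)."""
    m = max(block)
    return m, sum(math.exp(v - m) for v in block)


def merge(state_a, state_b):
    """State transition: combine two (max, sum_exp) summaries by
    rescaling both sums to the shared maximum."""
    m_a, s_a = state_a
    m_b, s_b = state_b
    m = max(m_a, m_b)
    return m, s_a * math.exp(m_a - m) + s_b * math.exp(m_b - m)


def blockwise_logsumexp(logits, block):
    """Log-sum-exp computed tile by tile, never holding all exponentials."""
    state = (-math.inf, 0.0)
    for j0 in range(0, len(logits), block):
        state = merge(state, block_state(logits[j0:j0 + block]))
    m, s = state
    return m + math.log(s)


def cross_entropy(logits, target, block=2):
    """Cross-entropy loss for one row: logsumexp(logits) - logits[target]."""
    return blockwise_logsumexp(logits, block) - logits[target]


logits = [1.0, 2.0, 3.0, 4.0, 5.0]
direct = math.log(sum(math.exp(v) for v in logits))
tiled = blockwise_logsumexp(logits, block=2)
big = blockwise_logsumexp([1000.0, 1001.0], block=1)  # stays finite
```

Because each tile contributes only a two-number state, the full logits row does not have to be written out and re-read to compute the loss.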

More interestingly, the abstraction is largely indifferent to who writes the code. The paper evaluates two implementation approaches: one written by human programmers and one generated by Claude Code. Given CODA’s primitive specifications, several examples, and implementation logs, the AI generated most of the core code with minimal human oversight.

Both approaches achieved high performance. Tri Dao’s remark in his tweet that LLMs and newcomers “can write lightning-fast kernels” directly reflects the paper’s experimental results.

Experimental results

CODA's benchmarks are run against demanding baselines: cuBLAS with torch.compile, as well as Liger Kernel and FlashInfer, both optimized for LLMs.

For each kernel, the paper evaluates two implementations: CODA (LLM), generated by Claude Code with primitive specifications, several examples, and a continuously updated log of implementation tips provided by researchers, where the AI writes the core code and humans provide light supervision; and CODA (Human), independently written by human programmers using the same high-level reparameterization approach but without relying on the CODA primitive set. Both sets of results are compared against optimized libraries such as cuBLAS + torch.compile, Liger Kernel, and FlashInfer.

At the single-operator level, taking the typical GEMM-RMSNorm-GEMM pattern as an example, CODA outperforms the cuBLAS + PyTorch baseline across hidden dimensions corresponding to the 1B, 7B, and 70B model sizes. Similar improvements are observed for epilogue combinations involving SwiGLU, RoPE, and cross-entropy.

LLM-generated kernels perform comparably to human-written versions on most benchmarks, and in some configurations, even slightly outperform them. This is a rare finding in the field of GPU kernel optimization, which has historically had a very high barrier to entry.

The gains in the backward pass are particularly pronounced: the backward kernel for GEMM-Residual-PartialRMS-GEMM achieves a speedup of 1.6x to 1.8x over the baseline, while the SwiGLU backward pass improves by approximately 1.4x to 1.6x. Here too, the gap between LLM-generated and hand-written implementations is small. This is not surprising: backpropagation inherently touches more intermediate tensors, so epilogue fusion pays off even more, and CODA’s primitives are clear enough for AI models to compose them correctly.

In end-to-end benchmarks of full Transformer layers, CODA achieves forward-pass speedups of approximately 5% to 20% across different scales, with larger gains at larger model sizes (such as the hidden dimension corresponding to a 70B model).

In terms of numerical precision, CODA’s reparameterization changes when the RMSNorm scaling factor is applied, but experiments show that its numerical error is comparable to that of the PyTorch reference implementation. In some configurations it is even smaller, thanks to the higher-precision accumulator in the GEMM mainloop.

What CODA can do: a quick reference

Before diving into the bigger picture, let’s clearly define the scope of CODA’s capabilities.

Coverage: Nearly all computations in the forward and backward passes of a standard Transformer (e.g., the LLaMA architecture) other than attention and word embeddings. This includes RMSNorm, residual addition, SwiGLU activation, RoPE rotary positional encoding, cross-entropy loss, and the backward gradient computations for these operations.
Acceleration: Across hidden dimensions corresponding to 1B through 70B models, individual operators improve by varying degrees over the cuBLAS + torch.compile baseline, with the largest gains in backpropagation (some kernels exceed a 1.6x speedup). End-to-end forward speedups for complete Transformer layers range from 5% to 20%, with larger benefits at larger model sizes.
Implementation: CODA is built on CuTeDSL (a Python DSL for NVIDIA CUTLASS) and supports kernel development by both human programmers and AI models, with both approaches achieving high performance.
Current limitations: Only single-GPU scenarios are supported, with no distributed training. The reparameterizations are designed primarily for standard Transformer architectures, and compatibility with other architectures has not been verified.

Conclusion

CODA is not an isolated effort. It is a concrete implementation of a broader idea: on GPUs, the real optimization opportunities often lie not in "what to compute," but in "how to move data."

FlashAttention brings attention computations into on-chip memory, and CODA seeks to do the same for normalization and activation functions. Triton has lowered the barrier to writing custom kernels, while ThunderKittens, TileLang, and others are further exploring this space at various levels. Together, these efforts point toward a single goal: combining the expressive convenience of PyTorch operator graphs with execution efficiency close to hand-written CUDA, within a single programmable framework.

The last line of Tri Dao’s tweet is worth revisiting: LLMs and newcomers “can write lightning-fast kernels for all Transformer operations.” Underlying this is a deeper logic: when programming abstractions are designed well enough, AI models themselves can participate in optimizing their own training infrastructure. This loop is what makes CODA truly fascinating.

From this perspective, the name "CODA" may carry deeper meaning. In classical music, a coda is the concluding passage that brings a piece to a close. Here, it serves as the "coda" of the GEMM kernel—and crafting this final section may well be the next crucial chapter in improving the efficiency of Transformer training systems.