Apple's PICO AI image compression reduces file size by two-thirds at the same quality.

How far can an image be compressed?

In February 2025, the Joint Photographic Experts Group (JPEG) announced a milestone that the industry quietly celebrated: the official release of JPEG AI, the first international standard for end-to-end learned image coding, after years of development and high expectations.

The news spread, and many researchers shared it on social media with comments like "AI has finally made it into the standards."

The JPEG standard was established in 1992 and has served as the foundational language for human digital images for over three decades. Now, artificial intelligence is beginning to take over and rewrite the grammar of this language.

However, behind the celebration lies a subtle reality: even JPEG AI is still far from achieving true "perceptual compression."

Engineers know that the traditional metric for measuring compression quality, peak signal-to-noise ratio (PSNR), has little correlation with how "pleasing" an image appears to the human eye. An image may score highly on PSNR but still look unremarkable to viewers, while another image with a lower PSNR may appear detailed and realistically textured. Optimizing mathematical metrics is fundamentally different from optimizing human visual perception.
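To make the gap between the metric and the eye concrete, here is a minimal sketch of how PSNR is computed, using only the standard definition of the metric. The pixel values are invented for illustration. Two very different distortions of the same flat gray patch (a uniform brightness shift and a checkerboard of alternating errors) have the same mean squared error, so PSNR cannot tell them apart.

```python
import math

def psnr(original, reconstructed, max_value=255.0):
    """Peak signal-to-noise ratio in dB between two equal-length pixel lists."""
    if len(original) != len(reconstructed):
        raise ValueError("inputs must have the same length")
    mse = sum((a - b) ** 2 for a, b in zip(original, reconstructed)) / len(original)
    if mse == 0:
        return math.inf  # identical signals
    return 10.0 * math.log10(max_value ** 2 / mse)

# Illustrative 8-pixel gray patch (made-up values).
flat = [100] * 8
brighter = [110] * 8                          # every pixel shifted by +10
checker = [110, 90, 110, 90, 110, 90, 110, 90]  # errors alternate +10 / -10

print(round(psnr(flat, brighter), 2))  # 28.13
print(round(psnr(flat, checker), 2))   # 28.13
```

Both distortions score the same, even though one is a smooth shift and the other adds a pattern that was never in the image.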

For decades, the design logic of nearly all codecs—from JPEG to VVC to JPEG AI—has remained trapped within the framework of mathematical metrics. Perceptual compression, which optimizes directly for human visual experience, has long seemed like a distant goal in academic papers rather than an engineering reality that can be integrated into smartphones.

At this critical moment, a team of Apple engineers quietly published a paper presenting their solution, codenamed PICO.

What Matters in Practical Learned Image Compression

Paper URL: https://arxiv.org/pdf/2605.05148

Why is “looking better” much harder than “having a higher number”?

Before understanding PICO, first understand what image compression is actually doing.

Saving a photo as a file is fundamentally a trade-off between what to forget and what to remember. With limited storage space, some information must be discarded while ensuring that viewers notice as little as possible. Different codecs follow different methods of discarding data.

Traditional codecs such as JPEG, AV1, and VVC are rule-based systems hand-engineered by developers. They divide an image into blocks, then apply a transform, quantization, and entropy coding, with each step refined over decades of human expertise. While these systems perform exceptionally well on mathematical metrics like PSNR, their design fundamentally targets "reducing pixel-level error" rather than "reducing visual discomfort to the human eye."
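The transform-then-quantize step can be sketched in a few lines. This is a simplified illustration, not any particular codec: it uses a one-dimensional, 8-sample orthonormal DCT in place of the two-dimensional block transforms real codecs use, a single uniform step size (an assumption), and it omits entropy coding entirely. The sample values are invented.

```python
import math

def _scale(k, n):
    # Orthonormal DCT scaling factor for coefficient index k.
    return math.sqrt(1.0 / n) if k == 0 else math.sqrt(2.0 / n)

def dct(block):
    """Orthonormal DCT-II of a 1-D block."""
    n = len(block)
    return [
        _scale(k, n) * sum(x * math.cos(math.pi * (i + 0.5) * k / n)
                           for i, x in enumerate(block))
        for k in range(n)
    ]

def idct(coeffs):
    """Inverse of dct() (orthonormal DCT-III)."""
    n = len(coeffs)
    return [
        sum(_scale(k, n) * c * math.cos(math.pi * (i + 0.5) * k / n)
            for k, c in enumerate(coeffs))
        for i in range(n)
    ]

def quantize(coeffs, step):
    """Round each coefficient to the nearest multiple of step (the lossy part)."""
    return [round(c / step) for c in coeffs]

def dequantize(levels, step):
    return [q * step for q in levels]

block = [52, 55, 61, 66, 70, 61, 64, 73]   # illustrative pixel row
levels = quantize(dct(block), step=10)     # small integers, cheap to store
restored = idct(dequantize(levels, step=10))
print(levels)
print([round(v) for v in restored])
```

A larger step produces more zero-valued levels and a smaller file, at the cost of a larger pixel-level error. That error is exactly what this family of codecs is tuned to minimize.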

The issue is that the human eye is not a pixel error counter. The human eye’s sensitivity to textures, text, and details is far more complex than any mathematical formula. When you compress a street scene photo to a very small size, the PSNR may still appear acceptable, but you’ll notice blurred building edges and distorted sign text—exactly the things the human eye notices first.

The emergence of learning-based codecs has theoretically opened a new door: neural networks can be trained end-to-end directly on human perception, rather than on mathematical formulas. However, prior to PICO, existing perception-driven learning codecs were either too slow to be practical, lacked cross-device compatibility, or could not flexibly control bitrates—making them impossible to integrate into consumer-grade products.

Three core questions, three solutions

PICO stands for Perceptual Image Codec. The name directly reflects its goal: to satisfy the human eye.

The research team systematically explored millions of model configurations and introduced several key technological innovations.

First question: Entropy coding is slow. What can be done?

Image compression faces a persistent tension: to compress more tightly, the codec needs an "entropy model" that accurately estimates how much information each encoded element (loosely speaking, each pixel) carries. The most accurate approach is autoregressive coding: before coding each element, the model examines the neighbors it has already coded and makes its prediction one step at a time. This is like a chef who checks the state of the pot before adding each ingredient: precise, but extremely slow.
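The trade-off can be illustrated with a toy example. An ideal entropy coder spends about -log2(p) bits on a symbol to which the model assigned probability p, so a model that predicts well from context produces a shorter file. The two models and the symbol sequence below are invented for illustration and are not taken from the paper.

```python
import math

def code_length_bits(symbols, prob_fn):
    """Ideal code length: sum of -log2 p(symbol | symbols coded so far)."""
    total = 0.0
    for i, symbol in enumerate(symbols):
        # Each step needs the history before it, so this loop is inherently serial.
        p = prob_fn(symbols[:i], symbol)
        total += -math.log2(p)
    return total

def context_free_model(history, symbol):
    """Hypothetical model that ignores context: both symbols equally likely."""
    return 0.5

def neighbor_model(history, symbol):
    """Hypothetical autoregressive model: expects the previous symbol to repeat."""
    if not history:
        return 0.5
    return 0.9 if symbol == history[-1] else 0.1

symbols = [0, 0, 0, 0, 1, 1, 1, 1]
print(code_length_bits(symbols, context_free_model))        # 8.0
print(round(code_length_bits(symbols, neighbor_model), 2))  # 5.23
```

The context-aware model saves bits, but only because each prediction waits for the symbols before it. On a multi-megapixel image, that serial dependency is the speed bottleneck described above.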

PICO's solution is the "one-shot context model": it separates out the most critical quantities, the "scale parameters" of the entropy model, and computes them all in a single forward pass with no step-by-step waiting; the remaining parameters can be computed in parallel. This preserves the accuracy of autoregressive models while bypassing their speed bottleneck. In the reported ablation, removing this module degrades model performance by 10.28%, while including it has virtually no impact on speed.

Second question: What can be done when perceptual training causes hallucinations?

Images generated by GANs (generative adversarial networks) often appear "realistic," but they may depict fabricated realities—hair strands turn into nonexistent patterns, and smooth surfaces gain artificial textures. Worse still, the human eye is highly sensitive to text; even a slight distortion in a single letter is immediately noticeable.

The PICO team designed TextFidelityLoss specifically for text: it uses an off-the-shelf text detector to automatically identify text regions in an image, enforces strict pixel-fidelity constraints in those regions, and limits the GAN's freedom to alter text. Experiments show that adding this loss function cuts the absolute error in text regions by half.
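The general idea of a region-weighted fidelity loss can be sketched as follows. This is an illustrative assumption, not the paper's formulation: the function name `text_weighted_l1`, the weight of 4.0, and the hand-written mask (standing in for the output of a text detector) are all hypothetical.

```python
def text_weighted_l1(original, reconstructed, text_mask, text_weight=4.0):
    """Weighted mean absolute error; pixels flagged as text count more.

    original, reconstructed: 2-D lists of pixel values with the same shape.
    text_mask: 2-D list of 0/1 flags, 1 where a detector found text (assumed given).
    text_weight: hypothetical multiplier applied to errors inside text regions.
    """
    total = 0.0
    weight_sum = 0.0
    for row_o, row_r, row_m in zip(original, reconstructed, text_mask):
        for o, r, m in zip(row_o, row_r, row_m):
            w = text_weight if m else 1.0
            total += w * abs(o - r)
            weight_sum += w
    return total / weight_sum

original = [[100, 100], [100, 100]]
mask = [[1, 0], [0, 0]]                  # top-left pixel is "text"
error_in_text = [[110, 100], [100, 100]]
error_elsewhere = [[100, 110], [100, 100]]

print(text_weighted_l1(original, error_in_text, mask))
print(text_weighted_l1(original, error_elsewhere, mask))
```

The same 10-level error costs four times as much when it lands on a text pixel, which pushes training to spend its error budget elsewhere.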

Third question: Image tiling leaves visible block boundaries—what can be done?

To enable fast processing on mobile chips, PICO divides images into 504×504-pixel tiles, processes them individually, and then reassembles them. However, during training, GANs tend to neglect low-frequency color information, often resulting in visible color discrepancies between adjacent tiles—similar to the "poorly stitched" effect seen in photo editing. To address this, the research team introduced TilingArtifactLoss, a multi-resolution L1 loss that enforces color consistency across multiple spatial frequencies. This adjustment reduced errors at tile boundaries by more than half.
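A multi-resolution L1 loss of this general kind can be sketched as below. The details are assumptions made for illustration: the number of levels, the 2×2 average pooling, and equal weighting of the levels are not taken from the paper. The point is that a low-frequency color shift, the kind that makes a tile boundary visible, survives downsampling and is penalized at every level, while zero-mean high-frequency error averages away.

```python
def downsample2x(img):
    """Average-pool a 2-D list by 2 in each dimension (dimensions must be even)."""
    h, w = len(img), len(img[0])
    return [
        [(img[y][x] + img[y][x + 1] + img[y + 1][x] + img[y + 1][x + 1]) / 4.0
         for x in range(0, w, 2)]
        for y in range(0, h, 2)
    ]

def l1(a, b):
    """Mean absolute difference between two same-shaped 2-D lists."""
    n = len(a) * len(a[0])
    return sum(abs(p - q) for ra, rb in zip(a, b) for p, q in zip(ra, rb)) / n

def multires_l1(original, reconstructed, levels=3):
    """Sum of L1 errors at successively halved resolutions.

    Image sides must be divisible by 2 ** (levels - 1).
    """
    total = 0.0
    a, b = original, reconstructed
    for level in range(levels):
        total += l1(a, b)
        if level < levels - 1:
            a, b = downsample2x(a), downsample2x(b)
    return total

original = [[100] * 4 for _ in range(4)]
color_shift = [[104] * 4 for _ in range(4)]   # whole tile slightly off in color
checker = [[100 + (4 if (x + y) % 2 == 0 else -4) for x in range(4)]
           for y in range(4)]                  # zero-mean fine-grained error

print(multires_l1(original, color_shift))  # 12.0
print(multires_l1(original, checker))      # 4.0
```

Both distortions have the same per-pixel L1 error of 4 at full resolution, but the color shift is penalized three times as heavily once the coarser levels are included.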

Experimental results

The Apple team did not rely solely on benchmark metrics. They commissioned the third-party platform Mabyduck to conduct a large-scale human subjective evaluation.

The evaluation used a blind pairwise comparison method: 610 screened evaluators (each of whom passed a color-blindness test and a compression-artifact detection test) compared pairs of reconstructed images produced by different codecs, and the results were aggregated into Bayesian Elo scores. A total of 74,925 pairwise comparisons were collected.
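For readers unfamiliar with Elo-style ranking, the sketch below shows how pairwise votes turn into scores. Note the substitution: this is the classical online Elo update, not the Bayesian Elo aggregation the evaluation used, which fits all comparisons jointly. The codec names, starting rating, K-factor, and votes are hypothetical.

```python
def expected_score(rating_a, rating_b):
    """Probability that A is preferred over B under the Elo model."""
    return 1.0 / (1.0 + 10 ** ((rating_b - rating_a) / 400.0))

def record_result(ratings, winner, loser, k=16.0):
    """Update both ratings in place after one pairwise comparison."""
    surprise = 1.0 - expected_score(ratings[winner], ratings[loser])
    ratings[winner] += k * surprise
    ratings[loser] -= k * surprise

# Hypothetical votes: each tuple is (preferred image's codec, other codec).
ratings = {"codec_a": 1500.0, "codec_b": 1500.0}
votes = [("codec_a", "codec_b"), ("codec_a", "codec_b"), ("codec_b", "codec_a")]
for winner, loser in votes:
    record_result(ratings, winner, loser)

print(ratings)
```

Plotting a codec's score against its bitrate then lets one read off how many bits each codec needs to reach the same perceived quality, which is the basis for "same visual quality" comparisons.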

The numbers speak for themselves: at the same visual quality, PICO's files are roughly one-third to under one-half the size of those produced by AV1, AV2, VVC, ECM, and JPEG AI. In other words, it needs just 30%–43% of the bits these codecs use to store the same image. Compared to today's strongest learned perceptual codecs (such as HiFiC and MRIC), PICO also reduces file sizes by 20%–40%.
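The relationship between "percent of the bits" and "percent saved" is simple arithmetic, worked through below. The 30% and 43% figures come from the text; the 3,000 KB baseline file is a made-up example.

```python
def savings_percent(relative_size_percent):
    """Percent of bits saved when a file is relative_size_percent of the baseline."""
    return 100 - relative_size_percent

def compressed_kb(baseline_kb, relative_size_percent):
    """Size in KB after shrinking a baseline file to the given relative size."""
    return baseline_kb * relative_size_percent / 100

BASELINE_KB = 3000  # hypothetical photo size under a conventional codec

for relative in (30, 43):
    print(f"{relative}% of the bits -> {savings_percent(relative)}% saved, "
          f"{compressed_kb(BASELINE_KB, relative):.0f} KB instead of {BASELINE_KB} KB")
```

A reduction of two-thirds corresponds to the favorable end of this range, since keeping 30% of the bits means saving 70%, while the other end of the range saves 57%.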

In terms of speed, PICO encodes a 12MP photo in just 230 milliseconds and decodes it in only 150 milliseconds on the iPhone 17 Pro Max. Most top ML codecs run slower than this even on NVIDIA V100 server GPUs.
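These timings can be turned into throughput figures with a quick calculation. The 12MP size, the 230 ms and 150 ms timings, and the 504-pixel tile side come from the text. The 4032×3024 frame is an assumption (a common 12MP sensor resolution), as is treating tiles as non-overlapping, so the tile count is only a rough guide.

```python
MEGAPIXELS = 12.0
ENCODE_SECONDS = 0.230
DECODE_SECONDS = 0.150
TILE_SIDE = 504
WIDTH, HEIGHT = 4032, 3024   # assumed 12MP frame; not stated in the article

def throughput_mp_per_s(megapixels, seconds):
    """Megapixels processed per second."""
    return megapixels / seconds

def tile_grid(width, height, tile_side):
    """Tiles per row and per column, assuming non-overlapping tiles."""
    cols = -(-width // tile_side)    # ceiling division
    rows = -(-height // tile_side)
    return cols, rows

cols, rows = tile_grid(WIDTH, HEIGHT, TILE_SIDE)
print(f"encode: {throughput_mp_per_s(MEGAPIXELS, ENCODE_SECONDS):.1f} MP/s")
print(f"decode: {throughput_mp_per_s(MEGAPIXELS, DECODE_SECONDS):.1f} MP/s")
print(f"tiles:  {cols} x {rows} = {cols * rows}")
```

Under these assumptions, the frame divides evenly into 48 tiles, and the encoder sustains roughly 52 megapixels per second.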

Notably, the paper also documents a "counterexample": on the traditional metric PSNR, PICO performs modestly, even lagging behind DCVC-RT and VVC. This precisely confirms the team’s fundamental insight: optimizing perceptual quality and optimizing mathematical metrics are fundamentally two divergent directions—you cannot have both.

A milestone, not an endpoint

PICO also has limitations. The paper acknowledges that for highly regular synthetic images, such as cartoons and diagrams, PICO's compression efficiency is inferior to that of traditional codecs, as such content is naturally suited to rule-driven autoregressive modeling rather than perceptual generation.

But these limitations do not diminish the significance of this work.

Over the past three decades, progress in image compression has largely stayed on the track of "making the numbers look better." From JPEG to HEVC and then to VVC, engineers have continuously optimized metrics such as PSNR and SSIM, while human visual perception has remained a consistently sidestepped "challenge."

PICO is the first to systematically tackle this challenging problem—from architecture search and loss function design to large-scale human subjective evaluations—ultimately packaging it into a codec that can run in real time on a smartphone.

When you next share a photo using an Apple device, you may not notice any difference. But perhaps during that quiet compression process, an algorithm tailored to human visual perception is deciding which details to keep and which to quietly discard.

Team: From WaveOne to Apple

The corresponding author of this paper is Oren Rippel, a researcher at Apple and a familiar figure in the field of compression.

His name first gained widespread attention in 2017, when he was at the startup WaveOne and published a paper titled "Real-Time Adaptive Image Compression," which used neural networks to outperform all mainstream codecs of the time while maintaining real-time performance. The paper caused a significant stir in the academic community and established Rippel's reputation in the field of learned compression.

Subsequently, the same core team continued their work at WaveOne, launching ELF-VC for video compression, achieving a 44% bitrate reduction compared to H.264 on the UVG video test set, while running more than five times faster than other ML-based codecs.

The entire team from WaveOne later joined Apple. The PICO project represents their first systematic effort in perceptual image compression, drawing on Apple's computing power and platform resources.

This article is from the WeChat public account "Machine Heart" (ID: almosthuman2014), authored by Compression as Intelligence.