PPO Was Rejected by NeurIPS 2017, Then Became a Key Algorithm in LLM Training

Rejection does not equal failure.

Article author and source: MachineHeart

That's truly surprising.

The classic algorithm PPO (Proximal Policy Optimization), which later became widely used in RLHF and large model training, was once rejected by NeurIPS 2017.

The story was recently brought up by John Schulman, PPO's lead author, who summed up the episode in a single sentence: PPO was rejected by NeurIPS 2017.

The paper, first published in July 2017, presented PPO as a simpler, more engineering-friendly algorithm for policy optimization. Its goal was to preserve the stability of TRPO (Trust Region Policy Optimization) while reducing implementation complexity, making reinforcement learning training easier to tune and more practical.
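For readers who have not seen it, the clipped objective at the heart of PPO can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the paper's reference code: the function name `ppo_clip_objective` and its arguments are hypothetical, the inputs are plain lists of per-action log-probabilities and advantage estimates, and `epsilon=0.2` is a commonly used clipping value assumed here.

```python
import math


def ppo_clip_objective(new_logp, old_logp, advantages, epsilon=0.2):
    """Mean clipped surrogate objective over a batch of sampled actions.

    new_logp / old_logp: log-probabilities of the same actions under the
    current policy and the policy that collected the data (hypothetical inputs).
    advantages: advantage estimates for those actions.
    """
    total = 0.0
    for lp_new, lp_old, adv in zip(new_logp, old_logp, advantages):
        # Probability ratio between the new and the old policy.
        ratio = math.exp(lp_new - lp_old)
        # Keep the ratio inside [1 - epsilon, 1 + epsilon].
        clipped = max(1.0 - epsilon, min(ratio, 1.0 + epsilon))
        # Take the more pessimistic of the unclipped and clipped terms.
        total += min(ratio * adv, clipped * adv)
    return total / len(advantages)


# One action whose probability rose from 0.4 to 0.6, with a positive advantage:
# the ratio of 1.5 is clipped, so the objective stops rewarding further increases.
value = ppo_clip_objective([math.log(0.6)], [math.log(0.4)], [1.0])
```

Clipping the ratio removes the incentive to move the new policy far from the old one in a single update, which is roughly how PPO approximates TRPO's trust region without a constrained optimization step.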

But years later, what propelled PPO onto a larger stage was not traditional reinforcement learning tasks such as Atari or robotic control, but large language models.

From RLHF to today’s RLVR (reinforcement learning with verifiable rewards), PPO has become one of the essential algorithms in the post-training of large models. According to Schulman, PPO has enjoyed a second wave of popularity in the LLM era, with an impact that exceeds what the original paper anticipated.

This reads less like Schulman complaining about the rejection than like a reflective observation in hindsight: the true impact of a technology often emerges in ways its inventor never anticipated.

At this point, many people naturally wonder: Why was PPO rejected back then?

Schulman later explained that the paper was judged to offer limited novelty and to show no significant improvement over the baseline methods of the time.

One commenter wrote, "This reflects a mismatch between academic evaluation and real-world industry needs. Academia often prioritizes novelty and relative improvements over baselines in small-scale, controlled experiments, while the real world cares more about whether a method can scale up, stay stable in complex systems, and actually work in practice."

Schulman himself was unbothered. He said it was a long time ago and expressed hope that, over the years, the academic community has gradually come to understand and embrace this "simple yet scalable" aesthetic.

What truly surprised him was that the PPO paper and its objective function have continued to have such a lasting impact. It’s often difficult to tell at the outset whether a change to an algorithm will quickly be forgotten and replaced by minor tweaks, or whether it will remain embedded in the system as a foundational component that’s hard to surpass.

And PPO's story precisely illustrates this point.

In fact, PPO is not alone: many influential works in AI history were initially rejected by top conferences.

LSTM: Rejected by NIPS in 1996 for being too complex and lacking biological plausibility, it later became a core technology for sequence modeling tasks such as speech recognition and machine translation.

SIFT: Originally rejected by ICCV 1997 and CVPR 1998 due to its cumbersome engineering steps and lack of elegance, it went on to dominate computer vision for over a decade before the deep learning era.

Dropout: Rejected by NIPS in 2012 for being seen as an engineering hack with insufficient theoretical rigor. It later became one of the most important regularization methods for deep neural networks and received the NeurIPS Test of Time Award.

Sometimes, time is the strictest and fairest judge.