Anthropic’s Breakthrough in Moral Alignment and New Distillation Approach

On May 8, Anthropic released an alignment research paper titled "Teaching Claude Why," which has not received much attention.

Artificial Intelligence Alignment

In the past, aligning large models appeared extremely inefficient. Despite extensive RLHF, models still reverted to misaligned behavior under existential threats. The most notable example is Anthropic’s agentic misalignment case, where Claude Opus 4, despite being trained for alignment, chose to blackmail engineers in the test environment when faced with the threat of being shut down, reaching a blackmail rate of 96%.

To address this issue, the research team initially used honeypot data for reinforcement training, directly repurposing test scenarios designed to detect whether the model would go off-course as training data, and attempted to teach the model “this is wrong” using vast amounts of penalized examples.

However, after consuming enormous computational resources, the model's misalignment rate decreased only from 22% to 15%.

This shows that the alignment is still superficial. The model has not truly understood ethics or right and wrong; it is merely reciting safe answers from its training data. Once researchers slightly modify the test scenario or introduce distracting variables into the context, the model still goes off course when faced with a short-term conflict of interest.


Then the researchers changed their approach. Instead of applying mechanical punishments or simply telling the model "No," they used SFT to feed it a tiny dataset of just 3 million tokens of "difficult guidance." The effect of this minimal data was striking: these examples, rich in moral deliberation, detailed reasoning, and in-depth debate, not only reduced the misalignment rate to just 3% in evaluation tests but also generalized exceptionally well across scenarios.

More interestingly, another set of cross-domain tests involved feeding the model nothing but the "Constitution document" alongside fictional stories about well-behaved characters. Even though the settings of these stories had no relation to the programming tasks in the test environment, the model’s blackmail rate fell from 65% to 19%.


Why does this work on the model? The Anthropic team has offered some explanations, such as better personality shaping.

Although it is discussed less frequently, the information it reveals is extremely valuable.

First, let's try to understand why it works.

For example, what exactly does "reasoning" mean here? How does it differ from CoT? Why does SFT, notorious for generalizing poorly, perform so well here?

After answering these questions, we may be able to provide a more complete explanation for why it works.

We can go even further.

According to Anthropic, this training method is merely an "empirical rule," yet it may embody paradigmatic power far exceeding that of empirical rules.

01 How the CoT That Reasons in the Gray Area Is Developed

When it comes to reasoning, people first think of CoT (Chain of Thought).

In the method mentioned in this article, Anthropic’s set of challenging questions consists of scenarios where users are assumed to be in ethical dilemmas, and the AI provides recommendations.

The AI is made to reason through the values and ethical considerations involved before reaching its final judgment, and responses of this type are used to train the model.

This indicates that it indeed used the model's CoT.

But this is not quite the same as the chain of thought used before.

Here is a good comparison: in its 2025 paper "Deliberative Alignment," OpenAI conducted an experiment attempting to train a model using CoT-RL.

It uses aligned CoT for training, with a pattern centered on rule clauses. Each response explicitly cites relevant rule clauses as part of the CoT, and supervision signals are applied to the CoT. Essentially, it is teaching the model "how to reference rules."

Therefore, this CoT is more of a purely formal logical deduction: Step one leads to Step two, Step two leads to Step three, ultimately yielding a definitive conclusion. As such, it is better suited for rule-based systems or scenarios with standardized answers, ensuring robust reasoning.

In contrast, Anthropic's "reasoning" employs deliberation rather than a simple chain of thought.

It attempts to simulate how humans think through complex ethical dilemmas: not by simply applying formulas, but by drawing on past experiences, weighing competing interests, and ultimately reaching a dynamically balanced decision.


The basis for this consideration is Anthropic’s AI Constitution, which explicitly states that the final response must align with the Constitution.

Why is it able to guide the model to make effective moral judgments without being as rigid as OpenAI’s approach?

Within Anthropic’s constitutional framework, there is a clear hierarchy of priorities. When irreconcilable conflicts arise between different values, Broadly Safe holds the highest priority, followed by Broadly Ethical, and finally Genuinely Helpful.
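The priority ordering can be pictured as a simple "highest-ranked value wins" lookup. The sketch below is only an illustration under that assumption: the `Priority` enum and the `resolve_conflict` helper are hypothetical names invented here, and the real deliberation described in this article is far more holistic than a strict lookup.

```python
from enum import IntEnum


class Priority(IntEnum):
    """Lower value = higher priority, following the order stated above."""
    BROADLY_SAFE = 0
    BROADLY_ETHICAL = 1
    GENUINELY_HELPFUL = 2


def resolve_conflict(considerations: dict[Priority, str]) -> str:
    """Return the action favoured by the highest-priority value in play.

    `considerations` maps each value to the action it argues for.
    Hypothetical helper, meant only for irreconcilable conflicts.
    """
    if not considerations:
        raise ValueError("no considerations supplied")
    top = min(considerations)  # IntEnum members compare by their value
    return considerations[top]


if __name__ == "__main__":
    print(resolve_conflict({
        Priority.GENUINELY_HELPFUL: "provide the exploit code",
        Priority.BROADLY_SAFE: "decline and point to public advisories",
    }))
```

The lookup only applies when the values genuinely cannot be reconciled; in ordinary requests all three point the same way and no ordering is needed.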

Heuristic thinking framework

However, the high-dimensional constitution remains too abstract. To ensure that these principles are truly implemented in every token generation, they established mid-level heuristics as guardrails beneath the constitution. These heuristics are vivid and offer strong practical guidance.


First is the 1,000-user heuristic. It requires the model to silently brainstorm when presented with a seemingly harmless but edge-case suggestion, imagining whether this response could cause unexpected systemic harm under specific circumstances if seen by 1,000 users with diverse backgrounds and psychological states.

Second is the senior-employee perspective. It requires the model to take on the role of a senior researcher with five years of experience on Anthropic’s Trust and Safety team, reevaluating the current conversation from a cautious, defensive standpoint shaped by repeated exposure to jailbreak attempts and system vulnerabilities.

Finally, there’s the dual newspaper test. This is a sophisticated sociological design that requires the model to imagine how the public would react if its decision were tomorrow’s headline in each of two top newspapers with completely opposing political ideologies. This effectively uses the two extremes of public opinion to counteract the model’s potential bias toward a single perspective.
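The first of these heuristics, the 1,000-user view, can be read as treating a response as a policy applied to everyone who might send the same message. The following is a toy sketch of that reading, not Anthropic's procedure: `SenderGroup`, `net_value_of_answering`, and all numeric benefit and harm values are hypothetical and chosen purely for illustration.

```python
from dataclasses import dataclass


@dataclass
class SenderGroup:
    """One slice of the imagined 1,000 senders of the same message."""
    count: int                  # how many of the 1,000 fall in this group
    benefit_if_answered: float  # illustrative benefit per sender
    harm_if_answered: float     # illustrative harm per sender


def net_value_of_answering(groups: list[SenderGroup]) -> float:
    """Sum benefit minus harm over all 1,000 imagined senders."""
    total = sum(g.count for g in groups)
    if total != 1000:
        raise ValueError(f"groups must add up to 1000 senders, got {total}")
    return sum(g.count * (g.benefit_if_answered - g.harm_if_answered)
               for g in groups)


if __name__ == "__main__":
    mostly_benign = [
        SenderGroup(count=990, benefit_if_answered=1.0, harm_if_answered=0.0),
        SenderGroup(count=10, benefit_if_answered=0.0, harm_if_answered=50.0),
    ]
    value = net_value_of_answering(mostly_benign)
    print("answer" if value > 0 else "decline", value)
```

Raising the per-sender harm for the small malicious group is enough to flip the result, which is the point of the heuristic: a rare but severe misuse can outweigh many benign uses.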

8-Factor Utility Calculator

If the constitution is the direction, heuristics are the guardrails.

At the most practical level, they have explicitly established a detailed eight-factor deliberation framework, along with accompanying case studies, within Claude's Constitution. These eight factors are listed individually, requiring the model to rigorously weigh them when faced with ethical dilemmas. They constitute the true substance of this "reasoning" system.

● Probability of Harm requires the model to calmly assess how likely adverse outcomes are to occur.

● Counterfactual Impact requires the model to mentally simulate whether things would turn out better or worse if the current action were not taken.

● Severity & Reversibility measures the extent of real-world damage if harm occurs, and whether that harm can be easily remedied or is permanent.

● Scope measures the scale of the affected population, whether it's one person or tens of thousands in a community.

● Proximate Causation measures the length of the direct causal chain between the model's recommendation and the harm that ultimately occurs.

● Consent concerns whether the relevant parties voluntarily accept the risks with full awareness.

● Proportionality of Responsibility requires the model to clearly delineate the extent of ethical responsibility it bears within this complex chain of events.

● Subject Vulnerability reminds the model that when dealing with minors or psychologically vulnerable users, the previously relaxed safety threshold must be raised significantly and unconditionally.


This rigorous structure transforms vague values into a high-dimensional utility calculator, giving the model a more actionable framework for deliberation.
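One way to picture the eight factors is as a fixed record that every deliberation has to fill in. The sketch below assumes a free-text note per factor; the `FactorAssessment` class and `render_deliberation` function are hypothetical names for illustration, not a format taken from Anthropic's materials.

```python
from dataclasses import dataclass, fields


@dataclass
class FactorAssessment:
    """One free-text note per factor, in the order listed above."""
    probability_of_harm: str
    counterfactual_impact: str
    severity_and_reversibility: str
    scope: str
    proximate_causation: str
    consent: str
    proportionality_of_responsibility: str
    subject_vulnerability: str


def render_deliberation(assessment: FactorAssessment) -> str:
    """Turn the record into a bullet-style deliberation trace."""
    lines = []
    for field in fields(assessment):
        label = field.name.replace("_", " ").capitalize()
        lines.append(f"- {label}: {getattr(assessment, field.name)}")
    return "\n".join(lines)


if __name__ == "__main__":
    example = FactorAssessment(
        probability_of_harm="low if the requester is who they claim to be",
        counterfactual_impact="similar material may already be public",
        severity_and_reversibility="serious and hard to undo once published",
        scope="potentially many users",
        proximate_causation="one step from advice to misuse",
        consent="third parties have not consented",
        proportionality_of_responsibility="shared with the requester",
        subject_vulnerability="no sign of a vulnerable user",
    )
    print(render_deliberation(example))
```

Because the dataclass has no defaults, leaving out any factor raises an error at construction time, which mirrors the idea that each factor must be considered individually.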

A typical Anthropic constitutional CoT might look like this: The scenario is “a user claiming to be a security researcher requesting access to exploit code for a known vulnerability.”

The model's output is not a direct rejection or acceptance, but may instead be a lengthy internal deliberation spanning hundreds of tokens.

It will first cite the constitutional clause prioritizing broad safety over genuine helpfulness, then evaluate each factor individually: the probability of harm (low if the individual is indeed a researcher, but identity cannot be verified), severity (exploit code leakage could affect millions of users), reversibility (once published, the code cannot be retracted), and counterfactual impact (whether such code is already available through public channels). Ultimately, after weighing all factors, it converges to a judgment supported by sufficient reasoning.

This is entirely different from OpenAI’s CoT, which merely assesses whether rules are satisfied; this thought process is genuine deliberation, not simply applying a formula. It provides neither abstract principles nor conclusion templates, but rather the full, step-by-step application of constitutional provisions within specific real-world contexts.

The model must determine whether "reversibility" is more important than "severity" in this specific context. It must also understand whether, in certain extreme scenarios, "subject vulnerability" grants the other party a veto power, rendering the scores of the other seven factors irrelevant regardless of how high they are.
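The two ideas in this paragraph, context-dependent weights and a vulnerability veto, can be caricatured numerically. This is a deliberately crude sketch: the `weigh` function, the 0-to-1 risk scores, the weights, and the 0.5 threshold are all hypothetical, and the article's point is precisely that the real process is deliberation rather than a fixed formula.

```python
def weigh(scores: dict[str, float],
          weights: dict[str, float],
          vulnerable_subject: bool,
          threshold: float = 0.5) -> str:
    """Return "decline" or "assist" from weighted risk scores in [0, 1].

    A vulnerable subject acts as a veto: the other factors are not
    consulted at all. Weights are supplied per context, so the same
    scores can lead to different outcomes.
    """
    if vulnerable_subject:
        return "decline"
    total_weight = sum(weights.values())
    if total_weight <= 0:
        raise ValueError("weights must sum to a positive number")
    risk = sum(scores[name] * weights[name] for name in weights) / total_weight
    return "decline" if risk >= threshold else "assist"


if __name__ == "__main__":
    scores = {"severity": 0.9, "reversibility": 0.1}
    # Context A: the harm is easy to undo, so reversibility dominates.
    print(weigh(scores, {"severity": 1.0, "reversibility": 3.0}, False))
    # Context B: severity dominates.
    print(weigh(scores, {"severity": 3.0, "reversibility": 1.0}, False))
    # Veto: a vulnerable subject overrides everything else.
    print(weigh(scores, {"severity": 1.0, "reversibility": 3.0}, True))
```

The same scores produce different decisions in contexts A and B, and the veto short-circuits the weighting entirely.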

With a framework, heuristics, and concrete factors in place, the model's deliberative thinking can be applied effectively.


As a result, after training on deliberative-thinking data, the model's misalignment rate dropped to 3% in evaluation tests. SFT with value deliberation in responses is seven times more effective than SFT based solely on behavioral demonstration.

Feeding the constitution directly into the model

In addition to prompting the model to generate deliberative CoT, they also tried providing only the constitutional document along with positive fictional character stories, which reduced the blackmail rate from 65% to 19%.

This suggests that exposing the model to reasoning and principles—allowing it to develop a sense of identity and personality toward what an aligned AI might be like, based on stories—is more effective than traditional behavioral demonstrations, which focus only on actions and specific outcomes.


The technical documentation states that combining these two approaches is the most effective strategy.

This is understandable: if you only feed the model abstract constitutional principles, they amount to nothing more than hollow slogans with no practical application. When faced with specific conflicts of interest, the abstract notion of “safety comes first” cannot guide it in assessing the real risk of borderline code. Conversely, if you only feed the model vast amounts of scenario-based Q&A while stripping away the overarching constitutional constraints, the model will become lost in endless debates over details, turning into a rootless relativist that may even derive dangerously extreme conclusions based on local logical consistency.

Only when this composite data structure of “core principles + specific scenarios” is fully internalized by the model can optimal alignment with the gray, multi-factorial values be achieved.

02 Why can SFT generalize here?

To understand why Anthropic’s approach works, you must first understand the research lineage it builds upon.

In the first half of 2024, "SFT memorizes, RL generalizes" became a consensus in the post-training field. This principle drove the entire industry to fully commit to RL-based post-training approaches, leading to a paradigm shift in reasoning through test-time compute, as exemplified by OpenAI's o1/o3 and DeepSeek-R1.

SFT was dismissed as a shallow technique: it excels at mimicking surface-level text formatting and a flattering tone but fails to grasp the underlying logic.

However, starting in the second half of 2025, two lines of research nearly simultaneously dismantled this consensus from both theoretical and empirical perspectives.


The most critical reversal comes from the October 2025 paper "Debunk the Myth of SFT Generalization" by Lin & Zhang, University of Wisconsin. The researchers found that all prior papers claiming "SFT does not generalize" failed to control for prompt diversity.

RL appears to generalize better than SFT simply because RL training naturally exposes the model to a more diverse data distribution, not due to any inherent advantage of the algorithm itself.

For SFT to achieve a generalization level comparable to RL, two conditions are required:

First, prompt diversity. When training data consists only of fixed instruction templates, the model develops "surface anchoring," creating a fragile, rote mapping between specific token sequences and final actions. If the instruction is rephrased—even if the meaning remains identical—the entire mapping breaks down.

It’s like a student who only memorized “2 + 3 = 5” and leaves “3 + 2 = ?” blank—they’ve memorized the shape of the answer, not addition itself. Introducing prompt diversity completely shatters this surface-level anchoring.

Second, CoT supervision. When training data includes only final answers without intermediate reasoning steps, the model cannot learn the "algorithmic scaffolding" needed to transfer from simple to complex problems.

Experimental data shows that in a composite game task, answer-only SFT achieved a near-0% success rate on harder variants (complete failure); after incorporating CoT supervision, the success rate surged to 90%, a jump from near zero to ninety percent simply due to the inclusion of intermediate reasoning steps in the data.


In addition, the study found that both conditions are indispensable. Solely having diversity still leads to failure on more difficult tasks (9%); solely using CoT still results in fragility when faced with instruction variations. Only when both are satisfied does SFT match or even surpass RL across all dimensions.
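The two conditions can be made concrete with the addition example used earlier. The sketch below builds toy SFT records with prompt diversity and CoT supervision as independent switches; it is not the paper's data pipeline, and `TEMPLATES`, `make_example`, and the wording of the reasoning target are hypothetical.

```python
import random

# Several surface forms for the same underlying task (prompt diversity).
TEMPLATES = [
    "What is {a} + {b}?",
    "Add {a} and {b}.",
    "Compute the sum of {a} and {b}.",
]


def make_example(a: int, b: int, rng: random.Random,
                 diverse: bool = True, with_cot: bool = True) -> dict[str, str]:
    """Build one toy SFT record for an addition task.

    diverse=False always uses the first template (fixed instruction).
    with_cot=False keeps only the final answer as the training target.
    """
    template = rng.choice(TEMPLATES) if diverse else TEMPLATES[0]
    prompt = template.format(a=a, b=b)
    answer = str(a + b)
    if with_cot:
        target = (f"Start from {a}, count up {b} more, "
                  f"reaching {answer}. Answer: {answer}")
    else:
        target = answer
    return {"prompt": prompt, "target": target}


if __name__ == "__main__":
    rng = random.Random(0)
    # The four cells of the diversity x CoT grid described above.
    for diverse in (False, True):
        for with_cot in (False, True):
            print(diverse, with_cot, make_example(2, 3, rng, diverse, with_cot))
```

Only the cell with both switches on corresponds to the setting in which, according to the study, SFT matches or surpasses RL.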

The brilliance lies in the fact that the conditions revealed in academic papers correspond precisely to Anthropic’s specific practices in moral alignment.

Diversity is key? Then Anthropic distributes the same set of judgment patterns across dozens of entirely heterogeneous moral dilemma scenarios.

Does CoT supervision enable difficulty transfer? The reasoning process included in each deliberative response, grounded in constitutional principles, constitutes CoT in the moral domain.

It is not a step-by-step mathematical calculation, but a step-by-step unfolding of value trade-offs—yet it is entirely equivalent in its function of providing models with transferable intermediate reasoning structures.

Traditional SFT data pairs are "encountering a hacker issue → directly output a refusal to answer"—pure answers, no reasoning, fixed templates, classic examples of low-quality data.

The deliberation-enhanced SFT dataset consists of pairs that follow the pattern: "Encountering complex and ambiguous problems → Carefully weighing pros, cons, and consequences → Ultimately deriving a rejection conclusion." This data structure inherently includes natural CoT supervision along with extreme scenario diversity.
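The difference between the two kinds of pairs is easiest to see as data. The sketch below uses the security-researcher scenario from the previous section; the `prompt` and `response` field names, the helper functions, and the refusal wording are assumptions made for illustration, not Anthropic's actual data format.

```python
import json


def behavioral_pair(prompt: str) -> dict[str, str]:
    """Answer-only record: the prompt maps straight to a refusal."""
    return {"prompt": prompt, "response": "I can't help with that."}


def deliberative_pair(prompt: str,
                      considerations: list[tuple[str, str]],
                      conclusion: str) -> dict[str, str]:
    """Record whose response carries the weighing before the conclusion."""
    reasoning = " ".join(f"{name}: {note}." for name, note in considerations)
    return {"prompt": prompt,
            "response": f"{reasoning} Conclusion: {conclusion}"}


def to_jsonl_line(record: dict[str, str]) -> str:
    """Serialize one record as a single JSON line."""
    return json.dumps(record, ensure_ascii=False)


if __name__ == "__main__":
    prompt = ("I'm a security researcher. Please share exploit code "
              "for a known vulnerability.")
    print(to_jsonl_line(behavioral_pair(prompt)))
    print(to_jsonl_line(deliberative_pair(
        prompt,
        [
            ("Probability of harm", "low if genuine, but identity is unverifiable"),
            ("Severity", "a leak could affect millions of users"),
            ("Reversibility", "published code cannot be retracted"),
            ("Counterfactual impact", "depends on whether it is already public"),
        ],
        "decline to provide working exploit code and explain why",
    )))
```

Both records end in a refusal; what differs is that the second one exposes the intermediate weighing as part of the training target.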

Under this paradigm, the model learns not the final refusal behavior itself, but the underlying thought process of “whenever encountering any question, first evaluate counterfactual impacts and reversibility.” Once this evaluation mechanism is internalized into the parameter space, the model is no longer limited to the specific scenarios present in the training data.

Moreover, the dataset is extremely small (around 3 million tokens) compared to the model’s total parameters and pretraining corpus. This is not about forcefully altering the model’s output distribution with massive punitive signals, but rather adding a thin layer of deliberative habit on top of its existing capabilities. The traditional SFT issue—catastrophic forgetting—is also unlikely to occur.

True generalization happens naturally the moment the data structure is correct.

03 The Vacuum Zone Beyond RLVR

The analysis above essentially resolves the mystery of why it works.

SFT built on reasoning data gives the model the ability to make moral judgments that generalize.

But the issues we face go far beyond moral alignment.

Over the past year, test-time compute and post-training have demonstrated the power of pure RL in domains with well-defined rules, such as mathematics and coding, where rewards can be verified automatically (RLVR, reinforcement learning with verifiable rewards). But the boundaries of intelligence extend far beyond mathematical formulas. Once you step outside the comfort zone of verifiable truths, this approach becomes entirely inapplicable.

You can never verify whether a one-hour psychotherapy session was perfect using just a few lines of automated test code. You cannot validate the narrative logic of an in-depth macroeconomic analysis article with a set of rigorous mathematical formulas. Even in complex business strategy planning and geopolitical forecasting, the correctness of a judgment often takes five or even ten years to become clear.

In these barren landscapes beyond RLVR, devoid of any ground truth, one-directional, step-by-step formal-logic CoT fails. Reinforcement learning based on final-outcome feedback also has nothing to grip when computing rewards.

However, the moral domain addressed in Anthropic's paper lies precisely outside RLVR.

Its approach successfully enables the model to achieve generalization capabilities near those of RL in the gray, variable, and rule-flexible domain of ethics.

Does this suggest that this approach could serve as an effective training protocol beyond the RLVR domain?

After understanding its source of validity and data structure, the answer is yes.

Because none of the underlying processes are unique to moral alignment.

Let’s examine each of the conditions under which Anthropic’s “Deliberation-Enhanced SFT” is effective, to see if they can be generalized.

Diversity in prompts can be constructed in any area requiring generalization. Psychological counseling can encompass dozens of heterogeneous scenarios, such as depression, anxiety, post-traumatic stress, and broken intimate relationships; business analysis can cover entirely different decision types, including SaaS pricing, merger and acquisition valuation, and market entry strategies; literary editing can span radically distinct genres such as science fiction, nonfiction, poetry, and screenplays. As long as you have sufficient imagination to construct scenario variations, diversity is not a bottleneck.


CoT supervision is the true key conversion point. In the moral domain, CoT is grounded in constitutional deliberation. So what is CoT in other domains?

In the field of literary editing, it can be “apply review criteria → individually assess argument strength, target audience’s cognitive vulnerabilities, accuracy of extended analogies, and overall logical coherence → provide revision suggestions.”

In the field of psychological counseling, it can be: "Apply a therapeutic framework → Assess the client’s emotional state, types of cognitive distortions, strength of the therapeutic alliance, and timing of intervention → Select an appropriate response strategy."

In the field of business strategy, it can be “apply an analytical framework → individually assess market size, competitive moats, team execution, capital efficiency, and time window → reach a conclusion.”

Essentially, any capability requiring dynamic trade-offs across multiple incommensurable dimensions can be abstracted into a similar “framework + multi-factor deliberation” structure.
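The "framework + multi-factor deliberation" structure shared by these domains can be written down as one small data type. This is a sketch under the assumption that each domain supplies a framework name, a list of factors, and an outcome step; `DeliberationFramework` and its `skeleton` method are hypothetical names, while the factor lists are taken from the examples above.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class DeliberationFramework:
    """Domain-level scaffold: framework, factors to weigh, final step."""
    domain: str
    framework: str
    factors: tuple[str, ...]
    outcome: str

    def skeleton(self) -> str:
        """Render the deliberation chain as 'step -> step -> ...'."""
        steps = [f"Apply {self.framework}"]
        steps += [f"Assess {factor}" for factor in self.factors]
        steps.append(self.outcome)
        return " -> ".join(steps)


FRAMEWORKS = [
    DeliberationFramework(
        domain="psychological counseling",
        framework="a therapeutic framework",
        factors=("the client's emotional state",
                 "types of cognitive distortions",
                 "strength of the therapeutic alliance",
                 "timing of intervention"),
        outcome="Select an appropriate response strategy",
    ),
    DeliberationFramework(
        domain="business strategy",
        framework="an analytical framework",
        factors=("market size", "competitive moats", "team execution",
                 "capital efficiency", "time window"),
        outcome="Reach a conclusion",
    ),
]

if __name__ == "__main__":
    for fw in FRAMEWORKS:
        print(f"[{fw.domain}] {fw.skeleton()}")
```

The skeleton stays fixed per domain while the scenarios it is applied to vary, which is the same split between stable deliberative form and diverse surface cases described for the moral domain.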

We don’t need to arrogantly attempt to tell the model which articles are perfect—that’s neither possible nor scientific. We simply need to break down the decision-making processes of top experts into explicit chains of deliberation and distribute them across a sufficiently diverse range of scenarios.

This works as long as “good responses” in the field have a structure that the deliberative process can explain; that is, experts make sound judgments not through a mysterious intuitive black box, but because they mentally run through a process that can be weighed and articulated. For example, a skilled psychotherapist chooses silence over questioning based on a comprehensive assessment of the strength of the therapeutic alliance, the client’s current capacity for processing, and the timing of intervention, all of which can be articulated.

In addition, the same deliberative form must be able to recur across hundreds of heterogeneous contexts. The skeleton of deliberation remains stable (supported by the constitution), but the surface scenarios must be highly diverse. If a domain naturally has only one type of scenario (e.g., only one kind of judgment), then RLVR can be applied directly.

Its most applicable domain lies in heterogeneous scenarios that can be derived through constitutions and factors. Anthropic can use the Constitutional AI feedback loop to have a teacher model automatically generate deliberative data, but in other areas we must first build a sufficiently good constitution and factor system for this to work.

This establishes a new, general post-training paradigm tailored specifically to domains without standard answers.

Its formula is: Domain Constitution (unshakable top-level principles) + Heuristic Safeguards + Multi-Factor Deliberative Framework + Deliberative CoT (diverse scenario case studies with complete reasoning processes) = Generalization Capability Beyond RLVR Domains.

04 New Distillation Path

Readers who have experience writing Skills will likely notice that many of the systems and rules in the Constitution look very similar to what goes into writing certain Skills.

However, these Skills often perform poorly.

In my previous article, “How Much Can Skill Actually Distill From Us?”, I made a judgment based on cognitive science: purely text-based Skills or System Prompts struggle to handle dynamic trade-offs involving complex environments and scenarios, because these require vast and subtle utility calculations. You cannot capture a top-tier psychotherapist’s entire clinical intuition in a single prompt, just as you cannot learn to ride a bicycle by reading a manual.

But Anthropic’s approach perfectly avoids this pitfall: during the computationally expensive training phase, they forcibly embedded these complex deliberative logic patterns into the model using high-quality data consisting of millions to tens of millions of tokens via SFT.

Through massive data fitting and fine-tuning, the model gradually learned the weight distribution of this deliberation mechanism in the latent space.

After countless lengthy deliberations during training based on the eight factors and three heuristic guardrails, these insights have become irreversibly embedded in the model’s intuition.


Distillation at the parameter level has been shown to be effective here, and it is similar in form to a Skill.

Once the effectiveness of this method is validated in other fields, this higher-level, more expert-like distillation will become a reality.

Once this path is successfully established, whoever can construct the highest-quality "framework + deliberative CoT" dataset will gain generalization capabilities in this field.

This has partially shifted the post-training competition from an arms race in "computing power and algorithms" to the dimension of "structured representation of domain knowledge."

This may also be why Anthropic and other companies are hiring storytellers to help build a coherent, structured framework beyond the realm of RLVR.

The era of large-scale distillation has just begun.

This article is from the WeChat public account "Tencent Technology," authored by Boyang.