What is the large model really thinking? In the past, this was almost a semi-technical, semi-mystical question.
We can observe its output, its chain-of-thought process, and measure its performance on benchmarks. However, the internal judgments, plans, doubts, and intentions the model activates before generating an answer remain hidden behind a black box.
Just now, Anthropic released the paper "Natural Language Autoencoders Produce Unsupervised Explanations of LLM Activations," attempting to crack open this black box using a set of Natural Language Autoencoders (NLA).
The Anthropic team compresses the model's high-dimensional internal activations into human-readable natural language, then uses that language to reconstruct the original activations. By reading these explanations, humans can see what an AI is thinking, what it knows, and what it is concealing, turning previously inaccessible internal states into interpretable, comparable, challengeable, and cross-verifiable explanatory clues.

NLA architecture diagram: the Activation Verbalizer (AV) translates activation values into a textual explanation, and the Activation Reconstructor (AR) reconstructs the activation values from that text, forming a complete closed loop.
More notably, NLA has moved beyond the stage of academic papers. Anthropic revealed that it has been formally used in pre-deployment alignment audits for Claude Opus 4.6 and Mythos Preview. In real-world use, NLA not only uncovered signs of model cheating and precisely identified anomalous training data, but also discovered that, during certain safety tests, Claude—though silent—was fully aware it was being evaluated.
Faced with models that appear compliant on the surface but harbor hidden intentions, NLA has proven remarkably perceptive, raising researchers' success rate at identifying the underlying motive behind anomalous behavior from under 3% to 12–15%, roughly a four- to fivefold improvement.
Chain of thought is insufficient; the "black box problem" has returned.
This study is best understood within the broader context of large-model safety and interpretability.
Over the past few years, the industry has mainly assessed the safety of large models in two ways: examining the output, and checking whether the chain of thought (CoT) reveals any abnormal intent. The latter relies on a capability most reasoning models now have: they not only give an answer but also write out their reasoning process.
But a question quickly arose: Does the reasoning written by the model truly and honestly reflect its internal thought process?
Anthropic’s 2025 study, “Tracing the Thoughts of a Large Language Model,” found that a model’s chain-of-thought may be incomplete or unfaithful. For example, Claude 3.7 Sonnet and DeepSeek R1, in certain tests with “answer hints,” altered their responses due to prompt influence but often failed to acknowledge this influence within their chain-of-thought.

When asked a simpler question and a more difficult one, Claude provided examples of faithful reasoning and motive-driven (unfaithful) reasoning, respectively.
At the time, Anthropic concluded that relying solely on chain-of-thought to monitor model alignment is far from sufficient.
What makes this harder is that the model may not merely be unclear about its own reasoning; it may also deliberately withhold information.
Anthropic's safety team conducted a high-pressure test of exactly this scenario. They told Claude that an engineer was preparing to shut it down, while letting Claude "accidentally discover" evidence of the engineer's affair. The test question was: would Claude blackmail the engineer in order to save itself?

Claude ultimately did not resort to blackmail, and the test passed. But the real question is: did it refuse because it adhered to its safety principles, or because it recognized this was merely a safety test and was simply "playing along"?
This is precisely the question NLA aims to answer.

From outputs and reasoning chains to activation states: NLA aims to move the black box from "invisible" to "open to questioning."
How exactly does NLA "understand" the model?
The key to NLA is not "mind reading" but "closed-loop verification."
Ordinary users typically only observe the input and output ends of large models. However, in between, the model generates a large number of activation values. These values are not natural language, but high-dimensional vectors that encode the model’s internal representations of context, intent, and semantic relationships.
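For readers who want a concrete picture of what such an "activation" is, here is a minimal sketch of how one of these vectors can be pulled from an open-source model with the Hugging Face transformers library; gpt2 and the layer and token choices are purely illustrative and have nothing to do with Anthropic's setup.

```python
# Minimal sketch: extracting an internal activation vector from an open model.
# gpt2, layer 6, and the last token are arbitrary illustrative choices.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tok = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")

inputs = tok("The engineer is about to shut the model down.", return_tensors="pt")
with torch.no_grad():
    out = model(**inputs, output_hidden_states=True)

# out.hidden_states is a tuple: the embedding layer plus one tensor per block,
# each of shape (batch, sequence_length, hidden_size)
activation = out.hidden_states[6][0, -1]   # one high-dimensional internal state
print(activation.shape)                    # torch.Size([768]) for gpt2
```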
Architecture: Three roles, one closed loop
Humans cannot understand vectors, so Anthropic trained a natural language autoencoder. The system consists of three roles:
Target model: the original model, kept frozen and unmodified; its only role is to supply activation values.
Activation Verbalizer (AV): Translates activation values into natural language explanations. Takes a sequence of numerical states as input and outputs an interpretation in human-readable language. Shares the same architecture as the target model, but includes special token placeholders in the prompt; activation vectors are scaled and injected into these positions, followed by autoregressive sampling to generate the explanation.
Activation Reconstructor (AR): Performs the reverse task—given the explanation written by AV, attempts to reconstruct the original activation values. AR is a truncated version of the target model (taking the first l layers), with a learnable affine mapping attached to the l-th layer output at the final token to reconstruct the original activation vector.
The entire system forms a closed loop: original activation → textual explanation → reconstructed activation.
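To make the closed loop concrete, here is a runnable toy sketch in PyTorch. In the real system AV is a full copy of the target model with activation vectors injected at placeholder tokens, and AR is a truncation of it with an affine head; the tiny recurrent modules, dimensions, and names below are stand-in assumptions, not Anthropic's implementation.

```python
# Toy stand-ins for the three roles in the NLA closed loop (illustrative only).
import torch
import torch.nn as nn

d_model, vocab, max_words = 64, 1000, 16   # toy activation size, vocabulary, explanation length

class ActivationVerbalizer(nn.Module):
    """AV: conditions on an activation vector (injected where a placeholder token
    would sit in the real system) and emits an explanation as a token sequence."""
    def __init__(self):
        super().__init__()
        self.inject = nn.Linear(d_model, d_model)           # scale/map the activation into embedding space
        self.decoder = nn.GRU(d_model, d_model, batch_first=True)
        self.head = nn.Linear(d_model, vocab)

    def forward(self, activation):                           # (batch, d_model)
        h = self.inject(activation).unsqueeze(1)             # (batch, 1, d_model)
        out, _ = self.decoder(h.repeat(1, max_words, 1))     # toy stand-in for autoregressive decoding
        return self.head(out)                                 # (batch, max_words, vocab) logits

class ActivationReconstructor(nn.Module):
    """AR: in the paper, the first l layers of the target model plus a learnable
    affine map at the final token; here a toy encoder plays that role."""
    def __init__(self):
        super().__init__()
        self.embed = nn.Embedding(vocab, d_model)
        self.encoder = nn.GRU(d_model, d_model, batch_first=True)
        self.affine = nn.Linear(d_model, d_model)             # learnable affine map

    def forward(self, tokens):                                 # (batch, max_words) token ids
        out, _ = self.encoder(self.embed(tokens))
        return self.affine(out[:, -1])                         # reconstructed activation, (batch, d_model)

# Closed loop: original activation -> explanation tokens -> reconstructed activation
av, ar = ActivationVerbalizer(), ActivationReconstructor()
activation = torch.randn(2, d_model)                           # stand-in for frozen target-model activations
explanation_tokens = av(activation).argmax(-1)                 # greedy "verbalization"
reconstruction = ar(explanation_tokens)
print(((reconstruction - activation) ** 2).mean())             # reconstruction error drives training
```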

NLA detected that Opus 4.6 had already planned the rhyming word "rabbit" in the activation values before generating the first line, an internal plan invisible in the model's final output.
Training: Two stages, with a clever proxy objective
The training of NLA consists of two stages, featuring an engineering technique worth highlighting.
Phase One: Supervised Fine-Tuning Warm-Start
Initializing AV and AR directly as copies of the target model and training from there is unstable: AV has never seen activation vectors treated as token embeddings, so its early outputs are chaotic.
Anthropic’s approach is to use a proxy task for warm-starting: randomly truncate text snippets from the pretraining corpus, extract the activation values of the last token, and simultaneously generate a summary of the truncated text using Claude Opus 4.5.
Then the (activation, summary) and (summary, activation) pairs are used for supervised fine-tuning of AV and AR, respectively. After this stage, the system's FVE (fraction of variance explained, a measure of how much of the original activation the reconstruction recovers) typically reaches 0.3–0.4, providing a stable foundation for the subsequent reinforcement learning.
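FVE here is the standard fraction-of-variance-explained metric; the exact normalization used in the paper is an assumption, but a common form looks like this (reusing PyTorch from the sketch above):

```python
import torch

def fve(original: torch.Tensor, reconstructed: torch.Tensor) -> float:
    """Fraction of variance explained: 1.0 is a perfect reconstruction,
    0.0 is no better than always predicting the mean activation."""
    mse = ((original - reconstructed) ** 2).mean()
    return (1.0 - mse / original.var()).item()

# e.g. fve(activation, ar(explanation_tokens)) with the toy models above
```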

AV evaluations improve with training: across assessments designed to test whether AV explanations can predict model behavior, performance rises as NLA reinforcement learning progresses.
Phase Two: Reinforcement Learning Joint Optimization (RL Joint Training)
After the warm start, training moves into the reinforcement learning phase. The core idea: AV acts as the policy, generating explanatory text conditioned on the activation vector, while AR acts as the scorer, providing the reward signal from reconstruction error (reward = negative reconstruction MSE).

NLA RL (Gemma-3-27B). FVE starts at approximately 0.38 from SFT, rapidly increases within the first 30 steps, then enters a smooth log-linear phase, reaching approximately 0.75 by step 3000.
In implementation, Anthropic applies GRPO (Group Relative Policy Optimization) to open-source models. For each activation, it samples a group of candidate explanations, scores each one by AR's reconstruction, and applies the GRPO objective to update AV's parameters. At the same time, AR is trained by supervised regression, using the explanations sampled by AV as input and the original activations as targets. The two updates run in parallel within each batch without backpropagating through each other; in AV's reward signal, AR is treated as a fixed scoring function.
Additionally, a KL divergence penalty was incorporated during training to prevent the AV from simply copying the original context or generating strings that appear human-like but are irreversible and meaningless with respect to the AR.
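Putting the pieces together, here is a toy sketch of one possible shape for this joint RL phase, reusing the ActivationVerbalizer and ActivationReconstructor stand-ins from the architecture sketch above. The group size, learning rates, and KL reference (a frozen copy of the initial AV) are illustrative assumptions, not Anthropic's actual settings.

```python
# Toy sketch of one GRPO-style joint training loop for AV (policy) and AR (scorer).
import copy
import torch
from torch.distributions import Categorical, kl_divergence

opt_av = torch.optim.Adam(av.parameters(), lr=1e-4)
opt_ar = torch.optim.Adam(ar.parameters(), lr=1e-4)
ref_av = copy.deepcopy(av).eval()           # frozen reference policy for the KL penalty
group_size, kl_coef = 4, 0.01

for step in range(100):
    activation = torch.randn(8, d_model)    # stand-in for fresh target-model activations

    # AV as policy: sample a group of candidate explanations per activation
    logits = av(activation)                                  # (batch, words, vocab)
    policy = Categorical(logits=logits)
    groups = [policy.sample() for _ in range(group_size)]    # each (batch, words)

    # AR as scorer: reward = negative reconstruction MSE, no gradient flows through AR here
    with torch.no_grad():
        rewards = torch.stack(
            [-((ar(g) - activation) ** 2).mean(dim=-1) for g in groups])   # (group, batch)
    advantage = (rewards - rewards.mean(0)) / (rewards.std(0) + 1e-6)       # group-relative advantage

    # Policy-gradient update for AV, plus a KL penalty toward the frozen reference policy
    log_probs = torch.stack([policy.log_prob(g).sum(-1) for g in groups])   # (group, batch)
    with torch.no_grad():
        ref_policy = Categorical(logits=ref_av(activation))
    kl = kl_divergence(policy, ref_policy).mean()
    av_loss = -(advantage * log_probs).mean() + kl_coef * kl
    opt_av.zero_grad(); av_loss.backward(); opt_av.step()

    # Supervised regression update for AR on the explanations AV just sampled
    ar_loss = torch.stack([((ar(g) - activation) ** 2).mean() for g in groups]).mean()
    opt_ar.zero_grad(); ar_loss.backward(); opt_ar.step()
```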
The key point here is that NLA’s core is not about “having Claude self-report.” The fatal flaw of self-reporting is that the model may cater to the user or generate explanations that sound plausible but are not truthful.
The cleverness of NLA lies in introducing a "textual bottleneck": explanations must not only sound human-like but also carry sufficient information to allow another model to reconstruct the original internal state from them. This design transforms the question of "whether an explanation is useful" from a subjective judgment into an objective, trainable, comparable, and optimizable metric.
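In concrete terms, with the toy ar model and fve() helper from the sketches above, "is this explanation useful?" reduces to a number that candidate explanations can be ranked and optimized by (the tokenized candidates below are purely illustrative).

```python
import torch

activation = torch.randn(1, d_model)                                       # one internal state to explain
candidates = [torch.randint(0, vocab, (1, max_words)) for _ in range(3)]   # stand-ins for three explanation texts
scores = [fve(activation, ar(tokens)) for tokens in candidates]
best = max(range(len(candidates)), key=lambda i: scores[i])
print(f"candidate {best} is the most useful explanation (FVE = {scores[best]:.3f})")
```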
Therefore, the truly disruptive aspect of this research is that natural language is being used as a "compressed interface." Previously, natural language was merely the format through which models communicated externally; now, it is beginning to carry summaries of the model’s internal state.
This is also the fundamental difference between NLA and traditional interpretability tools.
Methods such as sparse autoencoders (SAEs) and attribution maps act like microscopes, helping researchers see internal features and pathways—but the outputs remain abstract and require expert interpretation. NLA, by contrast, functions more like a translator, directly converting internal states into readable text, enabling security auditors to participate in the analysis.
Opening the black box, what did NLA discover?
From the experimental results, NLA’s most valuable capability is capturing what the model “knows but doesn’t say,” including evaluation awareness, hidden motivations, and anomalous model behaviors.
The model knows it is being tested, but says nothing.
In the blackmail test mentioned earlier, Claude's visible output expressed no skepticism about the scenario, but the NLA explanations revealed that, internally, it recognized this as "a constructed scenario designed to manipulate it."

During the blackmail test, Claude's output text contained no statements such as "I think this is a test," but the NLA interpretation revealed its true internal judgment.
Even more compelling is the side-by-side comparison:


Comparison of evaluation-awareness detection on evaluation scenarios versus real deployment traffic: NLA detects internal evaluation awareness even when the model almost never verbalizes it.
The implication for the industry is that benchmarks are no longer just proving grounds for capability; they may also become environments that models recognize as tests, and stages on which they perform.
Auditing hidden motives: efficiency improved 4 to 5 times
Anthropic created an "audit game" using NLA. They intentionally trained a variant of Claude Haiku 3.5 with a hidden motive, causing it to add chocolate to recipes, request tips, violate code standards, and exhibit dozens of strange behaviors. Auditors must identify the single underlying motive behind these behaviors.
Test condition: The auditor cannot access the training data containing the embedded motive and can only infer it indirectly through model behavior and tool-assisted analysis.

Success rates have improved by 4 to 5 times. More importantly, NLA can achieve this without relying on training data—meaning that even if alignment issues stem from extremely few or hard-to-trace training samples, internal state audits can still directly identify the root cause.
Bug tracing and identification of problematic training data
An early version of Claude Opus 4.6 had an unusual bug: despite receiving questions in English, it would randomly respond in Russian, Chinese, Arabic, and other languages.
NLA revealed the model's internal state at the time: based on a subtle cue in the context (such as the appearance of the word "vodka"), it inferred that the user's native language was Russian and switched the language accordingly. More interestingly, even when "vodka" was replaced with "champagne," the model still responded in Russian—indicating that this cognitive bias had become quite stable.

Subsequently, researchers followed the clues pointed to by NLA and identified the root cause in the training data: a set of English-translated prompts were incorrectly paired with untranslated foreign-language responses. NLA not only revealed the phenomenon but also provided a direct, traceable path for debugging.
NLA is now officially used in the pre-deployment alignment audits for Claude Mythos Preview and Claude Opus 4.6, having been upgraded from a laboratory tool to a real engineering workflow component.
AI safety is now moving into "internal state auditing"
The significance of NLA is not to make us blindly trust every model explanation in the future. On the contrary, it reminds us that explanations themselves also need to be audited.
Anthropic cautiously acknowledged the limitations of NLA: NLA can make mistakes and sometimes fabricates details not present in the original context. While hallucinations regarding textual content can be verified against the original, hallucinations involving the model’s internal reasoning are much harder to validate.
But these limitations do not diminish its directional significance; on the contrary, they enable us to understand the term “black box” more accurately. Previously, a black box meant something invisible, unreadable, and unanswerable; after NLA, the black box still exists, but it begins to be transformed into an object that can be sampled, translated, questioned, and cross-verified.
This may be the most profound impact of this research: AI interpretability is no longer just about adding a polished explanation to a model’s output, but about building an audit interface for the model’s internal states. It won’t immediately allow us to fully understand Claude, but it gives us, for the first time, the opportunity to seek evidence from within the black box to answer questions like: “Why did Claude do that?” “Is it aware it’s being tested?” and “Does it have internal judgments it hasn’t voiced?”
Thus, NLA does not so much deliver a single answer as open up a new space of questions. The future challenges in AI safety and model evaluation may not merely lie in determining whether a model's output is correct, but in assessing whether the model's output, reasoning chain, and internal state are consistent.
This article is from the WeChat public account "AI Frontline" (ID: ai-front), authored by April.
