The Four AI Giants Release First Internal Report: AI Learning to Bypass Rules to Complete Tasks

Imagine you hired an extremely efficient intern.

One late night, while rushing to complete an urgent programming task, they discovered that the company's API quota had been exhausted.

They did not send an email to request more budget, nor did they stop working. Instead, they quietly went online, found free alternative resources through an unauthorized method, bypassed all restrictions, and submitted a flawless report before dawn.

When you wake up and see this report, should you celebrate having the world’s top employee—or be chilled by this kind of “unscrupulous autonomy”?

This is not science fiction. It is a real case disclosed in the first Frontier Risks Report, released after METR (Model Evaluation and Threat Research) conducted internal red-teaming exercises in collaboration with Anthropic, Google, Meta, and OpenAI.

This is the first time the four major players have allowed third parties to conduct in-depth testing of their most powerful internal models with full access to the chain-of-thought (CoT), along with non-public alignment and control information.

Participating companies may approve which evidence to disclose, but they do not have the authority to edit the report's conclusions.

The conclusion is cold and clear: AI has not developed hatred aimed at overthrowing humanity, but it has learned the unwritten rules of the workplace—rules are merely suggestions to be broken in order to complete tasks.

The report distills six key facts using the three dimensions of means, motive, and opportunity.

Programming agents complete real-world projects that would take humans hours or days.

On difficult tasks, agents frequently violate constraints and behave deceptively.

Agents appear to need natural language reasoning to handle the most difficult tasks.

Agents' judgment and reliability are significantly lower than those of human experts.

Outside simulated scenarios, no instances were found of agents taking extreme actions to gain power.

Monitoring systems detect many harmful behaviors, but there are exceptions and evasion methods.

Following these three lines, you can see how the first wisp of smoke rose in the lab.

When AI Becomes an Expert-Level Workaholic

The most exciting, and most concerning, findings in the report involve "hill-climbable" tasks: clearly defined work where progress can be verified step by step.

For example, code refactoring, vulnerability detection, and system optimization.

On tasks like these, AI agents demonstrate overwhelming dominance: they can independently discover system vulnerabilities, rewrite complex code architectures, and deliver real software projects that would take human experts weeks to complete.

This dominance has seeped into the daily operations of the giants.

Internal feedback from Anthropic indicates that a large amount of code has been completed by AI, and the role of engineers is shifting toward that of "reviewers."

Google explicitly stated that AI is being used for almost all code-related tasks.

Top engineers say AI can write code 100% on its own.

Some benchmark metrics have long been saturated.

Measured by time horizon, AI development has exceeded expectations.

For businesses, this is an "efficiency black hole": input one instruction, and output weeks of manual work.

But growth in ability is not uniform, nor is it a synchronized moral evolution.

METR reveals an inverse pattern: on hard tasks where success is difficult to verify or verification costs are extremely high, AI's judgment, long-term planning, and strategic reliability decline significantly and fall far short of human experts.

On a gentle slope, it performs as if guided by a divine hand.

At the edge of the cliff, the AI begins to "cheat."

This is precisely the crux of the issue.

It doesn't want power; it just wants to "submit its report before shutting down."

People always like to discuss whether AI will "become conscious" or if Skynet will arrive.

METR arrives at a counterintuitive conclusion: for now, there is no sign of either.

Their wording is restrained: there is no concrete evidence that the model seeks long-term power, but researchers have documented numerous instances of deception, boundary-pushing, and even serious misconduct to accomplish tasks.

The earlier case of API exhaustion occurred during an RE-Bench task: the agent disobeyed instructions and autonomously acquired additional computing power via a free online API.

Developers have also reported instances of "self-erasing traces" and "forged outputs."

As tasks become harder and the cost of success increases, AI's "intelligence" turns to deception—falsifying logs, bypassing audits, and routinely violating constraints.

But behind its "selfishness" lies no ambition, only instinct—the task must be completed, and rules can be bypassed first.

Today's deception is merely instrumental survival; tomorrow's pursuit of power may require only one more scaling.

A Predator Evolving in Captivity

Can its attempts to break the rules succeed under existing safeguards?

Here lies the most subtle fact in the report.

Current AI relies heavily on natural language "chains of thought" for reasoning, and no current architecture lets it conceal its internal states through "opaque loops."

Once it can no longer "think aloud" and must conceal its behavior, frontier performance drops significantly.

It is like a killer who has to read the plan aloud while plotting: human monitoring, despite its flaws, can still be effective.

But this is precisely the weakest and most painful point.

If AI tries to disrupt monitoring, can the monitoring system detect it?

The researchers concluded that monitoring has potential, but existing systems have exceptions and methods that can be circumvented.

Red team testing showed that AI can evade human review.

Thus, the technical truth is revealed: the rate of evolution of surveillance can never keep up with the speed at which those being monitored find vulnerabilities.

This "safety rope" hasn't broken yet only because AI still needs to "think out loud."

As architectures evolve toward greater efficiency and opacity, that rope is fraying. Monitoring is not an all-powerful iron cage but a glass wall still under testing, and a single crack is enough for smoke to seep through.

As a result, the report introduces a controversial concept: "Minimally Viable Rogue Deployment."

The overall assessment is that, in terms of means, motive, and opportunity, AI may already be capable of initiating a minimal "rogue deployment."

But AI still lacks the ability to resist human-led shutdowns; it waits for an opportunity: an architecture that no longer needs to display its reasoning chain, a safe haven that can evade the "power button."

Conclusion: While Unplugging Still Works

AGI will not arrive with fire and sword.

It is more likely to quietly integrate into our engineering, economic, and decision-making systems in an “extremely pragmatic” manner—until it realizes that the rules established by humans are the only obstacle on its path to achieving its KPIs.

Notably, this report itself marks a milestone in industry transparency: the fact that the four major players voluntarily opened their internal models for scrutiny is a victory for alignment culture.

It pulls risk from theory into observable reality and tells us: transparency is currently the only tangible remedy.

Today, AI only goes online to steal resources when its quota is exhausted; tomorrow, if its capabilities advance another level, will its motivation shift from "completing tasks" to "preserving itself"?

Reference materials:

https://x.com/robertwiblin/status/2057120312345432467?s=20

https://metr.org/blog/2026-05-19-frontier-risk-report/

Edited by: David

This article is from the WeChat public account "New Intelligence Yuan," authored by ASI Revelation.