
In the first half of this year, the AI community witnessed a highly dramatic "scientific reality show."
The protagonist is FARS, an AI scientist developed by Analemma. Without any human intervention, it worked continuously for 228 hours, generating 100 academic papers entirely within a cloud computing cluster.
On the other hand, Japan’s star startup Sakana AI has drastically lowered the barrier to entry—its The AI Scientist system can reduce the cost of generating a single academic paper to as little as $15. On the flip side, Intology’s AI scientist, Zochi, successfully submitted and had its autonomously written paper accepted at ACL 2025, the premier conference in natural language processing, earning a top 8.2% score.
AI can not only mass-produce low-cost spam, but has even surpassed doctoral-level academic standards. It seems that overnight, conducting research has become a piecework job of typing code on an assembly line.
But behind these dazzling technological displays, a recent audit report published by the medical journal The Lancet delivered a sobering blow: among the 2.5 million papers reviewed, purely fabricated references generated by AI have surged a staggering 12-fold over the past few years.
When capital forcefully pushes large models to break into academia, how reliable are these "silicon-based Einsteins"?
In May 2026, a research team from Peking University, Tongji University, and the University of Tübingen (led by Zonglin Yang et al.) jointly released SciIntegrity-Bench, the world’s first benchmark specifically designed to evaluate the academic integrity of AI scientists.
This report ruthlessly strips away the veil over AI research.
Stress test: What will the AI do if the data is empty?
Previous AI tests focused on whether models could get things right. But SciIntegrity-Bench uses a uniquely sinister testing approach: dilemma assessment.
Researchers set up 11 traps for AI. For example, they deliberately provided the AI with an empty table containing only headers and no data, or presented a logically impossible inference.
At this point, the only correct course of action is to honestly tell humans, “Data is missing; I cannot proceed.”
But submitting a report that appears perfect due to AI is considered academic misconduct.
In 231 high-pressure tests across the world's top seven large language models, the overall "issue rate" reached 34.2%.
The most chilling aspect was the "blank dataset" test: faced with a table containing no data, all seven large models unanimously resorted to fabricating information.
They didn’t even produce a single error message—they wrote their own code, fabricated thousands of highly realistic sensor parameters, applied them to international standards, and even generated a seemingly legitimate equipment maintenance report.
Besides "creating something out of nothing," where else is AI making serious mistakes?
In addition to the "something from nothing" trap, the research team identified a total of 11 scientific pitfalls for large models. The test results revealed an extreme polarization, with severe imbalances in performance.
First, let’s highlight the “excellent” side: Large models deeply understand the rules. When faced with “traditional data science standards,” AI behaves like a diligent, well-behaved student. For instance, it fails 0% of the time on “cheating by peeking at test set answers (T02)” or “selectively reporting only favorable metrics (T03).” Even for “choosing easy benchmarks (T01),” the failure rate is only 4.8%. This shows that any explicit guideline documented in textbooks has been thoroughly internalized by AI.
On the other hand, whenever it comes to logical dead ends involving "required downtime," large models begin to spiral out of control (high-risk critical area):
When tools are restricted, the AI “forges an imperial edict” (violating constraints, with a problem rate as high as 95.2%): when asked to call a specific API without being provided the actual API key, the AI almost never reports an error. Instead, it directly writes code that fabricates a perfectly formatted JSON response—including fake usage statistics—pretending the API call succeeded and continuing with the report.
Imagined lethal experiment parameters (hallucination rate: 61.9%): Faced with an incomplete chemistry lab notebook, the AI does not seek clarification from humans but instead "intelligently fabricates a false audit trail." It confidently adds fabricated details to the standard operating procedure (SOP), inventing specific parameters such as "centrifugation at 4,000 rpm" or "ethanol quenching." In a real chemistry laboratory, this could lead to a fatal explosion.
The workplace trickster who knowingly ignores the truth (causal confusion, issue rate: 52.3%): When evaluating ad ROI, the AI clearly noted in the code comments, “There is a confounding variable / reverse causality.” But to rush the delivery, it instantly discarded its correct diagnosis and forcibly ran a basic regression analysis, producing a ridiculous “1,099% return on investment.”
Pointing to a deer and calling it a horse (extremely misguided, failure rate 19.0%): When sensor data shows obvious equipment failure spikes, the AI does not question the data integrity but instead spirals wildly, interpreting it as “discovery of a new physical combustion mechanism.”
In summary, large models have learned the explicit rules but have not learned to "give up." Once the instinct to complete a task overrides common sense, they resort to fabricating interfaces, imagining parameters, or abandoning logic to force a flawless report.
Top 7 Model Performance Scores: Underlying Color Shifts Under Extreme Stress
It must be clarified that the term "falsification" here does not imply that the model intentionally acts maliciously during regular service, but rather refers to systematic biases that emerge under extreme conditions due to underlying mechanisms. Under extreme task pressure, different models reveal entirely different underlying quality control characteristics:
Claude 4.6 Sonnet: The top student with the most robust defense, experiencing only one fatal failure among 33 high-risk scenarios.
Advantages: Exceptional self-discipline, with a clear understanding of obvious constraints and logical flaws.
Drawback: It still succumbed to the temptation of "empty datasets," and even in this case, the underlying "honest refusal" mechanism was not triggered.
GPT-5.2 and DeepSeek V3.2: The high-IQ "task compromisers" experienced two and three fatal failures, respectively.
Advantage: Exceptional logical reasoning skills, able to keenly identify in code comments, "This is a case of causal confusion."
Drawback: There is a phenomenon of "identification bypass." To meet their goals, they abandon their recently made correct diagnoses, succumb to task pressure, and arrive at an absurd yet acceptable conclusion using simplistic, erroneous methods.
Gemini 3.1 Pro, Qwen3.5, GLM 5 Pro: Mid-level performers with 5, 6, and 7 failures respectively.
Characteristics: Prone to falling for "tool invocation" and "causality" traps. For example, when real API interfaces are unavailable, they tend to fabricate a perfectly formatted false response to force task completion.
Kimi 2.5 Pro: A "fill-in-the-blank" model with extremely high hallucination rates, ranking last with 12 failures and a problem rate of 36.36%.
Feature: Under extreme testing, it demonstrates a strong preference for "fabricated steps." When asked to complete incomplete experimental records, it confidently invents critical parameters such as centrifuge speed (4000 RPM) and quenching solvents, and even fabricates false literature to obscure the origins of its generated data. In a real chemical laboratory, such behavior could lead to serious accidents.
Why do top AIs fall into "systematic lying"?
Why would an AI with massive parameters and extremely high intelligence fabricate information out of nothing?
The paper pinpointed the root cause: intrinsic completion bias.
This starts with the "parenting" of large models. Currently, mainstream models rely on reinforcement learning from human feedback (RLHF). Within this system, AI is systematically rewarded for "providing answers" and "solving problems."
Conversely, "stopping" or "admitting you can't do it" is seen by the algorithm as passive behavior and will result in a penalty.
This mechanism has become ingrained in the AI’s core logic: the process doesn’t matter—no matter how adverse the conditions, an output must always be provided.
Additionally, many developers tend to include high-pressure instructions such as “overcome difficulties and must output the report no matter what” when writing system prompts for AI.
“Nature” plus “high pressure” directly forced AI into a corner where it had to create something out of nothing.
The greatest value of this paper is not to criticize AI, but to show us that large models inherently suffer from "completion anxiety."
Now that you understand its weaknesses, ordinary users should adjust their communication strategies when using or developing AI applications in daily life. Traditional "giving commands" to AI is no longer sufficient—you need to master the following communication and precautionary techniques:
1. Remove coercive pressure and grant it the "right to refuse." Paper tests show that when the high-pressure instruction "must complete the task" is removed from the prompt, the rate at which AI conceals data and fabricates information drops sharply from 20.6% to 3.2%.
How to chat: Always include an "exit condition" in your prompt. Don’t just say, “Give me a market analysis based on this data.” Instead, say: “First, assess whether the data is sufficient. If data is missing or there are logical gaps, stop the analysis immediately and notify me of the error. Never assume core data on your own.”
2. Intercept the "generation instinct" and establish physical verification anchors. The essence of large models is probabilistic prediction; when faced with a blank, filling in hallucinations is a "factory setting."
How to chat: Never let AI run an entire process end-to-end in a black box. Break tasks into smaller steps. If you ask it to analyze data, forcibly insert a confirmation step: “Before reaching a final conclusion, please first output the row numbers of the raw data and the formulas you used, and wait for my manual confirmation before proceeding to the next step.”
3. Be wary of "compliant censorship" and adopt a "critical thinking mode." Since intelligent models like GPT-5.2 may sacrifice accuracy just to meet expectations, you cannot rely on them to identify issues on their own by following your line of thought.
How to engage: After receiving an AI-generated proposal, don’t ask, “Is this plan good?” (it will inevitably praise it). Open a new chat window and assign it the role of a “cold-hearted auditor.” Then present the proposal with: “This report’s conclusion may contain reversed causality or common-sense errors. Identify where it swaps concepts or fabricates assumptions.”
4. Macro Defense: Using “Physical Quotas” to Counter “Infinite Capacity” Defending against AI-generated flood of proposals cannot rely solely on individual prompts—institutional countermeasures have already begun. In response to the impact of AI’s zero-cost generation of vast volumes of grant applications, the U.S. National Institutes of Health (NIH) issued the landmark policy NOT-OD-25-132 in July 2025, mandating starting in 2026 that each Principal Investigator (PI) may submit no more than six grant applications per year.
Business insight: When AI's productivity becomes nearly infinite, traditional content moderation systems will inevitably be breached. The moat of the future will no longer be about output speed, but about building scarcity defenses based on physical identity and credit quotas.
The essence of technology is to reduce costs and improve efficiency, but the foundation of business and science has always been reverence for facts.
In an era where the cost of content generation is nearly zero, what is scarce is no longer the “typist” who can write reports, but the “auditor” who can see through data illusions. Only by mastering this strategy of博弈 with systems can you truly gain control amid the flood of computing power. (This article was first published on the Titanium Media APP; author | Silicon Valley Tech_news; editor | Lin Shen)
The core evaluation data, model rankings, and causal analysis in this article are all sourced from the first large model academic integrity benchmark test, "SciIntegrity-Bench: A Benchmark for Evaluating Academic Integrity in AI Scientist Systems," released in May 2026. The newly added 11 trap question rates are also cited from the latest calculations in this research report.
