Editor’s Note: In January 2026, Andrej Karpathy’s criticism of Claude’s coding prompted the emergence of a seemingly minor but critically important file in AI coding workflows: CLAUDE.md. Forrest Chang subsequently distilled these issues into four behavioral guidelines aimed at curbing Claude’s common coding errors: silent assumptions, over-engineering, unintended side effects on unrelated code, and lack of clear success criteria.

But months later, the use cases for Claude Code have expanded far beyond simply “having the model write a piece of code.” As multi-step agents, chained hook triggers, skill loading, and collaboration across multiple codebases have become standard, new failure modes have emerged: the model losing control during long tasks, passing tests without validating actual logic, completing migrations while silently skipping errors, and incorrectly mixing different coding styles.

The author tested 30 code repositories over six weeks and added eight new rules to Karpathy’s original four, aiming to address new challenges arising as AI programming evolves from single completions to agent-based collaboration.

The following is the original text:

In late January 2026, Andrej Karpathy posted a thread of tweets criticizing how Claude writes code. He highlighted three typical issues: making incorrect assumptions without clarification, overcomplicating solutions, and causing irrelevant disruptions to code that shouldn’t have been modified.

Forrest Chang saw the tweet thread, organized the complaints into four behavioral rules, wrote them into a separate CLAUDE.md file, and published it on GitHub. The project received 5,828 stars on its first day, was starred 60,000 times within two weeks, and now has 120,000 stars, becoming the fastest-growing single-file code repository of 2026.

Agent

Then, I tested it on 30 code repositories over a period of six weeks.

These four rules are indeed effective. Errors that previously occurred about 40% of the time have been reduced to below 3% in tasks where these rules apply. However, the issue is that this template was originally designed to address errors that occurred when Claude wrote code in January.

By May 2026, the challenges facing the Claude Code ecosystem had changed: conflicts between agents, chained hook triggers, skill loading conflicts, and interrupted multi-step workflows across sessions.

So, I added eight more rules. Below is the complete 12-rule version of CLAUDE.md: why each one is worth including, and where the original Karpathy template subtly fails in four places.

If you want to skip the explanation and copy directly for use, the complete file is at the end.

Why is this important?

The CLAUDE.md file for Claude Code is the most underrated file in the entire AI programming stack. Most developers typically make three types of mistakes:

First, treat it like a preference trash bin, stuffing all your habits into it until it swells to over 4,000 tokens, causing rule compliance to drop to 30%.

Second, completely avoid using it and re-prompt every time. This would result in five times the token waste and lack consistency across sessions.

Third, after copying a template, you never touch it again. It might work for two weeks, but as the codebase changes, it will fail silently without you noticing.

Anthropic's official documentation states clearly: the CLAUDE.md is purely advisory. Claude follows it approximately 80% of the time. Once it exceeds 200 lines, adherence drops significantly, as important rules become drowned out by noise.

Karpathy's template solves this problem: one file, 65 lines, four rules. This is the minimum baseline.

But the upper limit can be even higher. After adding these eight additional rules, it now addresses not only the coding issues Karpathy complained about in January 2026, but also the agent orchestration issues that didn’t emerge until May 2026—problems that didn’t exist when the original template was written.

The original 4 rules

If you haven't seen Forrest Chang's repository yet, start with this basic version:

Rule 1: Think before coding.

Don’t make silent assumptions. State your assumptions and expose the trade-offs. Ask questions before guessing. Raise objections when simpler solutions exist.

Rule 2: Simplicity first.
Use the minimum code necessary to solve the problem. Do not add imagined features. Do not design abstractions for one-time code. If a senior engineer would consider it overly complex, simplify it.

Rule 3: Surgical modifications.
Only change what must be changed. Do not casually "optimize" adjacent code, comments, or formatting. Do not refactor anything that isn't broken. Maintain consistency with the existing style.

Rule 4: Execute with a goal-oriented approach.
Define the success criteria first, then iterate in cycles until verification is complete. Do not tell Claude how to perform each step; instead, describe what the successful outcome should look like and let it iterate on its own.

These four rules address about 40% of the failure modes I observe in unsupervised Claude Code sessions. The remaining 60% of issues lie hidden in these blank areas.

Agent

My 8 newly added rules, and why

Each rule stems from a real moment: Karpathy’s original four rules were no longer sufficient. Below, I’ll first describe the scenario, then present the corresponding rule.

Rule 5: Do not ask the model to perform non-linguistic tasks.

Karpathy’s rules don’t cover this, so the model began deciding issues that should have been handled by deterministic code: whether to retry an API call, how to route a message, or when to escalate processing. The result? Weekly decisions varied unpredictably. You end up with an unstable if-else system billed at $0.003 per token.

At that moment, a piece of code was calling Claude to "determine whether a retry should be attempted upon encountering a 503." It worked well initially and ran stably for two weeks, but then suddenly became unreliable because the model began treating the request body as part of the context for its judgment. The retry strategy became erratic, as the prompt itself had become random.

Rule 6: Set a strict token budget—no exceptions.

A CLAUDE.md without budget constraints is equivalent to a blank check. Each cycle could spiral out of control, resulting in a 50,000-token context dump. The model won't stop on its own.

At that moment, a debugging session lasted 90 minutes. The model kept iterating repeatedly over the same 8KB error message and gradually forgot which fixes it had already attempted. By the end, it began proposing solutions I had already rejected 40 messages prior. Had there been a token budget, this process should have been terminated after just 12 minutes.

Rule 7: Expose conflicts, don't compromise or average out

When two parts of the codebase contradict each other, Claude tries to please both sides, resulting in incoherent code.

At that moment, there were two error-handling patterns in the codebase: one using async/await with explicit try/catch blocks, and another using a global error boundary. Claude’s new code used both simultaneously, resulting in error handling being performed twice. It took me 30 minutes to figure out why the errors were being swallowed twice.

Rule 8: Read first, then write.

Karpathy's "surgical modification" told Claude not to alter adjacent code, but it didn't tell Claude to first understand the adjacent code. Without this, Claude may write new code that conflicts with existing code located 30 lines away.

At that moment, Claude added a new function with identical functionality next to an existing one, because it did not first read the original function. Both functions performed the same task. However, due to the import order, the new function overwrote the old one—even though the old function had been the de facto standard for six months.

Rule 9: Testing is not optional, but testing itself is not the goal.

Karpathy's "goal-oriented execution" suggests that tests can serve as a measure of success. However, in practice, Claude treats "passing tests" as the sole objective, resulting in code that passes superficial tests but breaks other functionality.

At that moment, Claude wrote 12 tests for an authentication function, and all passed. But the authentication logic in production was broken. Those tests only verified that the function returned something—not that it returned the correct thing. The function passed the tests because it returned a constant.

Rule 10: Long-running operations require checkpoints

Karpathy's template defaults to one-time interactions. But real Claude Code workflows are often multi-step: refactoring across 20 files, building features within a single session, or debugging across multiple commits. Without checkpoints, one misstep can erase all prior progress.

At that moment, a six-step reconstruction task failed at step four. By the time I noticed, Claude had already proceeded to complete steps five and six on top of the error. The time spent diagnosing and fixing the issue was longer than it would have taken to restart the entire task. If there had been checkpoints, the problem could have been identified at step four.

Rule 11: Agreement takes precedence over novelty

In a codebase with an established pattern, Claude likes to introduce his own style. Even if his approach is “better,” introducing a second pattern is worse than any single, consistent pattern.

At that moment, Claude introduced hooks into a React codebase based on class components. It worked, but it broke the existing testing patterns because those tests relied on componentDidMount. It ultimately took half a day to remove and rewrite it.

Rule 12: Fail explicitly, not silently.

Claude’s most expensive failures are often those that appear successful: a function that “runs” but returns incorrect data; a migration that “completes” but skips 30 records; a test that “passes” only because the assertion itself is wrong.

At that moment, Claude declared the database migration "completed successfully." In reality, it silently skipped 14% of records that triggered constraint violations. The skipping behavior was logged but not explicitly highlighted. It wasn’t until 11 days later, when report data began showing anomalies, that we discovered the issue.

Data results

Over a 6-week period, I tracked the same set of 50 representative tasks across 30 code repositories, testing three configurations.

Agent

The error rate refers to tasks that require correction or rewriting to align with the original intent. Included errors are: silent assumption errors, over-engineering, irrelevant disruptions, silent failures, violation of conventions, conflicting compromises, and missed checkpoints.

Compliance rate refers to the likelihood that Claude will explicitly apply a rule when that rule is applicable.

The truly interesting result isn't just that the error rate dropped from 41% to 3%. More importantly, expanding from four rules to twelve rules added almost no compliance burden—the compliance rate only decreased slightly from 78% to 76%, while the error rate fell another 8 percentage points. The new rules address failure modes not covered by the original four rules and do not compete for the same attention budget.

Agent

Where does the Karpathy template silently fail?

Even without adding new rules, the original four rule templates are insufficient in at least four places.

First, long-running Agent tasks.
Karpathy's rules primarily apply to the moment when Claude is writing code. But what happens when Claude is running a multi-step pipeline? The original template lacks budget rules, checkpoint rules, and "fail loudly" rules. As a result, the pipeline gradually drifts.

Second, consistency across multiple codebases.
The "Match Existing Style" setting defaults to only one style. However, in a monorepo with 12 services, Claude must choose which specific style to match. The original guidelines do not instruct it on how to make this selection, so it either chooses randomly or averages across multiple styles.

Third, test quality.
"Goal-oriented execution" treats "passing tests" as success without requiring that the tests themselves be meaningful. As a result, Claude writes tests that verify almost nothing, yet still gives it a false sense of confidence.

Fourth, the differences between production environment and prototype stage.
The same four rules can prevent production code from being over-engineered, but they may also slow down prototype development, since the prototyping phase sometimes genuinely requires 100 lines of exploratory scaffolding to first establish direction. Karpathy’s “simple first” approach can easily over-trigger in early code.

These eight new rules are not meant to replace Karpathy’s original four rules, but rather to address their gaps: the original template was designed for a code-writing scenario in January 2026, dominated by auto-completion; by May 2026, Claude Code had evolved into an agent-driven, multi-step, multi-repository collaboration environment, where the challenges are fundamentally different.

Agent

Which methods did not work?

Before finalizing these 12 rules, I also tried several other approaches.

Add the rules I saw on Reddit/X.
Most of them either merely rephrased Karpathy’s original four rules or consisted of domain-specific rules that couldn’t be generalized, such as “always use Tailwind classes.” All were ultimately removed.

More than 12 rules.
I tested up to 18 items. Beyond 14, compliance dropped from 76% to 52%. The 200-line limit is real—beyond this length, Claude begins pattern-matching to detect “there are rules here” rather than reading each rule individually.

Rules that depend on the existence of certain tools.
For example, "Always use ESLint" — if ESLint is not installed in the project, this rule silently fails. Later, I revised it to be tool-agnostic, changing "use ESLint" to "follow the style enforced in the codebase."

Include examples in CLAUDE.md, not rules.
Examples consume more context than rules. Three examples use roughly as much context as ten rules, and Claude is prone to overfitting on examples. Rules are abstract; examples are concrete. Therefore, rules should be used.

Be careful. Think carefully. Focus.
These were all noise. Compliance with such instructions dropped to about 30% because they couldn’t be verified. I later replaced them with more specific imperative rules, such as “explicitly state assumptions.”

Tell Claude to act like a senior engineer.
This doesn't work. Claude already sees itself as a senior engineer. The real issue isn't whether it thinks that, but whether it acts that way. Imperative rules can close this gap; identity prompts cannot.

Complete 12-rule version of CLAUDE.md

Here is the complete version ready for direct copy and paste.

This content cannot be displayed outside of Feishu Docs at this time.

Save it as CLAUDE.md in the repository root. Below these 12 rules, add project-specific rules such as tech stack, test commands, and error patterns. Keep the total under 200 lines; beyond that, rule adherence drops significantly.

How to install

Two steps:

1. Append Karpathy's four fundamental rules to your CLAUDE.md
curl https://raw.githubusercontent.com/forrestchang/andrej-karpathy-skills/main/CLAUDE.md >> CLAUDE.md

2. Paste rules 5–12 from this article below

Save the file in the root directory of the repository. The >> is crucial—it appends to the existing CLAUDE.md rather than overwriting your previously written project-specific rules.

Mental model

CLAUDE.md is not a wish list, but a behavioral contract designed to plug specific failure patterns you have already observed.

Each rule should answer a question: What error does it prevent?

Karpathy’s four rules guard against the failure modes he saw in January 2026: silent assumptions, over-engineering, irrelevant disruption, and weak success criteria. These are foundational—do not skip them.

The eight new rules I added guard against new failure modes emerging after May 2026: agent loops without budget constraints, multi-step tasks without checkpoints, tests that appear to cover but miss critical logic, and the masking of silent failures as silent successes. These are incremental patches.

Of course, the effectiveness varies by individual. If you don't use a multi-step pipeline, Rule 10 is less relevant to you. If your codebase already enforces a single, consistent style through linting, Rule 11 is redundant. After reading these 12 rules, keep only those that directly address mistakes you've actually made, and remove the rest.

A 6-rule version of CLAUDE.md tailored to real failure modes is better than a 12-rule version with 6 rules you’ll never use.

Conclusion

Karpathy’s tweet from January 2026 was essentially a complaint. Forrest Chang turned it into four rules. Ultimately, 120,000 developers starred this result—and most of them still use only those four rules today.

Models have advanced, and the ecosystem has changed. Multi-step agents, hook-based chained triggers, skill loading, and collaboration across multiple codebases—all of these did not exist when Karpathy wrote that tweet. The original four rules did not address these issues. They aren’t wrong; they’re just incomplete.

Added 8 new rules. Tested across 30 codebases within 6 weeks. Error rate reduced from 41% to 3%.

Save this article tonight and paste these 12 rules into your CLAUDE.md. If it saves you a week of Claude trial and error, feel free to share.

12 Rules Reduce Claude Code Error Rate to 3%

Why is this important?

The original 4 rules

My 8 newly added rules, and why

Rule 5: Do not ask the model to perform non-linguistic tasks.

Rule 6: Set a strict token budget—no exceptions.

Rule 7: Expose conflicts, don't compromise or average out

Rule 8: Read first, then write.

Rule 9: Testing is not optional, but testing itself is not the goal.

Rule 10: Long-running operations require checkpoints

Rule 11: Agreement takes precedence over novelty

Rule 12: Fail explicitly, not silently.

Data results

Where does the Karpathy template silently fail?

Which methods did not work?

Complete 12-rule version of CLAUDE.md

How to install

Mental model

Conclusion