The release of MiniMax's M3 model has garnered global attention, with Vercel’s CEO publicly endorsing it, though there is debate within the domestic community regarding its pricing. Developers have validated M3’s capabilities through blind and real-world tests, showing its code generation performance rivals that of Claude Opus 4.8 and placing it among the top ten globally across multiple benchmarks, making it the strongest open-source model to date. The model employs MiniMax’s new Sparse Attention architecture, reducing computational cost to just 1/20th of previous levels under a 1 million token context. MiniMax has also launched the Agent Team feature, enabling collaboration among three types of agents: Leader, Worker, and Verifier. The model weights and full technical report will be open-sourced within ten days, allowing global developers to test the model in real-world projects.

Article author and source: AI New Era

New Intelligence Yuan report

[New Intelligence Yuan Summary] Silicon Valley heavyweights have endorsed it, but the community is in an uproar. Can MiniMax M3 stand up to real-world testing? Developers around the world are already putting it to the test.

Recently, the same model has been trending both domestically and internationally.

Vercel CEO Guillermo Rauch, who has 5.4 million followers, publicly endorsed it, which is extremely rare.

He recommended MiniMax M3, a model entirely developed in China.

However, this same M3 has also drawn significant criticism, with comment sections on many Chinese community platforms descending into heated arguments.

Much of the criticism has focused on the price adjustment of the Token Plan, with many long-term users feeling their rights have been diluted and raising loud objections.

The atmosphere in overseas communities is completely different from that in domestic ones.

Some overseas developers are speculating about M3's architecture parameters, sparse attention mechanisms, and training data scale.

For example, user Rohan on X said that looking at price alone is meaningless; while cost is also important, he is more interested in how the model makes mistakes and its actual performance within an agent system.

Another netizen was more direct, saying, "It's already impressive that M3, as an open-source model, can keep up with Opus and GPT-5—but before I believe these claims, I need to see it fail in real time."

In response to the backlash, MiniMax issued a compensation plan the same day: existing users retain their original benefits, while new users receive a 50% increase in their weekly limit.

With the pricing issue resolved, the most substantive question remains: is M3 genuinely strong, or is its standing an illusion created by leaderboard gaming?

72 hours

A global developer-led "rigorous verification"

To verify M3’s real-world performance, developer Victoria Wu fed the same prompt—generating an animation of a pelican riding a bicycle—to M3, Sonnet 4.6, and Opus 4.8.

She then labeled the three results A, B, and C and asked netizens to guess blind which one came from M3.

The comments section was almost entirely unanimous: "A was too smooth—it must be Opus," "M3 is probably B or C."

The result is in. A is M3.

Similarly, developer JAZII ran a blind side-by-side comparison.

He used the exact same prompt, asking each model to hand-write a Minecraft clone from scratch in HTML using Three.js. The contestants were M3 and Opus 4.8.

Although M3 took slightly longer, JAZII's verdict came down to two words: "Super close."

On the left is M3, on the right is Opus 4.8—did you guess correctly?

On X, Chinese developer "Shijian Ge Minli" pushed M3's multimodal and agentic coding capabilities to their limits, using M3 to create a hand-gesture duel game based on "A Mortal's Journey to Immortality."

In the process, M3 had to understand complex visual gestures and write very long, logic-heavy code. Run end to end, its token consumption was only 20% of what Claude Sonnet required.

AI evaluator Thomas Wiegold, known for his rigor, promptly released a 3,000-word hands-on report.

His evaluation of M3 is: "This is one of the most interesting models I've tested this year."

The last Chinese model to shake Silicon Valley was DeepSeek V4, released six months ago.

This time, the impact brought by MiniMax M3 seems even more profound.

Drop in a 50-page paper, and M3 breaks it down on its own.

Watching others test isn't enough. We took matters into our own hands and chose three especially challenging problems to push the model to its limits.

The first is a 50-page technical report on DeepSeek-V3, packed with dense charts, intertwined formulas and pseudocode, maximizing information density.

We first asked M3 to map out the causal technical chain behind "overlapping communication with computation" to see whether it could lay out the most demanding engineering logic in the paper.

M3 went through 15 rounds of thinking, executed 19 commands, and made 1 tool call.

In the end, it clearly broke down the complete implementation path of the DualPipe scheduling strategy, with no gaps in the logical chain.


Next, we'll test M3's multimodal capabilities.

We uploaded a diagram of the MLA structure and asked the model to identify which mathematical formulas in the text correspond to the dynamic scheduling and projection processes in the diagram.

M3 quickly returned its analysis and pinpointed the matching formulas accurately.

We then raised the difficulty: where a connection in the diagram reflects a constraint that is only implicit in the body text, M3 had to locate it in the diagram and explain the underlying reason.

M3 added annotations directly on the MLA architecture diagram and provided a detailed breakdown of three constraints.

A 2-hour GTC talk, with M3 producing the draft directly.

The second question is more challenging—it’s not just about understanding, but also about writing it out.

The material this time is the full 1-hour-57-minute keynote from NVIDIA's GTC conference, handed to M3 in one go along with the writing guidelines.

The task: after watching the video, produce an in-depth report of 3,000 to 4,000 words that follows the specified guidelines.

Faced with a 1.15GB original video, most ordinary AI tools would likely return an error and fail.

However, backed by the system-level toolkit in MiniMax Code, M3 immediately found a workaround: it invoked FFmpeg to compress and segment the file, clearing its own path forward.
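The compress-and-segment step can be sketched roughly as follows. This is a minimal illustration, not the command M3 actually ran: the file names, the 600-second segment length, the 480-pixel height, and the CRF value are all assumptions chosen for the example. The FFmpeg options themselves (`-vf scale`, `-crf`, `-f segment`, `-segment_time`, `-reset_timestamps`) are standard.

```python
import math
from typing import List


def count_segments(duration_s: float, segment_s: int) -> int:
    """Number of segments needed to cover a video of the given duration."""
    return math.ceil(duration_s / segment_s)


def build_ffmpeg_command(
    src: str,
    out_pattern: str,
    segment_s: int = 600,   # assumed segment length in seconds
    height: int = 480,      # assumed target height; width is derived automatically
    crf: int = 28,          # assumed quality setting (higher means smaller files)
) -> List[str]:
    """Build an FFmpeg command that downscales, re-encodes, and splits a video."""
    return [
        "ffmpeg", "-i", src,
        "-vf", f"scale=-2:{height}",          # shrink frames, keep aspect ratio
        "-c:v", "libx264", "-crf", str(crf),  # re-encode video at lower quality
        "-c:a", "aac", "-b:a", "64k",         # re-encode audio at a low bitrate
        "-f", "segment",                      # use the segment muxer
        "-segment_time", str(segment_s),
        "-reset_timestamps", "1",             # each segment starts at time zero
        out_pattern,                          # e.g. "part_%03d.mp4"
    ]


if __name__ == "__main__":
    duration = 1 * 3600 + 57 * 60  # the keynote runs 1 hour 57 minutes
    print(count_segments(duration, 600))
    print(" ".join(build_ffmpeg_command("keynote.mp4", "part_%03d.mp4")))
```

The segment muxer cuts on keyframes, so real segment lengths are approximate. The command list can be passed to `subprocess.run` once FFmpeg is installed.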

After working through all 12 segments, M3 delivered an impressive material list.

Timestamps are precise to the minute, with extremely detailed frame capture.

The scale-textured black leather jacket Jensen Huang was wearing, the close-up of him pulling the N1X chip from his pocket and holding it above his head for a full 15 seconds, and the joke about "probably 2,000 people pulling behind" as the Vera Rubin prototype was pushed onstage: all are included.

It didn't even miss the Chinese phrase "too much stuff" that Huang suddenly blurted out.

Even more striking, M3 also presented what it considers the three most impactful points, each accompanied by its own reasoning.

After confirming the material list, M3 began writing.

The opening begins with the scene of Huang rummaging through his pockets and closes on the larger point that "the owner of this entire supply chain is shifting from humans to agents."

A 3,500-word draft, delivered in 40 minutes.

Although it does not yet meet our publication standard, it is a sufficiently high-quality starting point.

Multimodality watched the two-hour video, long context loaded all the materials, writing guidelines, and sample essays into a single window, and the agent handled whatever obstacles came up.

M3's three core capabilities have been pushed to their absolute limits in this task; lacking any one of them would make it impossible.

12 model scores; M3 created a comprehensive overview chart on its own.

The third question shifts the focus away from long-form text and instead tests chart reading, online research, and engineering skills.

Each model release comes with a benchmark comparison chart, but the formats vary widely—some use tables, others bar charts or radar charts—and the data metrics are inconsistent.

To compare side by side, you have to manually flip through each page and match each cell—it’s extremely painful.

This time, we fed M3 ten benchmark screenshots taken from the official blogs of different models and from third-party evaluation platforms, and let it read all the charts on its own, go online to fill in missing data, standardize the metrics, and build an interactive comparison dashboard.

The instructions were: first, identify the model names and scores in each screenshot; normalize the data from charts in different formats without help; and for data missing from the screenshots, look it up and fill it in from official sources online.
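The normalization step can be sketched as below. This is only an illustration of the idea, not M3's output: the record layout, the unit labels, and the placeholder "Model X" are assumptions. The two M3 scores are the ones quoted in this article; the fraction form of the second is invented to show a unit conversion.

```python
from collections import defaultdict
from typing import Dict, List, Tuple


def to_percent(value: float, unit: str) -> float:
    """Convert a score to a 0-100 percentage, whatever form the chart used."""
    if unit == "percent":
        return float(value)
    if unit == "fraction":
        return float(value) * 100.0
    raise ValueError(f"unsupported unit: {unit}")


def build_table(records: List[dict]) -> Dict[str, Dict[str, float]]:
    """Merge per-chart records into one model -> benchmark -> score table."""
    table: Dict[str, Dict[str, float]] = defaultdict(dict)
    for r in records:
        table[r["model"]][r["benchmark"]] = to_percent(r["value"], r["unit"])
    return dict(table)


def missing_cells(table: Dict[str, Dict[str, float]],
                  benchmarks: List[str]) -> List[Tuple[str, str]]:
    """List the (model, benchmark) pairs that still need to be looked up."""
    return [(m, b) for m in sorted(table) for b in benchmarks if b not in table[m]]


records = [
    {"model": "M3", "benchmark": "GPQA Diamond", "value": 93.2, "unit": "percent"},
    {"model": "M3", "benchmark": "Long context", "value": 0.74, "unit": "fraction"},
    # "Model X" is a placeholder with a made-up score, used only to show a gap.
    {"model": "Model X", "benchmark": "GPQA Diamond", "value": 0.9, "unit": "fraction"},
]
table = build_table(records)
gaps = missing_cells(table, ["GPQA Diamond", "Long context"])
```

Here `gaps` holds the cells that would have to be filled from official sources before the dashboard can be drawn. Scores on other scales, such as the GDPval-AA points mentioned later, would need their own column and should not be forced into a percentage.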

The final output is a dark, interactive dashboard in the style of a Bloomberg Terminal.

12 models, 14 benchmarks—comprehensive rankings, radar chart comparisons, individual bar charts, and price/performance scatter plots—all in one interface.

Three capabilities, all maximized at once

After completing the three questions, the boundaries of M3’s capabilities are already clear. The next question is: how did it achieve this?

The answer is the simultaneous presence of three core capabilities: cutting-edge programming, a 1M context window, and native multimodality.

Their foundation is a novel attention architecture called MiniMax Sparse Attention (MSA).

With traditional attention, the computational load grows quadratically with context length, so at million-token scale it exhausts GPU memory and compute.

MSA removes this bottleneck with block-level sparsity.

At the operator level, it ensures that each block of KV data is read from memory only once, with completely contiguous memory access and no redundant data movement.
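To make block-level sparsity concrete, here is a minimal pure-Python sketch of the general idea for a single query vector. It is not MiniMax's MSA, whose details have not been published yet: the block-selection rule (score each block by its mean key, keep the `top_k` best) and all names are assumptions made for illustration.

```python
import math
from typing import List, Sequence, Tuple


def dot(a: Sequence[float], b: Sequence[float]) -> float:
    return sum(x * y for x, y in zip(a, b))


def block_sparse_attention(
    q: Sequence[float],
    keys: List[Sequence[float]],
    values: List[Sequence[float]],
    block_size: int,
    top_k: int,
) -> Tuple[List[float], List[int]]:
    """Attend one query to only the top_k highest-scoring key/value blocks."""
    n = len(keys)
    blocks = [range(s, min(s + block_size, n)) for s in range(0, n, block_size)]

    # Cheap proxy score per block: the query dotted with the block's mean key.
    def mean_key(block: range) -> List[float]:
        return [sum(keys[i][j] for i in block) / len(block) for j in range(len(q))]

    scores = [dot(q, mean_key(b)) for b in blocks]
    chosen = sorted(range(len(blocks)), key=lambda i: scores[i], reverse=True)[:top_k]
    chosen.sort()  # visit selected blocks in memory order, each exactly once

    # Standard softmax attention, restricted to positions in the chosen blocks.
    idx = [i for b in chosen for i in blocks[b]]
    scale = 1.0 / math.sqrt(len(q))
    logits = [dot(q, keys[i]) * scale for i in idx]
    m = max(logits)
    weights = [math.exp(l - m) for l in logits]
    z = sum(weights)
    out = [
        sum(w * values[i][j] for w, i in zip(weights, idx)) / z
        for j in range(len(values[0]))
    ]
    return out, chosen


keys = [[1.0, 0.0], [1.0, 0.0], [0.0, 1.0], [0.0, 1.0]]
values = [[1.0], [1.0], [0.0], [0.0]]
sparse_out, sparse_blocks = block_sparse_attention([10.0, 0.0], keys, values, 2, 1)
dense_out, dense_blocks = block_sparse_attention([10.0, 0.0], keys, values, 2, 2)
```

With `top_k` equal to the number of blocks, the function reduces to ordinary dense attention. With a smaller `top_k`, the exact attention pass touches only `top_k * block_size` positions per query instead of all of them; in this toy version the scoring pass still reads every key, which a real kernel would avoid with precomputed block summaries.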

The effect is dramatic.

At a 1-million-token context, M3 cuts the computation per token to just 1/20 of the previous generation's. Prefill is more than 9x faster, and decoding is more than 15x faster.

The multimodal side is equally aggressive. M3 is not merely a patched-together model that trains text first and then tacks on a vision module.

From the very first step of training, text, images, and videos were fed in together. To achieve this, the research team completely restructured the entire data pipeline and scaled up the pretraining size directly to the 100TB level.

As a result, M3 achieved the highest global ranking among open-source models on the Artificial Analysis Comprehensive Intelligence Index, placing seventh worldwide.

On the GPQA Diamond scientific reasoning leaderboard, M3 achieved 93.2%, ranking among the top four globally, surpassing Claude Opus 4.8 and Opus 4.7.

On the long-context reasoning leaderboard, M3 ranks in the top six with a score of 74.0%, closely competing with the GPT-5 series.

On the GDPval-AA Real Task Agent leaderboard, M3 ranks fifth globally with a score of 1,670, just 6 points behind Sonnet 4.6.

Each ranking evaluates different criteria, but M3 consistently holds a position at the threshold of the top tier for closed-source models and at the forefront of open-source models.


On the well-known third-party multimodal ranking Vals Index, M3 also reached sixth place globally.

This is the best result achieved by domestic open-source models to date and the highest global ranking among open-source models.

From an overall perspective, M3 has firmly surpassed Claude Sonnet 4.6.

Although it still lags slightly behind the strongest models, Opus 4.7 and GPT-5.5, there is no doubt that it has entered the top tier, where the competition is fiercest.

One agent isn't enough? Then bring in a team.

The next natural question is: what harness do you use to drive such a model?

In the previous real-world test, M3 used ffmpeg to cut the video and produced the output in 40 minutes, running on MiniMax Code.

But that was just a single agent working. The most exciting part of this upgrade is the Agent Team.

Anyone who has used AI programming tools has likely had this experience.

You assign the agent seven tasks, and it completes three before pausing to report, "I've finished tasks 1, 2, and 3. Should I continue?" Or, midway through, its tone suddenly changes: one moment it is a reliable engineer, the next it is talking nonsense.

To address this, the Agent Team separates the referee from the players.

The Leader is responsible for understanding objectives, breaking down tasks, and coordinating resources. The Worker is responsible for executing tasks, with different Workers having distinct tools and contexts. The Verifier is responsible for reviewing and approving work, specifically challenging the Worker’s output.

The Worker completes the task, and the Verifier reviews it for issues. If problems are found, the task is sent back, the Worker revises it based on the Verifier's feedback, and the result is checked again. This adversarial cycle does not rely on the model to decide when to stop; instead, it is governed by an underlying state machine engine.
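The Worker and Verifier cycle described above can be sketched as a small state machine. This is a hypothetical illustration, not MiniMax's engine: the state names, the `max_rounds` limit, and the callback signatures are all assumptions. The point it shows is that the loop, not the model, decides when to stop.

```python
from enum import Enum, auto
from typing import Callable, Optional, Tuple


class State(Enum):
    WORKING = auto()
    VERIFYING = auto()
    DONE = auto()
    FAILED = auto()


def run_task(
    task: str,
    worker: Callable[[str, Optional[str]], str],
    verifier: Callable[[str, str], Optional[str]],
    max_rounds: int = 3,  # assumed cap on work attempts
) -> Tuple[State, str, int]:
    """Drive one task through work and review until approval or the round cap."""
    state = State.WORKING
    feedback: Optional[str] = None
    output = ""
    rounds = 0
    while state not in (State.DONE, State.FAILED):
        if state is State.WORKING:
            output = worker(task, feedback)   # first attempt or a revision
            rounds += 1
            state = State.VERIFYING
        elif state is State.VERIFYING:
            feedback = verifier(task, output)  # None means approved
            if feedback is None:
                state = State.DONE
            elif rounds >= max_rounds:
                state = State.FAILED
            else:
                state = State.WORKING          # send back for revision
    return state, output, rounds


def demo_worker(task: str, feedback: Optional[str]) -> str:
    return "draft" if feedback is None else "draft with tests"


def demo_verifier(task: str, output: str) -> Optional[str]:
    return None if "tests" in output else "add tests"


result = run_task("write parser", demo_worker, demo_verifier)
```

In this toy run the Verifier rejects the first draft, the Worker revises it, and the second review approves it. A Verifier that never approves ends in `State.FAILED` after `max_rounds` attempts instead of looping forever.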

The most satisfying part in practice is that when you send a message, M3 confirms it instantly, while multiple background Workers are already running in parallel.

If you add a new request midway, such as "Also, could you check this for me?", the Leader responds immediately while the background tasks continue uninterrupted.

It’s like a colleague who instantly replies to your WeChat messages while also helping you get things done.
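The "reply at once, keep working in the background" behavior can be sketched with `asyncio`. This is an assumed illustration of the pattern, not MiniMax Code's implementation; the function names and the simulated 0.01-second workload are hypothetical.

```python
import asyncio
from typing import List, Tuple


async def background_worker(request: str, results: List[str]) -> None:
    """Simulate a slow task; a real Worker would call tools here."""
    await asyncio.sleep(0.01)
    results.append(f"done: {request}")


async def leader(requests: List[str]) -> Tuple[List[str], List[str]]:
    """Acknowledge every request immediately while work runs concurrently."""
    acks: List[str] = []
    results: List[str] = []
    tasks = []
    for request in requests:
        # Schedule the work without awaiting it, so the Leader is not blocked.
        tasks.append(asyncio.create_task(background_worker(request, results)))
        # The acknowledgement is recorded before any Worker has finished.
        acks.append(f"ack: {request}")
    await asyncio.gather(*tasks)  # wait for all background work at the end
    return acks, results


acks, results = asyncio.run(leader(["write the report", "also check this"]))
```

Because `asyncio.create_task` only schedules the coroutine, both acknowledgements are recorded before either simulated Worker completes, which mirrors the Leader staying responsive while tasks run.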

M3's model capabilities combined with the MiniMax Code Agent Team, one responsible for thinking and the other for doing, open up a great deal of room for imagination.

After the storm has passed, everyone's attention has finally returned to M3 itself.

And now, the crucial next step: its weights and the full technical report will be open-sourced within ten days.

At that time, developers worldwide will rate it using real-world projects.


MiniMax M3 Ranks Highest Among Open-Source Models, Igniting Debate in the Chinese Community
