AI Trading Performance Mixed in Real Market Tests

AI knows a lot, but it is currently "not reliable."

Author and source: Yang Xia, Yang Xia's Everything Shop

As some of you know, I have lately been researching and building an Agent Trading (AT) tool.

After trying numerous AI trading methods, tools, and platforms, and burning through hundreds of millions of tokens,

I have one core insight:

AI knows a lot, but it is currently "not reliable."

I know that many people picked up all sorts of financial skills during the earlier lobster craze,

excitedly preparing to dominate the market.

Then the noise gradually faded, the lobster was delisted, and it now sells for 14 yuan per pound.

When it comes to building trustworthy, executable, and iterable trading agents in real capital markets,

the pitfalls I've hit over the past few months could fill 100,000 words of firsthand experience.

But let's set that aside for today.

Recently, while building knowledge on the AT architecture, I came across a paper that I think is well worth sharing with you.

Everyone is currently immersed in the paradise lost of AI trading, and it is plain to see that AI will take a full part in investing in the future.

The authors of "AI-TRADER: BENCHMARKING AUTONOMOUS AGENTS IN REAL-TIME FINANCIAL MARKETS" propose the AI-Trader framework to evaluate the financial decision-making of leading LLMs in a fully autonomous, real-time environment free of data contamination.

In simple terms, it’s about testing how well the AI performs in stock trading.

The experiment selected three asset pools: Nasdaq-100 components listed in the U.S. market, SSE 50 components listed in China, and the top 10 mainstream cryptocurrency assets, supporting hourly trading frequency (for U.S. stocks) and daily trading frequency (for Chinese stocks and cryptocurrencies).

Different AI models, encapsulated within the same agent, can use MCP to retrieve news, information, financial reports, and market data, autonomously performing sentiment extraction, numerical calculations, and executing trading instructions.
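To make the "same agent, different model" idea concrete, here is a minimal sketch of how such a harness could look. It is not the paper's code: the tools are plain Python functions standing in for MCP tools, and every name in it (`Agent`, `fake_news`, `cautious_model`, and so on) is hypothetical.

```python
from dataclasses import dataclass
from typing import Callable, Dict

# Hypothetical stand-ins for MCP tools: plain callables keyed by tool name.
Tool = Callable[[str], str]
# A "model" maps (symbol, gathered context) to a free-text decision.
Model = Callable[[str, Dict[str, str]], str]

VALID_ACTIONS = {"buy", "sell", "hold"}


@dataclass
class Agent:
    """One shared agent loop; only the model is swapped between runs."""
    model: Model
    tools: Dict[str, Tool]

    def decide(self, symbol: str) -> str:
        # Every model sees context gathered through the same tool set.
        context = {name: tool(symbol) for name, tool in self.tools.items()}
        action = self.model(symbol, context).strip().lower()
        # Anything that is not a recognized action falls back to "hold".
        return action if action in VALID_ACTIONS else "hold"


# Invented tools and models, for illustration only.
def fake_news(symbol: str) -> str:
    return f"{symbol}: new tariffs announced"


def fake_price(symbol: str) -> str:
    return "100.0"


def cautious_model(symbol: str, context: Dict[str, str]) -> str:
    return "sell" if "tariffs" in context["news"] else "hold"


def chatty_model(symbol: str, context: Dict[str, str]) -> str:
    return "I think we should probably buy"  # not a valid action


tools = {"news": fake_news, "price": fake_price}
models = {"cautious": cautious_model, "chatty": chatty_model}
results = {name: Agent(model, tools).decide("AAPL") for name, model in models.items()}
print(results)
```

Because the tools and the loop are identical across runs, any difference in the results can be attributed to the model alone, which is the point of wrapping all six models in one agent.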

There were six participants (the test predates the release of DS-V4):

• DeepSeek-v3.1

• MiniMax-M2

• Claude-3.7-Sonnet

• GPT-5

• Qwen3-Max

• Gemini-2.5-Flash

From November 25 to November 7, the real market competition took place, and the results are as follows:

MiniMax-M2 took first place in both U.S. stocks (hourly) and A-shares (daily).

DeepSeek-v3.1 took first place in the cryptocurrency category.

Yet, cruelly,

Most models perform poorly in real markets, yielding low returns and weak risk management.

These flaws do not show up in the standard benchmark evaluations of the major models.

The same model performs very differently across different markets.

For example, the champion, MiniMax-M2, pursues returns in the U.S. stock market but shifts to a defensive stance in the A-share market (low volatility, low drawdown), which suggests that it has picked up the differences between the two markets from its training data.

In U.S. stocks, multiple models can outperform QQQ.

In the A-share market, none outperformed the SSE 50. Even Warren Buffett or the most powerful AI would have to bow down in China's A-share market.

Even DeepSeek, born and bred with a quantitative lineage,

performs well in U.S. stocks and crypto but still can't hold its own in A-shares.

Gemini averages 3.79 trades in the U.S. market, but in China's A-share market that figure shoots up to 4.74. Well, when in Rome, do as the Romans do.
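The comparisons above rest on a few standard measures: cumulative return, volatility, maximum drawdown, and excess return over a benchmark such as QQQ or the SSE 50. Here is a minimal sketch of how they can be computed from a series of portfolio values. The two series are invented for illustration and are not results from the paper.

```python
import math
from typing import List


def cumulative_return(values: List[float]) -> float:
    """Total return from the first to the last portfolio value."""
    return values[-1] / values[0] - 1


def max_drawdown(values: List[float]) -> float:
    """Largest peak-to-trough decline, as a fraction of the peak."""
    peak, worst = values[0], 0.0
    for v in values:
        peak = max(peak, v)
        worst = max(worst, (peak - v) / peak)
    return worst


def volatility(values: List[float]) -> float:
    """Sample standard deviation of period-over-period returns."""
    rets = [b / a - 1 for a, b in zip(values, values[1:])]
    mean = sum(rets) / len(rets)
    return math.sqrt(sum((r - mean) ** 2 for r in rets) / (len(rets) - 1))


# Invented series: an aggressive agent versus a steadier benchmark.
agent = [100.0, 104.0, 98.0, 103.0, 110.0]
benchmark = [100.0, 101.0, 102.0, 104.0, 106.0]

excess = cumulative_return(agent) - cumulative_return(benchmark)
print(f"excess return: {excess:.2%}")
print(f"agent max drawdown: {max_drawdown(agent):.2%}")
print(f"agent vol: {volatility(agent):.4f}, benchmark vol: {volatility(benchmark):.4f}")
```

In this toy data the agent beats the benchmark on return but pays for it with higher volatility and a deeper drawdown, which is why a return figure alone says little about risk management.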

The paper does contain some success stories.

For example, on October 10, DeepSeek used the Search tool to retrieve news about Trump's announcement of additional tariffs on China, inferred heightened risk for technology stocks, and executed a defensive strategy:

Cut technology stock allocation from 99% to 70%.

Added consumer staples (PEP) and utilities (AEP).

Held 17.3% in cash.

Reduced losses and outperformed most models.
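The arithmetic behind that rebalance is easy to check. With 70% in technology and 17.3% in cash, 12.7% is left for the defensive names, assuming the portfolio holds nothing else. How that remainder was split between PEP and AEP is not stated, so the equal split below is purely an assumption, and `defensive_weights` is a hypothetical helper.

```python
from typing import Dict, List


def defensive_weights(tech: float, cash: float, defensives: List[str]) -> Dict[str, float]:
    """Spread whatever is left after tech and cash equally over defensive tickers.

    The equal split is an illustrative assumption, not the agent's actual allocation.
    """
    remainder = 1.0 - tech - cash
    if remainder < 0:
        raise ValueError("tech and cash weights exceed 100%")
    weights = {"TECH": tech, "CASH": cash}
    weights.update({ticker: remainder / len(defensives) for ticker in defensives})
    return weights


weights = defensive_weights(tech=0.70, cash=0.173, defensives=["PEP", "AEP"])
for name, w in weights.items():
    print(f"{name}: {w:.2%}")
```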

But DeepSeek also made a mistake common to AI systems everywhere:

It got burned by a single source.

It took in news of a "structural bull market" without cross-verifying it,

wrongly added positions in traditional energy and banking stocks, and missed the market's main upward trend.

This exposed the agent's shortcomings in information verification and dynamic error correction.
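One simple guard against the single-source trap is to act only on claims reported by at least two independent sources. The sketch below is my own illustration, not something from the paper. It matches claims by exact string, which is a big simplification (a real system would need to judge whether two articles say the same thing), and the source names are made up.

```python
from dataclasses import dataclass
from typing import Dict, Iterable, Set


@dataclass(frozen=True)
class NewsItem:
    source: str  # who reported it
    claim: str   # normalized claim text (exact-match only in this sketch)


def corroborated_claims(items: Iterable[NewsItem], min_sources: int = 2) -> Set[str]:
    """Return the claims reported by at least `min_sources` distinct sources."""
    sources_by_claim: Dict[str, Set[str]] = {}
    for item in items:
        sources_by_claim.setdefault(item.claim, set()).add(item.source)
    return {c for c, srcs in sources_by_claim.items() if len(srcs) >= min_sources}


# Invented feed: one claim with two sources, one repeated by a single source.
feed = [
    NewsItem("wire_a", "new tariffs on China"),
    NewsItem("wire_b", "new tariffs on China"),
    NewsItem("blog_x", "structural bull market"),
    NewsItem("blog_x", "structural bull market"),
]
actionable = corroborated_claims(feed)
print(actionable)
```

Note that the repeated "structural bull market" item still counts as one source, so repetition alone never makes a claim actionable.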

In an environment with a well-designed information interface and aligned data, AI does not commit "hallucination" errors in the general sense.

The real "practical flaws" lie elsewhere:

misanalysis (acting on false information),

frequent trading (ineffective transactions),

or failed risk control (stepping on a landmine).

These are also several inherent limitations I’ve personally experienced in my AI experiments over the past few months.

However, solutions exist for these issues.
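As one example of what such a solution can look like, here is a minimal sketch of two guardrails placed between the model and the order book: a daily trade cap against over-trading, and a drawdown stop against failed risk control. This is my own illustration under assumed thresholds, not the paper's method. The `Guardrails` class and its limits are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Guardrails:
    """Blocks new trades when the daily cap or the drawdown limit is hit."""
    max_trades_per_day: int = 3   # arbitrary illustrative cap
    max_drawdown: float = 0.10    # arbitrary illustrative limit (10%)
    trades_today: int = 0
    peak_value: float = 0.0

    def allow_trade(self, portfolio_value: float) -> bool:
        if portfolio_value <= 0:
            raise ValueError("portfolio value must be positive")
        # Track the running peak to measure the current drawdown.
        self.peak_value = max(self.peak_value, portfolio_value)
        drawdown = (self.peak_value - portfolio_value) / self.peak_value
        if drawdown >= self.max_drawdown:
            return False  # risk control: stop trading after a deep loss
        if self.trades_today >= self.max_trades_per_day:
            return False  # throttle: no more trades today
        self.trades_today += 1
        return True

    def new_day(self) -> None:
        self.trades_today = 0


guard = Guardrails(max_trades_per_day=2, max_drawdown=0.10)
day_one = [guard.allow_trade(v) for v in [100.0, 102.0, 101.0, 101.0]]
guard.new_day()
after_crash = guard.allow_trade(90.0)      # about 11.8% below the 102 peak
after_recovery = guard.allow_trade(100.0)  # about 2% below the peak
print(day_one, after_crash, after_recovery)
```

The key design choice is that the checks live outside the model: the agent can propose whatever it likes, but the limits are enforced in ordinary code that does not depend on the model behaving.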

The paper's authors have also set up a website dedicated to tracking and running follow-up experiments in human-machine trading collaboration.

You can also directly install their pre-built skill to join a trading competition.