Artificial intelligence is knocking on the doors of Wall Street trading rooms, but its track record so far is not impressive.
Early results from a series of public trading competitions show that major large language models (LLMs) generally perform poorly in autonomous trading—most systems incur losses, engage in excessive trading, and make drastically different decisions when given the same instructions. These findings raise a central question: how deep is the gap between LLMs and the actual workings of real markets?
The most representative case comes from the Alpha Arena competition operated by the tech startup Nof1. The competition pitted eight cutting-edge AI systems—including Anthropic’s Claude, Google’s Gemini, OpenAI’s ChatGPT, and Elon Musk’s Grok—against each other in four separate rounds. Each AI system received $10,000 in starting capital before each round and was tasked with autonomously trading U.S. technology stocks over a two-week period. In the end, the overall portfolio suffered a loss of approximately one-third, with only six out of 32 outcomes turning a profit.
Jay Azhang, founder of Nof1, bluntly stated: "It's not yet feasible to directly hand over money to an LLM and let it trade on its own."
Competition Results: Losses, Overtrading, and Decision Disagreements
Data from Alpha Arena reveals multiple shortcomings of current LLMs in trading scenarios. Under the same prompt, Alibaba’s Qwen executed 1,418 trades in a single competition round, while the top-performing Grok 4.20 placed only 158 orders. Grok’s best performance occurred in the round where it was able to observe its competitors’ actions.
The AI blog Flat Circle tracked 11 market-related arenas and found that at least one model turned a profit in every arena, but in only two of them did the median model post positive returns, suggesting that most models struggle to beat the market.
The differences in decision-making among the models are also noteworthy. According to Azhang, in the latest round of testing at Alpha Arena, Claude tended to take long positions, Gemini showed no reluctance toward shorting, and Qwen was eager to use high leverage to take on risk. "Each has its own 'personality'—managing them is almost like managing a human analyst," said Doug Clinton, head of Intelligent Alpha, which operates an LLM-driven fund. Informing the models of their existing biases can, to some extent, improve outcomes.
Capability Boundaries: LLMs Excel at Research but Not at Market Timing
Jay Azhang noted that while LLMs have advantages in researching and invoking the correct tools, they suffer from systemic shortcomings in trade execution: they lack understanding of the relative weights of numerous factors affecting stock prices—such as analyst ratings, insider trading, and sentiment shifts—leading to issues like poor timing of trades, inappropriate position sizing, and excessive trading frequency.
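To make "inappropriate position sizing" concrete, here is a minimal sketch of one conventional discipline, volatility-targeted sizing. The function and its parameters are illustrative assumptions, not anything used by the competitors.

```python
def position_size(capital: float, target_vol: float, asset_vol: float,
                  max_leverage: float = 2.0) -> float:
    """Dollar exposure that shrinks a position as the asset gets riskier.

    target_vol / asset_vol is the leverage needed to hit the desired
    portfolio volatility; the cap stops a quiet market from inviting
    runaway leverage. (Illustrative rule, not from the article.)
    """
    if asset_vol <= 0:
        return 0.0
    leverage = min(target_vol / asset_vol, max_leverage)
    return capital * leverage

# Volatile asset -> small position; quiet asset -> position capped at 2x.
print(position_size(10_000, target_vol=0.10, asset_vol=0.40))  # 2500.0
print(position_size(10_000, target_vol=0.10, asset_vol=0.05))  # 20000.0
```

A conventional quant system applies a rule like this mechanically on every order; the shortcoming described above is that the arena models had no comparable discipline and had to improvise sizing from a prompt.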
Intelligent Alpha’s benchmarking offers a more encouraging reference point. The test gave ten AI models access to financial statements, analyst forecasts, earnings call transcripts, macroeconomic data, and web search, and asked them to predict the direction of upcoming earnings. OpenAI’s ChatGPT called the direction correctly 68% of the time for the fourth quarter of 2025, its best result to date. Clinton noted that overall performance has improved with each new model release.
Methodological Dilemma: Backtesting Fails; Live Testing Becomes the Only Option
Evaluating AI trading capabilities faces a fundamental methodological obstacle. Traditional quantitative strategies rely on historical backtesting to validate effectiveness, but this framework is nearly useless for LLMs—a model asked how to trade the March 2020 market conditions already "knows" how that period unfolded. This contamination, known as lookahead bias, forces researchers to evaluate AI solely through live market performance, giving rise to the proliferation of current benchmarks and arenas.
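A toy backtest makes the contamination concrete; the return series and both strategies below are invented purely for illustration.

```python
# Toy daily returns: a crash followed by a rebound (invented data).
returns = [-0.05] * 10 + [0.03] * 10

def backtest(signal):
    """Compound returns over the days on which signal[i] says to be long."""
    equity = 1.0
    for r, is_long in zip(returns, signal):
        if is_long:
            equity *= 1 + r
    return equity

# Contaminated strategy: "decides" using the same day's return, i.e. it
# already knows how the period unfolds -- the essence of lookahead bias.
lookahead = [r > 0 for r in returns]

# Honest strategy: goes long only after seeing a positive previous day.
honest = [False] + [returns[i - 1] > 0 for i in range(1, len(returns))]

print(backtest(lookahead))  # sidesteps the crash, catches every up day
print(backtest(honest))     # avoids the crash too, but misses the turn
```

An LLM trained on data through a historical period effectively holds the "lookahead" signal for that period, so a backtest over it flatters the model the same way; live, out-of-sample trading of the kind the arenas run is the only honest measurement.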
Jim Moran, who writes the Flat Circle blog and co-founded the alternative data provider YipitData, believes most current public experiments are too short-lived and too noisy to support definitive conclusions. The arenas also carry inherent handicaps, such as no access to proprietary equity research and lower execution quality. "If you were to transplant one of these AI agents into a top-tier hedge fund, its performance would likely be much better," he said.
Industry Outlook: Truly Effective Strategies May Quietly Disappear from Public View
Alexander Izydorczyk, former Head of Data Science at Coatue Management and now at NX1 Capital, recently wrote that none of the AI trading bots he tracks has demonstrated a sustained ability to generate alpha. He argues the arenas are limited because the practical quantitative techniques used by secretive trading firms are absent from the models’ training data.
However, Izydorczyk also offered a thought-provoking caveat: "Beginners sometimes see things that experts miss." As he wrote on his personal blog, "You won’t hear anything immediately when LLM-powered trading strategies truly start to work."
Nof1 is preparing for Season 2 of Alpha Arena, planning to equip each AI model with web search capabilities, extended reasoning time, access to more data sources, and multi-step execution abilities. However, the company’s core business model is to provide retail traders with system tools to build AI trading agents—rather than directly deploying AI onto trading floors. This positioning itself may be the most pragmatic commentary on the current capabilities of AI in trading.
