The Monexus · Opinion

LLM trading models just flunked the longest test in the market

A two-decade backtest published this week finds large-language-model trading strategies rarely beat a simple buy-and-hold. The result is less a verdict on AI than on the strange psychology of the people deploying it.

By Monexus Staff Writer · markets · 5-minute read · 25 Jun 2026

The number that should embarrass the machine-learning crowd arrived on 25 June 2026, carried by a single-line wire bulletin at 19:51 UTC: a fresh study, the bulletin reported, found that LLM-based trading strategies mostly failed to outperform a simple buy-and-hold approach over a twenty-year horizon. The framing matters. Two decades is roughly the working life of an asset. It is also long enough to swallow two crashes, one pandemic, one rate-hike cycle measured in basis-point torrents, and the entire smartphone era. If an artificial-intelligence trading system cannot beat an index fund across that window, the failure is not a bug. It is the product.

The temptation, reading the headline, is to declare the technology overhyped. That reading is too easy and probably wrong. The more interesting story is what the result reveals about the people, not the tools — about the strange pull that automation exerts on retail and professional investors alike, and about the way a generation of market participants has come to confuse verbosity with edge.

What the study is actually testing

A buy-and-hold benchmark is the most undemanding yardstick in finance. Buy a broad index at the start of the period. Do nothing. Do not rebalance. Do not panic in March 2020. Do not chase Nvidia in the summer of 2023. Collect whatever the underlying earns. The study, summarised in the 19:51 UTC wire, asks a precise question: across two decades, did strategies that routed decisions through large language models do better than that?

Mostly, no. That is a striking finding because the inputs available to an LLM are genuinely vast: decades of filings, transcripts, central-bank statements, satellite imagery of car parks, sentiment-scored news flow. A system that can read every 10-K ever filed and still loses to a passive index is not suffering from an information shortage. It is suffering from an attention problem — too many signals, too many plausible narratives, too much incentive to act when doing nothing is the optimal move.

The wire names no specific products, and it does not specify the methodology, the asset class, or the sample of strategies tested. That gap matters and is worth flagging before the result calcifies into folklore. A backtest across equities is not a backtest across credit. A strategy that uses an LLM to summarise earnings calls is not the same animal as a system that asks a model to size positions directly. Until the underlying paper is read in full, the headline should be read as a directional signal, not as a verdict.

The cognitive trap the technology sets

The deeper issue is psychological. Markets reward, with punitive clarity, the investors who act least. The biggest single determinant of long-run retail returns is not stock selection, not timing, not factor exposure. It is the willingness to sit still while a portfolio compounds. Anything that increases the surface area of decisions — more dashboards, more signals, more frequent rebalancing prompts — tends, on average, to lower returns. This is not a controversial claim; it has been documented in retail-brokerage data for at least a decade.

Large language models are, among other things, decision-generation engines. They produce fluent, confident, structured reasons to act. A model that summarises the day's macro news will, by default, suggest a tilt. A model that reads an earnings transcript will, by default, surface a thesis. A model that watches a Fed press conference will, by default, recommend a duration call. Each suggestion is plausible. None is free. The cost of acting on any of them is the transaction, the tax event, the opportunity cost of the position closed to fund it, and the small but compounding probability that the investor who acts more often is also the investor who acts worst when it matters.
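The arithmetic of that cost can be sketched in a few lines. The figures below are hypothetical and chosen only for illustration (a 7% gross annual return, a 0.25% all-in cost per full portfolio turnover, four turnovers a year); none of them come from the study or the wire, and the function name is invented for this example.

```python
def terminal_wealth(gross_annual_return, turnovers_per_year, cost_per_turnover,
                    years, start=1.0):
    """Compound starting wealth at the gross return minus annual trading costs.

    Simplifying assumption: costs are a fixed fraction of the portfolio per
    turnover and are deducted once a year. Taxes and opportunity cost are
    ignored, so real-world drag would likely be larger.
    """
    net_annual_return = gross_annual_return - turnovers_per_year * cost_per_turnover
    return start * (1.0 + net_annual_return) ** years


# Hypothetical inputs, for illustration only.
YEARS = 20
GROSS = 0.07

passive = terminal_wealth(GROSS, turnovers_per_year=0, cost_per_turnover=0.0025,
                          years=YEARS)
active = terminal_wealth(GROSS, turnovers_per_year=4, cost_per_turnover=0.0025,
                         years=YEARS)

print(f"buy-and-hold terminal wealth: {passive:.2f}x")
print(f"active terminal wealth:       {active:.2f}x")
print(f"share of final wealth lost to costs: {1 - active / passive:.1%}")
```

Under these assumptions the active investor earns the same gross return and still ends the period with visibly less, purely because a one-percentage-point annual drag compounds for twenty years.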

This is the structural point the headline obscures. The AI did not fail the market. The deployment pattern failed the market. Tools that lower the cost of producing a view will, in aggregate, raise the rate of view-production. In a market where the marginal trade is, on average, a value-destroying trade, that is a serious indictment.

The structural read

Two broader patterns are worth naming. First, the entire edifice of quantitative and now machine-learning trading rests on a premise: that systematic execution can extract returns that discretionary execution cannot. The premise is correct at the institutional scale, where it has been demonstrated for thirty years. It is much shakier at the retail scale, where transaction costs, behavioural slippage, and capital-base constraints erase the edge before it can be collected. The new study sits squarely inside that asymmetry. The technology may work. The user base may not.

Second, this result lands at a peculiar moment in market narrative. US GDP growth was revised sharply higher to 2.1% for the first quarter on the same day the study was reported, in a 16:15 UTC wire item that, in another news cycle, would have been the day's economic story. The juxtaposition is instructive. The macro environment is benign enough to forgive a great deal of inattention; it is not so benign that an LLM-augmented retail trader has been printing alpha. The gap between a healthy economy and a trader's P&L is, as ever, the trader's own behaviour.

There is also a quieter reading worth considering. Perhaps the LLM strategies that did outperform did so quietly, in front-office risk books that do not advertise. Perhaps the strategies that lost were the ones whose results got press because they were the ones being sold. The dataset may be skewed by selection. The wire does not say.

What it means for the next twelve months

Three practical implications follow, and none of them require anyone to denounce AI. First, the buy-and-hold default remains the correct starting point for any investor who is not running a professional book. Second, any deployment of an LLM in a trading workflow should be measured against a passive benchmark over a multi-year window before anyone is allowed to call it a source of edge. Third, the firms selling AI trading tools to retail will face, fairly soon, a disclosure question: what is your backtested Sharpe ratio over twenty years, and what fraction of your customers achieve it in live trading? The honest answer to the second half is currently small enough to be uncomfortable.
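For readers who want the second and third points in concrete form, here is a minimal sketch of the comparison. It assumes monthly return series and a zero risk-free rate by default; the function names and the sample numbers are hypothetical, and the wire does not say how the study itself measured performance.

```python
import math
import statistics


def cumulative_return(period_returns):
    """Total return from compounding a sequence of per-period returns."""
    wealth = 1.0
    for r in period_returns:
        wealth *= 1.0 + r
    return wealth - 1.0


def annualized_sharpe(period_returns, risk_free_per_period=0.0, periods_per_year=12):
    """Mean excess return over its sample standard deviation, annualized.

    Uses the common sqrt(periods_per_year) scaling, which assumes returns
    are roughly independent from one period to the next.
    """
    excess = [r - risk_free_per_period for r in period_returns]
    volatility = statistics.stdev(excess)
    if volatility == 0:
        raise ValueError("Sharpe ratio is undefined for zero-volatility returns")
    return statistics.mean(excess) / volatility * math.sqrt(periods_per_year)


def beats_benchmark(strategy_returns, benchmark_returns):
    """True if the strategy compounds to more than the passive benchmark."""
    return cumulative_return(strategy_returns) > cumulative_return(benchmark_returns)


# Hypothetical monthly returns, net of costs, for illustration only.
strategy = [0.02, -0.03, 0.01, 0.04, -0.02, 0.00]
benchmark = [0.01, 0.00, 0.01, 0.02, -0.01, 0.01]

print("strategy Sharpe: ", round(annualized_sharpe(strategy), 2))
print("benchmark Sharpe:", round(annualized_sharpe(benchmark), 2))
print("beats buy-and-hold:", beats_benchmark(strategy, benchmark))
```

Six months of made-up data prove nothing, which is the point of insisting on a multi-year window: the same two functions only become informative when fed a long history of live, after-cost returns.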

The new study, in other words, is less a funeral for machine learning in finance than a quiet rebuke to the way it is being marketed. The technology will keep getting better. The deployment pattern will keep generating trades. The gap between the two is where the returns used to be.

Desk note: this publication read the 25 June wire as a story about investor psychology first and algorithmic capability second — a framing the wire itself left under-developed.

Wire provenance

This editorial synthesis draws on the following public wire/social posts:

https://x.com/polymarket/status/