Former xAI researcher Ethan He revealed the true cost structure of video AI training: storing one billion videos requires 5 PB of storage, with monthly storage costs exceeding $100,000; compressed feature data is comparable in size to the original videos, bringing combined monthly storage costs above $200,000; data transfer fees even exceed storage costs. The total estimated data cost alone reaches millions of dollars per month, not including GPU compute expenses. The author notes that the competitive moat for video models lies not in algorithms but in infrastructure—this barrier locks competition into the hands of only a few players, making the industry landscape similar to that of semiconductor wafer fabs.
Author and source: Astronaut Monkey
Regarding AI’s massive spending, the industry is rife with astonishing figures: xAI has spent over $1 billion to build the Colossus supercomputing cluster; OpenAI’s monthly compute bills are reportedly in the hundreds of millions of dollars; and the funds Anthropic has raised in recent financing rounds are almost publicly equated with “GPU hours.”
Almost everyone is talking about compute. GPUs have become the universal currency for measuring an AI company’s strength and the most prominent number in every funding announcement.
But recently, I listened to an episode of the Latent Space podcast featuring Ethan He, a former researcher at xAI. When Ethan joined xAI in mid-2025, he faced a blank slate: no infrastructure, no data, and no pre-existing models. Yet within three months, with a small team, he built the Grok Imagine video generation system from scratch and brought it to an industry-leading level at the time.
When discussing the training cost of large video models, he shared a set of figures that made me realize the industry may have been doing its accounting wrong all along.
Simply storing and moving these videos and their feature data costs millions of dollars per month, before counting any compute costs.
Hidden fees on your bill
How much does it cost to go from zero to one in training a large video model? Even if you assume your team has unlimited access to GPU compute power, you might still underestimate the enormous cost involved.
Suppose you want to train a world-class video generation model and scrape 1 billion videos from the web, with each averaging 5 MB—that’s already a very conservative estimate. Just for this, you would need 5 petabytes of storage. At AWS S3 pricing, 5 PB of standard storage costs approximately $100,000 per month.
But that covers only the raw videos.
Before training video models, the industry standard practice is to first compress videos into latent space feature vectors using a VAE (Variational Autoencoder)—because a video expanded into pixels can contain billions of tokens, which no Transformer can handle; it must first be compressed into continuous vectors that the model can process.
The issue is that this compressed feature data is comparable in size to the original videos and must be kept in storage long-term so that it can be read on demand.
Combined, the two amount to roughly 10 petabytes, resulting in monthly storage costs exceeding $200,000.
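The storage arithmetic above can be reproduced in a few lines. This is only a rough sketch: the flat rate of $0.021 per GB-month is an assumed, illustrative price (not a quoted AWS figure), decimal units are used throughout, and the latent features are taken to be exactly as large as the raw videos, which the article only describes as "comparable."

```python
# Back-of-the-envelope storage estimate for the figures quoted in the article.
# ASSUMPTION: PRICE_PER_GB_MONTH is an illustrative flat rate, not a quoted price.
NUM_VIDEOS = 1_000_000_000      # one billion videos (from the article)
AVG_VIDEO_MB = 5                # average size per video (from the article)
PRICE_PER_GB_MONTH = 0.021      # assumed storage price in dollars


def monthly_storage_cost(total_gb: float, price_per_gb_month: float) -> float:
    """Return the monthly cost in dollars of keeping total_gb in storage."""
    return total_gb * price_per_gb_month


raw_gb = NUM_VIDEOS * AVG_VIDEO_MB / 1_000   # MB -> GB, decimal units
raw_pb = raw_gb / 1_000_000                  # GB -> PB, decimal units

# ASSUMPTION: latent features take exactly as much space as the raw videos.
latent_gb = raw_gb

raw_cost = monthly_storage_cost(raw_gb, PRICE_PER_GB_MONTH)
combined_cost = monthly_storage_cost(raw_gb + latent_gb, PRICE_PER_GB_MONTH)

print(f"raw videos: {raw_pb:.1f} PB, about ${raw_cost:,.0f} per month")
print(f"raw + latents: about ${combined_cost:,.0f} per month")
```

Under these assumptions the raw videos come to 5 PB, and both cost figures land slightly above the $100,000 and $200,000 thresholds mentioned in the text.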
Then comes the most surprising item: data transfer (egress/ingress) fees.
Ethan said that the bandwidth cost to download one billion videos from the internet on AWS is higher than the cost of storing those videos. Each training run requires pulling the data from the storage layer to the compute layer. Training video models isn’t like training language models, where you train once and finish—it requires iteration, hyperparameter tuning, and testing different data ratios, and each experiment means processing the entire dataset all over again. The more experiments you run, the more this cost multiplies.

Overall, Ethan estimates that the data component alone costs several million dollars per month, before any GPU expenses.
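To see why the bill grows with the number of experiments, here is a minimal cost model. It is a sketch under loud assumptions, not Ethan's own calculation: the class `DataCostModel`, both per-GB prices, and the idea that each experiment pulls the full 5 PB of latent features once are all hypothetical placeholders chosen for illustration.

```python
from dataclasses import dataclass


@dataclass
class DataCostModel:
    """Hypothetical monthly data-cost model: fixed storage plus per-pass transfer."""
    stored_gb: float        # everything kept in storage (raw videos + latents)
    pass_gb: float          # data pulled from storage for one full training pass
    storage_price: float    # dollars per GB-month (assumed)
    transfer_price: float   # dollars per GB moved (assumed)

    def monthly_cost(self, full_passes_per_month: int) -> float:
        """Storage is paid once per month; transfer is paid again on every pass."""
        storage = self.stored_gb * self.storage_price
        transfer = self.pass_gb * self.transfer_price * full_passes_per_month
        return storage + transfer


# ASSUMPTIONS: 10 PB stored, 5 PB read per pass, placeholder prices.
model = DataCostModel(
    stored_gb=10_000_000,
    pass_gb=5_000_000,
    storage_price=0.021,
    transfer_price=0.02,
)

for passes in (0, 1, 5, 10):
    print(f"{passes:>2} full passes: about ${model.monthly_cost(passes):,.0f} per month")
```

The exact totals depend entirely on the placeholder prices. The point is the shape: storage is a fixed monthly term, while the transfer term scales linearly with how many times the dataset is re-read, which is why faster iteration pushes the data bill from hundreds of thousands toward millions.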
I’ve never seen any AI industry report break down this cost in detail.
Unaffordable bandwidth fees
Don't companies like xAI, which build their own Colossus data centers, save a significant amount on storage and bandwidth?
Ethan's response was straightforward: "Of course, it saved a lot."
Behind this statement lies a structural secret in the video AI industry that is rarely discussed.
Training data for large language models consists of text, which is relatively lightweight, and once training is complete, the original data has largely fulfilled its purpose—you don’t need to repeatedly fetch the full corpus for inference or fine-tuning. However, video data is entirely different: it is orders of magnitude larger in size, and each training experiment requires processing the entire dataset from start to finish.
The faster the iteration speed, the higher the cost of data movement; Ethan repeatedly emphasized that iteration speed is precisely the most critical variable in video model development.
This creates a mutually reinforcing dilemma: you need rapid iteration to improve model quality, but rapid iteration means frequently moving data—and frequently moving data on a public cloud will overwhelm your bill.
Ethan’s own journey is a case in point. While building the Cosmos world model at NVIDIA, he realized that video models follow “scaling laws” similar to those of language models and still have significant room for improvement. On the surface, the choice he faced was “I need more GPUs,” but an equally critical, unspoken need was a place where storing and moving data would not run up an AWS bill. This was one of the primary reasons he joined xAI, and Colossus gave him that environment.
For teams without their own infrastructure, how does the math add up? Monthly data costs of several million dollars, layered on top of GPU compute, mean that even with a world-class algorithm team and sufficient funding, a team relying on the public cloud is racing against competitors with self-built data centers while footing an open-ended bill.
A startup with excellent algorithms cannot cross this barrier on “technological superiority” alone.
The moat for video models is not the model itself.
This reminds me of an interesting comparison.
In the field of large language models, the competition between open-source and closed-source models has been fierce; the emergence of the Llama series has enabled many small teams to develop competitive language models, even forcing OpenAI and Anthropic to continuously lower their API prices. In video generation, however, the landscape is markedly different: only well-resourced teams, such as those behind Sora, Veo, and Kling, are consistently producing top-tier video models, and none has emerged from an open-source community working out of a garage.
Many attribute this to a gap in data and computing power. While this is certainly true, the figures revealed by Ethan show that the issue runs deeper: the infrastructure costs for video AI have locked the barrier to entry at a level accessible only to a handful of players from the very beginning.
This is somewhat similar to the logic of the semiconductor industry. TSMC’s dominance rests not just on superior process technology but on the fact that building a new wafer fab requires tens of billions of dollars in upfront investment; the barrier itself is the best moat. For video AI, the moat is the multi-petabyte data infrastructure and the monthly bandwidth bills it generates.
Ethan also added a deeper implication in the podcast: the "intelligence" of video models largely comes from the underlying language model, not from the video diffusion model itself.
Video diffusion models are relatively "dumb"—they simply generate images exactly as described in the text. If you write “a cat,” it will generate a cat standing motionless against a pure white background, because you haven’t told it what the background should be or what the cat is doing.
What truly understands user intent and expands “a cat” into a detailed, cinematic description is the large language model behind the prompt rewriting. Ethan says that during the Cosmos era, he tested this with “a happy sheep”: without prompt rewriting, the generated image was extremely CGI-like and lacked texture; with rewriting, the result was dramatically improved—yet the underlying video diffusion model itself remained completely unchanged.
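The division of labor described here can be sketched as a two-stage pipeline. Everything below is a hypothetical stand-in: `rewrite_prompt` plays the role of the language model (a real system would call an LLM), and `generate_video` plays the role of the video diffusion model by simply echoing its input. The sketch only illustrates that the improvement comes from changing the prompt while the video model stays the same.

```python
from typing import Callable, Optional


def generate_video(prompt: str) -> str:
    """Hypothetical stand-in for a video diffusion model: renders exactly what it is told."""
    return f"<video rendering: {prompt}>"


def rewrite_prompt(user_prompt: str) -> str:
    """Hypothetical stand-in for an LLM rewriter that fills in scene, action, and style."""
    return (
        f"{user_prompt}, in a sunlit meadow, walking toward the camera, "
        "shallow depth of field, natural textures, cinematic lighting"
    )


def generate(user_prompt: str,
             rewriter: Optional[Callable[[str], str]] = None) -> str:
    """Optionally expand the prompt first; the video model itself is identical in both paths."""
    prompt = rewriter(user_prompt) if rewriter is not None else user_prompt
    return generate_video(prompt)


print(generate("a happy sheep"))                  # terse prompt passed straight through
print(generate("a happy sheep", rewrite_prompt))  # same model, richer prompt
```

Both calls go through the same `generate_video`; only the text it receives differs, which mirrors the “happy sheep” comparison in the paragraph above.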
This means that what determines how far a company can go in the video AI field is not just the parameter scale of video models, but whether it can simultaneously support both language and video model infrastructures and enable them to work together effectively.
This is a contest of all-around strength.
The next battlefield has already been drawn.
Of course, the industry is also exploring solutions.
These approaches include turning prompt rewriting into an agent-based system, letting the language model act as a commander that coordinates multiple video generation tools, and using traditional software such as FFmpeg for intermediate steps. Their common logic is to separate the computational cost of language model reasoning from that of video diffusion generation, so that each video generation call is more precise and unnecessary computation and data transfer are reduced.
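As a rough illustration of this layering, the sketch below plans one generation step per shot and leaves the stitching to FFmpeg instead of another model call. The planner `plan_video`, the step dictionaries, and the file names are hypothetical; the only real interface used is FFmpeg's concat demuxer (`-f concat -safe 0 -i <list> -c copy`), and the command is built as a list without being executed.

```python
from typing import Dict, List


def write_concat_list(clip_paths: List[str]) -> str:
    """Return the text of a list file in the format FFmpeg's concat demuxer reads."""
    return "".join(f"file '{path}'\n" for path in clip_paths)


def build_concat_command(list_file: str, output: str) -> List[str]:
    """Build (but do not run) an FFmpeg command that joins clips without re-encoding."""
    return ["ffmpeg", "-f", "concat", "-safe", "0",
            "-i", list_file, "-c", "copy", output]


def plan_video(shots: List[str]) -> List[Dict]:
    """Hypothetical planner: one video-model call per shot, then one cheap stitch step."""
    steps: List[Dict] = [
        {"tool": "video_model", "prompt": shot, "output": f"shot_{i}.mp4"}
        for i, shot in enumerate(shots)
    ]
    steps.append({
        "tool": "ffmpeg_concat",
        "inputs": [step["output"] for step in steps],
        "output": "final.mp4",
    })
    return steps


plan = plan_video(["a sheep in a meadow", "close-up of the sheep smiling"])
clips = plan[-1]["inputs"]
print(write_concat_list(clips), end="")
print(" ".join(build_concat_command("clips.txt", plan[-1]["output"])))
```

In a design like this, the expensive diffusion model is called only for the shots themselves, and joining them is handled by ordinary software, which is one way the per-call computation and data movement could be kept down.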
Ethan is confident about the trajectory of "video agents." He predicts that by the end of this year, a tipping point will emerge—when videos generated by agents can consistently reach the quality standard required for commercial advertising, businesses will truly be willing to pay for them, and the overall cost structure will evolve accordingly.
But one thing remains unchanged: whoever controls the storage and flow of data controls the starting point of this game.
In the AI space, the "true barriers" shift periodically—first it was parameter count, then training data scale, then alignment techniques, then inference efficiency. Now, video AI is revealing the next barrier—not some mysterious algorithmic breakthrough, but a cold, hard infrastructure bill.
This was never a bill that everyone could afford.
Header image source: iMini AI
