There is a statistic the AI consulting industry loves to cite: the overwhelming majority of enterprise AI projects fail to deliver on their promise. RAND found that over 80% of AI projects fail, roughly twice the failure rate of non-AI IT projects. MIT's Project NANDA put it more starkly still, finding that 95% of enterprise generative-AI pilots delivered no measurable return. The slide is always the same: a funnel, a scary number, a call to action.
The framing is backwards.
The ones that fail do so cheaply. They die in sandboxes. The team learns something, moves on, and the invoice is a few months of engineering time and a pilot agreement. Expensive, maybe. Catastrophic, no.
The ones that reach production are a different story. They're the ones that inherit everything the industry doesn't put on the slide: pricing structures that cannot be real, model lifecycles measured in months, and a class of operational cost that nobody has figured out how to manage at scale yet.
The ones that succeed are the ones you should be worried about.
The 18-month clock that starts the day you ship
GPT-4o has already been retired. Not deprecated, retired. OpenAI pulled it from ChatGPT in February 2026, and on Azure the GPT-4o API endpoints went dark on 31 March 2026. Standard deployments that didn't migrate in time were auto-upgraded to GPT-5.1 on 9 March, ready or not.
The migration was not a version bump. GPT-5.1 handles system messages differently and enforces stricter JSON schemas, and OpenAI's newer API surface (Responses) supersedes the older Assistants/Threads architecture entirely. The teams that pinned to GPT-4o and never tested forward faced the hardest migrations. The phenomenon even earned a name: prompt drift, where a prompt that worked perfectly on one model behaves differently on the next and has to be re-tested.
And it isn't over. Microsoft's schedule has the Assistants API, which entire product architectures were built on, shutting down in August 2026. Azure now publishes a rolling model-retirement calendar, which tells you everything: the sensible move is to treat model deprecations like any other dependency risk, track them in CI, and assume the next migration is always coming.
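One way to track retirements in CI is a small check that flags any pinned model sitting inside a warning window. What follows is a minimal sketch, not a definitive implementation: the `RETIREMENTS` table, the `check_pinned_models` helper, and the 90-day window are all hypothetical. The only date taken from this article is the 31 March 2026 Azure retirement of GPT-4o.

```python
from datetime import date

# Hypothetical, hand-maintained calendar. In practice it would be copied from
# the provider's published retirement schedule; only the GPT-4o date below
# comes from this article.
RETIREMENTS = {
    "gpt-4o": date(2026, 3, 31),
}


def check_pinned_models(pinned, today, warn_days=90):
    """Return one problem string per pinned model that is retired or close to it."""
    problems = []
    for model in pinned:
        retires_on = RETIREMENTS.get(model)
        if retires_on is None:
            # No known retirement date: nothing to flag in this sketch.
            continue
        days_left = (retires_on - today).days
        if days_left < 0:
            problems.append(f"{model}: retired on {retires_on.isoformat()}")
        elif days_left <= warn_days:
            problems.append(f"{model}: retires in {days_left} days")
    return problems
```

A CI job could call `check_pinned_models(...)` with `date.today()` and exit non-zero whenever the returned list is not empty, so a looming retirement fails the build like any other stale dependency.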
That is infrastructure you didn't budget for, maintaining software you didn't write, on a timeline someone else controls. Every vendor will optimise for economics over sentiment, and GPT-4o was the proof.
The price you're paying isn't real
Here is what the token pricing you're building your unit economics on actually represents:
Leaked Microsoft financial documents, reported by TechCrunch in November 2025, showed OpenAI spent $8.67 billion on inference on Azure in just the first three quarters of 2025, more than double its 2024 inference spend, for a cumulative $12.43 billion from the start of 2024 through Q3 2025. Over the same nine months, the 20% revenue-share payments back to Microsoft implied OpenAI's total revenue was only around $4.33 billion. In other words, the inference bill alone ran to roughly double the revenue.
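The ratios in that paragraph can be checked directly from the figures it quotes. The sketch below uses only those numbers; the 2024 spend and the revenue-share payment are derived here by subtraction and multiplication, not quoted from the leaked documents.

```python
# Figures as reported in the paragraph above, in billions of US dollars.
inference_2025_q1_q3 = 8.67        # Azure inference spend, first three quarters of 2025
inference_cumulative = 12.43       # start of 2024 through Q3 2025
implied_revenue_2025_q1_q3 = 4.33  # revenue implied by the revenue-share payments
revenue_share_rate = 0.20

# 2024 spend is whatever is left of the cumulative total.
inference_2024 = inference_cumulative - inference_2025_q1_q3

# Revenue-share payment implied by the 20% rate (derived, not quoted in the text).
implied_payment = implied_revenue_2025_q1_q3 * revenue_share_rate

growth_vs_2024 = inference_2025_q1_q3 / inference_2024
spend_to_revenue = inference_2025_q1_q3 / implied_revenue_2025_q1_q3

print(f"2024 inference spend: ${inference_2024:.2f}B")
print(f"Implied revenue-share payment: ${implied_payment:.3f}B")
print(f"2025 (9 months) vs 2024 spend: {growth_vs_2024:.2f}x")
print(f"Inference spend per $1 of revenue: ${spend_to_revenue:.2f}")
```

On these inputs the nine-month 2025 spend comes out at about 2.31 times the full-year 2024 spend, and roughly $2.00 of inference spend for every $1 of implied revenue.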
Those losses aren't an accident of growth. They're a strategy. The major providers are pricing inference below the cost of serving it to capture market share before consolidation. Independent forecasts project OpenAI's cumulative losses running into the tens of billions over the coming years. Much of the spend, too, is non-cash cloud credits that don't show up cleanly in published figures.
Even the headline revenue figures are contested. As Anthropic moved toward an IPO in 2026, the question of how it books revenue came to a head: it reports sales through AWS and Google on a gross basis, counting the full end-customer spend (including the cloud partner's cut) as its own revenue. OpenAI publicly accused Anthropic of overstating its annualised revenue by roughly $8 billion through that method. Both treatments are permissible under US GAAP, which is the point: the top-line numbers the industry quotes depend on accounting choices, not just demand.
API prices have fallen 40–70% since 2024. That is not because it got cheaper to run the models. It's because every major provider is fighting for lock-in before market consolidation, funded by capital that has not yet asked for a return.
You are building your production cost models on a price that is subsidised by billions of dollars in venture capital and strategic investment. At some point, those investors will want their money back.
The question is not whether prices will normalise. The question is what your application's unit economics look like when they do.
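That question can be put to a spreadsheet or a few lines of code today. The sketch below is a toy sensitivity check under stated assumptions: the workload (8,000 input and 1,000 output tokens per request), the $0.10 of revenue per request, the $5 / $25 per-million-token prices, and the price multipliers are all hypothetical placeholders to be replaced with your own numbers.

```python
def cost_per_request(input_tokens, output_tokens, price_in_per_m, price_out_per_m):
    """Token cost of one request in dollars, given prices per million tokens."""
    return (input_tokens * price_in_per_m + output_tokens * price_out_per_m) / 1_000_000


def margin_per_request(revenue, input_tokens, output_tokens,
                       price_in_per_m, price_out_per_m, price_multiplier=1.0):
    """Revenue minus token cost after scaling today's prices by price_multiplier."""
    cost = cost_per_request(input_tokens, output_tokens,
                            price_in_per_m * price_multiplier,
                            price_out_per_m * price_multiplier)
    return revenue - cost


# Hypothetical workload and prices; none of these figures come from a vendor.
for multiplier in (1.0, 1.5, 2.0, 3.0):
    margin = margin_per_request(0.10, 8_000, 1_000, 5.0, 25.0, multiplier)
    print(f"{multiplier:.1f}x price -> margin ${margin:+.4f} per request")
```

In this made-up case the margin is positive at today's price, nearly gone at 1.5x, and negative at 2x. The useful output is not the specific numbers but the multiplier at which your own margin crosses zero.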
FinOps for AI is an unsolved problem
Traditional cloud cost management has a foundation: tag your resources, allocate costs to teams and products, set budgets, get alerts. It works because cloud costs are tied to persistent resources: a server, a database, a network endpoint. You can tag those. You can reason about them.
An API call has no resource to tag. It's a transaction, not an asset.
CloudZero's analysis puts average monthly AI spend at roughly $85,500 in 2025, up about 36% year-on-year, with the share of companies planning to spend over $100,000 a month more than doubling, from 20% to 45%. The spend is climbing fast, and the tooling to govern it hasn't caught up. The fundamental reason is structural: allocating AI costs to teams, products, or features requires capturing metadata at the application layer and propagating it through to billing data. That's an engineering problem most organisations haven't solved.
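Capturing that metadata at the application layer can start very small. The following is a minimal sketch, assuming the provider reports input and output token counts with each response and that your code knows which team and feature made the call. `Usage`, `UsageLedger`, the tag names, and the prices are hypothetical, and a real system would write to a durable store and reconcile against the invoice.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class Usage:
    team: str
    feature: str
    input_tokens: int
    output_tokens: int


class UsageLedger:
    """Collects per-call token usage with the tags that billing data lacks."""

    def __init__(self, price_in_per_m, price_out_per_m):
        self.price_in_per_m = price_in_per_m
        self.price_out_per_m = price_out_per_m
        self.records = []

    def record(self, team, feature, input_tokens, output_tokens):
        # Called by application code right after each model call.
        self.records.append(Usage(team, feature, input_tokens, output_tokens))

    def cost_by(self, key):
        """Total dollar cost grouped by 'team' or 'feature'."""
        totals = defaultdict(float)
        for r in self.records:
            cost = (r.input_tokens * self.price_in_per_m
                    + r.output_tokens * self.price_out_per_m) / 1_000_000
            totals[getattr(r, key)] += cost
        return dict(totals)


# Hypothetical calls from two teams at $5 / $25 per million tokens.
ledger = UsageLedger(5.0, 25.0)
ledger.record("search", "summarise", 12_000, 800)
ledger.record("search", "rerank", 4_000, 100)
ledger.record("support", "draft-reply", 20_000, 1_500)
print(ledger.cost_by("team"))
```

The design choice worth noting is that the tags travel with the call, not with the infrastructure: the same ledger can be grouped by team, feature, or any other field added to `Usage`.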
It gets worse when you move from chatbots to agents. Gartner estimates agentic systems consume 5 to 30 times more tokens per task than a standard chatbot, because every reasoning step re-sends the growing context on each tool call.
The arithmetic is unforgiving on a frontier model. Claude Opus 4.8 runs $5 per million input tokens and $25 per million output. In a long agentic coding loop, the context re-sent on each step routinely passes 50,000 tokens, so a single late-loop step can cost $0.25 or more before output, and a real task chains dozens of those steps. A run that the old guidance pegged at a few dollars now lands well into the tens, and a single sprawling task can clear $100 if nothing is trimming the context it drags along. An engineering decision that looks purely technical (prompt design, context window length, output verbosity) is now a financial one.
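The arithmetic can be made concrete with a toy model of a loop that re-sends its whole context on every step. The prices are the ones quoted above; the step count, starting context, growth per step, and output size are hypothetical, and the model ignores caching or any other discount a provider might offer.

```python
PRICE_IN_PER_M = 5.0    # dollars per million input tokens (from the text)
PRICE_OUT_PER_M = 25.0  # dollars per million output tokens (from the text)


def agent_run_cost(steps, start_context, growth_per_step, output_per_step):
    """Dollar cost of a loop that re-sends its whole context on every step."""
    total_in = 0
    total_out = 0
    context = start_context
    for _ in range(steps):
        total_in += context          # the full context is billed as input again
        total_out += output_per_step
        context += growth_per_step   # tool results and replies are appended
    return (total_in * PRICE_IN_PER_M + total_out * PRICE_OUT_PER_M) / 1_000_000


# One late-loop step with 50,000 tokens of context and no output yet.
single_step = agent_run_cost(1, 50_000, 0, 0)

# Hypothetical run: 40 steps, 10,000 tokens to start, 2,000 added per step,
# 500 output tokens per step.
full_run = agent_run_cost(40, 10_000, 2_000, 500)

print(f"single step: ${single_step:.2f}")
print(f"full run:    ${full_run:.2f}")
```

Under these assumptions the single step costs $0.25 and the 40-step run costs $10.30. Doubling the run to 80 steps raises it to $36.60, more than triple, because every extra step pays again for everything that came before it.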
Gartner predicts inference costs will fall more than 90% by 2030, and warns that the savings won't reach enterprises. A 90% price cut makes each token ten times cheaper, but a workload that moves from a chatbot to an agent can consume up to 30 times more tokens, so token demand can rise faster than token prices fall. The efficiency gains get absorbed by the complexity of what you're asking the model to do.
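Combining the two Gartner figures gives a rough net effect per task. This is a back-of-envelope sketch, assuming the 5 to 30 times token multiplier is applied to a workload that moves from a chatbot to an agent while per-token prices fall by exactly 90%; `net_cost_multiplier` is a hypothetical helper, not anything Gartner publishes.

```python
def net_cost_multiplier(price_drop_fraction, token_multiplier):
    """Cost per task relative to today after a price drop and a token increase."""
    return (1.0 - price_drop_fraction) * token_multiplier


for tokens_x in (5, 10, 30):
    m = net_cost_multiplier(0.90, tokens_x)
    print(f"{tokens_x}x tokens at 90% lower price -> {m:.1f}x cost per task")
```

On these assumptions the break-even sits at 10 times the tokens: at 5 times, a task costs about half what it does today, and at 30 times it costs about three times as much despite the price cut.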
The organisations that are navigating this are the ones that treated cost observability as a first-class engineering concern from day one, not a finance problem to solve after shipping. Most didn't.
Who's actually winning
While the model companies race to prove they can reach profitability, the infrastructure layer is printing money.
Microsoft, Google, Amazon, and Meta have pledged approximately $700 billion in AI-related capital expenditure for fiscal 2026, a 60% jump from 2025. AWS Bedrock customer spending jumped 60% quarter-over-quarter in Q4 FY2025. NVIDIA's Blackwell GPU delivers 65x more tokens per second than Hopper, at 35x lower cost per token, and is still supply-constrained because demand is that high.
The hyperscalers are not losing money on AI. They are the cost that every model company is paying. OpenAI's revenue share with Microsoft flows in both directions. AWS takes a cut of every Bedrock inference call. Google Cloud captures margin on every Claude API request routed through its infrastructure.
The model companies are the visible face of the AI economy. The cloud providers are the landlords.
The question to ask before you ship
The AI projects that fail are not the expensive ones. They're the learning ones. The expensive ones are the projects that reach production, build up internal dependencies, tune workflows around a specific model's behaviour, and then face one or more of the following:
- A model retirement that forces a code migration on two weeks' notice
- A token price normalisation that breaks the unit economics the business case was built on
- An agent architecture that scales token costs non-linearly in ways nobody modelled
- An infrastructure layer where traditional cost allocation tools simply don't work
None of these are hypothetical. They are all happening now, to teams that shipped successfully.
The question worth asking before you commit to production isn't "will this work?" It's: what does this cost if the token price has to cover the cost of serving it? What does your architecture look like when the model you tuned against retires in 14 months? Who owns the migration when that happens?
The hype is about the failure rate. The real story is what's waiting for the ones that succeed.
Sources
Primary sources
- AI project failure rate (over 80%), RAND Corporation
- 95% of enterprise GenAI pilots see no return (MIT Project NANDA), Fortune
- Retiring GPT-4o and older models, OpenAI
- Azure model retirements policy, Microsoft Learn
- Azure model retirement schedule, Microsoft Learn
- Inference costs to fall 90%+ by 2030, savings won't reach enterprises, Gartner
- Claude Opus 4.8 API pricing ($5/$25 per Mtok), Anthropic
- OpenAI's $12B Microsoft inference spend, The Register
- Leaked documents on OpenAI's payments to Microsoft, TechCrunch
Analysis & commentary
- OpenAI revenue and loss projections, FutureSearch
- Leaked OpenAI / Microsoft infrastructure spend, Where's Your Ed?
- Anthropic profitability analysis, Where's Your Ed?
- Anthropic cost structure, Ignacio de Gregorio (Medium)
- AI FinOps token economics, Zylos Research
- FinOps for AI, CloudZero
- GenAI cost optimisation, CloudKeeper
- API pricing drop 40–70%, AI Empire Media
- The $700B AI bet, LongYield
- AWS Bedrock marketplace dynamics, Platform Professional