The AI conversation is often framed too simply. Either this is a bubble inflated by chips, data centers, and hype, or it is the early stage of a much larger software cycle. The evidence suggests a more useful interpretation: AI is not ending at the model layer. Its center of gravity appears to be shifting from training frontier models toward running them cheaply, reliably, and repeatedly inside products and workflows.
Stanford HAI’s 2025 AI Index reports that the cost of querying a system at roughly GPT-3.5 quality fell from $20 per million tokens in November 2022 to $0.07 by October 2024, while McKinsey reports that 78% of organizations surveyed were already using AI in at least one business function. That combination matters. When capability improves while inference becomes dramatically cheaper, AI stops being only a research story or an infrastructure story. It becomes a deployment story.
The strategic question begins to change — from who can train the biggest model to who can turn intelligence into software that is useful, trusted, and economically repeatable. Menlo Ventures estimates that enterprise spending on generative AI reached $37 billion in 2025, with $19 billion — more than half — going to the application layer. Deloitte, meanwhile, projects that inference workloads will make up roughly two-thirds of AI compute in 2026, up from about one-third in 2023 and half in 2025.
The First Phase Was Necessarily About Training
The first visible phase of generative AI was always likely to be infrastructure-heavy. Large-scale training created the breakthrough moment, so attention naturally concentrated on foundation models, hyperscale compute, GPU access, and the economics of training. In platform terms, this was the buildout phase: the period when the underlying engine had to become viable before a broader commercial ecosystem could form around it. Stanford HAI’s 2025 report still emphasizes model progress, hardware efficiency, and the narrowing gap between open-weight and closed models.
That pattern is not unusual in technology history. The internet first expanded through networks and protocols before value compounded in web services. Cloud scaled first through infrastructure before SaaS became the dominant commercial layer. Smartphones were initially framed around devices and operating systems before the app economy became the larger economic story. AI now appears to be following the same rhythm: foundational capability first, application expansion later.
Why Inference Changes the Economics of the Market
Training creates capability, but inference is where capability is delivered. Inference is the operational act of using a trained model to answer, classify, summarize, generate, or assist inside a real workflow. That makes it the point where model intelligence becomes product experience, operational output, and revenue or productivity impact.
This is why the collapse in inference cost matters so much. Stanford HAI reports that the cost of querying a GPT-3.5-level model fell more than 280-fold between November 2022 and October 2024. The same report says hardware costs have been falling by about 30% per year while energy efficiency has improved by roughly 40% annually. It also notes that open-weight models narrowed the performance gap with closed models from 8% to 1.7% on some benchmarks in a single year.
The practical implication is straightforward: lower inference cost makes more business models viable. It allows higher query volumes, broader deployment, and more AI-native features inside software products with better unit economics. The competition is no longer only about who trained the most advanced model. It is increasingly about who can serve useful intelligence with the right mix of latency, throughput, reliability, and cost.
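To make that unit-economics point concrete, here is a minimal back-of-the-envelope sketch in Python. The two per-million-token prices are the Stanford HAI figures cited above; the product volumes (tokens per interaction, interactions per day) are hypothetical assumptions chosen purely for illustration.

```python
# Back-of-the-envelope unit economics for an AI-powered feature.
# Prices per million tokens are the Stanford HAI figures cited above;
# all volume numbers below are HYPOTHETICAL, for illustration only.

PRICE_NOV_2022 = 20.00   # USD per 1M tokens, GPT-3.5-level, Nov 2022
PRICE_OCT_2024 = 0.07    # USD per 1M tokens, GPT-3.5-level, Oct 2024

decline = PRICE_NOV_2022 / PRICE_OCT_2024
print(f"Cost decline: ~{decline:.0f}x")  # ~286x, i.e. "more than 280-fold"

# Hypothetical product: an in-app assistant.
tokens_per_interaction = 2_000    # prompt + completion (assumed)
interactions_per_day = 500_000    # daily usage (assumed)
daily_tokens = tokens_per_interaction * interactions_per_day  # 1B tokens/day

for label, price in [("2022 pricing", PRICE_NOV_2022),
                     ("2024 pricing", PRICE_OCT_2024)]:
    daily_cost = daily_tokens / 1_000_000 * price
    print(f"{label}: ${daily_cost:,.2f}/day")
# 2022 pricing: $20,000.00/day -> hard to justify as a standard feature
# 2024 pricing: $70.00/day     -> easily absorbed into product margins
```

The invented volumes are not the point; the shape of the result is. A roughly 285-fold price drop turns a feature that was a major daily cost line into a rounding error, which is exactly what makes broad, always-on deployment economically viable.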
Compute Demand Is Also Shifting Toward Inference
One of the clearest signs of this transition is the changing composition of AI compute. Deloitte projects that inference workloads will account for roughly two-thirds of all AI compute in 2026, up from about one-third in 2023 and about half in 2025. Deloitte also projects that the market for inference-optimized chips will exceed $50 billion in 2026.
That forecast matters because it suggests that the larger recurring economic activity may increasingly be the use of models rather than their creation. Training will remain strategic, but if inference takes the dominant share of compute demand, commercial attention naturally shifts toward serving models efficiently in production. In other words, the market starts to care less about who trained the model and more about who can deliver low-latency, cost-effective, real-time intelligence at scale.
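One way to see why a rising share matters is to combine it with growth in the overall compute base. In the sketch below, the inference shares are Deloitte's projections cited above; the total-compute index and its growth rates are invented assumptions, used only to show the direction of the compounding effect.

```python
# Inference's share of AI compute (Deloitte projections cited above),
# combined with a HYPOTHETICAL index of total AI compute (2023 = 100)
# to illustrate how a rising share of a growing base compounds.

years = {
    # year: (inference share per Deloitte, hypothetical total-compute index)
    2023: (1 / 3, 100),
    2025: (1 / 2, 300),   # assumed 3x total growth over two years
    2026: (2 / 3, 450),   # assumed further 1.5x growth in one year
}

for year, (share, total) in years.items():
    inference = share * total
    training = total - inference
    print(f"{year}: inference ~{inference:.0f}, training ~{training:.0f} (index units)")

# Under these assumed growth rates, inference compute grows ~9x (33 -> 300)
# while training compute grows ~2.3x (67 -> 150): the recurring "usage"
# workload outpaces the one-off "creation" workload.
```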
The Application Layer Is Already Absorbing More of the Spend
The same shift is visible in enterprise spending. Menlo Ventures estimates that enterprises spent $37 billion on generative AI in 2025, and that $19 billion of that — more than half — went to the application layer. Menlo further divides that application spend into departmental AI, vertical AI, and horizontal AI products.
That is important because spending is often the clearest indicator of where value is actually being captured. Once the application layer begins taking the largest share of spend, the market is moving beyond fascination with raw capability and toward software that solves concrete problems. That is consistent with how prior platform cycles evolved: once the infrastructure becomes capable and accessible enough, value tends to migrate upward toward products built on top of it.
Enterprise Adoption Points in the Same Direction
McKinsey’s 2025 State of AI research found that 78% of respondents said their organizations use AI in at least one business function, up from 72% in early 2024 and 55% a year earlier. McKinsey also found that workflow redesign had the biggest effect on whether organizations were seeing EBIT impact from generative AI.
That is a meaningful signal. It suggests that enterprise value is increasingly determined not by model novelty alone, but by how well AI is embedded into actual work. The question is shifting from “Can the model do something impressive?” to “Can the organization deploy it in a way that changes how work gets done?” In enterprise software, that usually means integration, reliability, governance, usability, and measurable outcomes matter at least as much as abstract model performance.
NVIDIA’s Recent Moves Reinforce the Inference Thesis
Recent NVIDIA developments are especially revealing because NVIDIA has been the clearest symbol of the AI infrastructure boom. On December 24, 2025, Groq announced that it had entered into a non-exclusive licensing agreement with NVIDIA for Groq’s inference technology, with founder Jonathan Ross, president Sunny Madra, and other team members joining NVIDIA to help scale the licensed technology. Reuters similarly described the arrangement as a licensing-and-talent deal rather than a straightforward acquisition.
That distinction strengthens the broader point. NVIDIA already dominates much of the training-centered layer of AI infrastructure, yet it still chose to deepen its position in inference through dedicated technology licensing and team absorption. Reuters highlighted Groq's SRAM-based architecture and its relevance to real-time inference, which fits the broader industry shift from training-heavy competition toward low-latency serving and inference performance.
NVIDIA’s Chip Roadmap Points in the Same Direction
In March 2025, NVIDIA introduced Blackwell Ultra and explicitly framed it as paving the way for the “age of AI reasoning,” saying it boosts both training and test-time scaling inference to support reasoning, agentic AI, and physical AI workloads. NVIDIA’s technical materials also position Blackwell Ultra around large-scale reasoning inference and better total cost of ownership.
By March 2026, Reuters reported that NVIDIA was leaning even harder into inference as a major revenue opportunity, outlining a strategy to compete more aggressively in the fast-growing market for running AI systems in real time. Reuters also reported that NVIDIA showcased a system using Groq technology alongside its next-generation roadmap at GTC 2026. Taken together, these moves suggest that the competitive frontier is no longer only about training larger models. It is increasingly about serving larger contexts, lowering latency, improving inference economics, and supporting reasoning-intensive workloads in production.
So Is This a Bubble?
At the company level, parts of the AI market may indeed prove overbuilt. That would not be surprising. In major technology transitions, not every infrastructure supplier survives, not every hardware player wins, and not every early valuation holds. That was true in earlier cycles too: many internet infrastructure firms disappeared even as the internet transformed the economy; many mobile hardware players faded even as smartphones became foundational.
From a broader market perspective, the more defensible conclusion is not that AI is free of excess. It is that the technology wave itself does not look like a short-lived illusion. The pattern now visible — heavy infrastructure investment first, followed by falling usage costs, rising deployment, greater application-layer spend, and growing workflow integration — is exactly what one would expect if a platform technology were moving from buildout into commercialization.
Conclusion
A more grounded interpretation of the current moment is this: the first visible phase of AI was about proving that large-scale training could create extraordinary new capability. The next phase is about proving where that capability belongs — in software, in workflows, in domain-specific tools, and in products people use repeatedly.
That is why the right question is not simply whether AI is a bubble, but whether we are watching a familiar technology cycle move into its next natural phase. The evidence increasingly suggests yes. Inference is becoming dramatically cheaper. Inference workloads are taking a larger share of compute. Application-layer spending is rising. Enterprise adoption is broadening. And even NVIDIA's recent strategy and product roadmap are leaning more heavily into inference, real-time serving, and reasoning infrastructure.
The strongest conclusion, then, is not that every company in the current AI stack will emerge a winner. At the aggregate level, however, this does not look like the collapse of a hollow theme. It looks much more like a normal platform progression: first the infrastructure is built, then the broader market forms around the software and applications that use it.
By that standard, AI looks less like a fading bubble than a maturing platform entering the stage where inference, applications, and real-world use cases begin to matter more than the original race to train the models.
Sources
[S1] Stanford HAI, The 2025 AI Index Report. Used for inference-cost decline, hardware-cost and energy-efficiency trends, open-weight versus closed-model convergence, and overall AI usage growth.
[S2] McKinsey, The State of AI: How Organizations Are Rewiring to Capture Value (2025). Used for enterprise AI adoption and workflow-redesign findings.
[S3] Menlo Ventures, 2025: The State of Generative AI in the Enterprise. Used for enterprise generative-AI spending and application-layer share.
[S4] Deloitte, More Compute for AI, Not Less (2026 TMT Predictions). Used for projected inference share of AI compute and the inference-chip market outlook.
[S5] Groq newsroom, Groq and Nvidia Enter Non-Exclusive Inference Technology Licensing Agreement (Dec. 24, 2025). Used for the official characterization of the NVIDIA–Groq deal.
[S6] Reuters, Nvidia, Joining Big Tech Deal Spree, to License Groq Technology, Hire Executives (Dec. 24, 2025). Used for independent reporting on the Groq deal structure and inference positioning.
[S7] NVIDIA Newsroom, NVIDIA Blackwell Ultra AI Factory Platform Paves Way for Age of AI Reasoning (Mar. 18, 2025). Used for Blackwell Ultra’s positioning around reasoning and test-time scaling inference.
[S8] NVIDIA Technical Blog, NVIDIA Blackwell Ultra for the Era of AI Reasoning. Used for Blackwell Ultra’s inference-oriented positioning and TCO framing.
[S9] Reuters, Nvidia Bets on AI Inference as Chip Revenue Opportunity Hits $1 Trillion (Mar. 16, 2026). Used for NVIDIA’s current inference-focused strategic framing and GTC 2026 positioning.