Reducing AI Project Abandonment Means Rethinking AI Infrastructure


TL;DR AI project abandonment isn’t an AI problem. It’s an infrastructure problem. By breaking the memory wall, organizations can cut inference energy use by an order of magnitude, reallocate scarce GPU and memory resources more efficiently, and run continuous, large-scale agent workloads 24/7 — without adding hardware or increasing energy consumption.
As everyone knows, I am an avid – some may say passionate – consumer of AI news. I regularly share dozens of links to reports, new research, blogs, social riff-raff, and other findings with our team every day. There is so much excitement and opportunity happening in the AI world right now. From the reinvention of software development to ontologies and decision traces, the unstoppable force of AI is actually moving the immovable object of enterprise inertia.
The velocity of this force generates headlines about “the AI bubble”. Whether, when, or how it will burst has become a burning (and presumably click-baiting) question among media outlets everywhere. A lot of them are even highlighting dubious reports, like those coming out of MIT that claim as many as 95% of AI projects are abandoned because they net zero returns.
However, one doesn’t have to look far to realize that the truth, as with so much of AI, is more complicated.
For instance, studies like those from the University of Pennsylvania’s Wharton School reveal that a majority (74%) of AI projects are producing returns. And then there are reports like those from S&P Global that put AI project abandonment rates at near 50%, neatly splitting the data down the middle. What’s really happening here?
I suspect that the project failure rates we’re seeing here are less a product of AI itself and more a reflection of organizational maturity. Take that MIT study again. Rather than measuring generic LLMs, it drew on a limited set of anecdotes about custom or agentic AI tools that require real-time context windows, which come with a much higher bar for success. Many organizations simply don’t have the infrastructure yet to provide this.
So what will it take? Recently, I had the opportunity to chat with Greg Macatee, Senior Analyst at S&P Global, about how mature organizations are transforming their infrastructure to meet the unique requirements of AI – and the work that’s still left to be done. Here are my takeaways.
What Makes the AI Architecture Challenge So Fundamentally Different
As organizations start to refocus their efforts around AI and build up the infrastructure to support it, one of the first things they need to understand is that the challenge they face is entirely distinct from previous tech transformations, such as the migration to cloud.
Consider this: Whereas the CPUs that once dominated computing have leveled out at around a few hundred cores, the GPUs that power AI started life with thousands of cores. In fact, the latest GPUs from NVIDIA and AMD have more than 20,000 cores (more than 160,000 per server, almost 1.5 million per rack) – all working to execute massive numbers of tasks simultaneously rather than sequentially. This extreme level of parallelism has changed the equation. To achieve efficiency and ROI, organizations must not just store data but continuously feed GPUs the massive amounts of data that AI demands. What storage was for CPUs, data access is for GPUs.
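To make that parallelism concrete, here’s a quick back-of-envelope sketch. The per-GPU core count is the rough figure cited above; the 8-GPU server and 72-GPU rack are my own assumptions about typical configurations, not vendor specs:

```python
# Back-of-envelope comparison of CPU vs. GPU parallelism.
# Core counts are rough, illustrative figures, not vendor specifications.

cpu_cores_per_socket = 128      # high-end server CPUs top out around a few hundred cores
gpu_cores_per_device = 20_000   # latest-generation GPU, per the figure cited above
gpus_per_server = 8             # assumed typical GPU server
gpus_per_rack = 72              # assumed rack-scale system

print(f"Cores per GPU server: {gpu_cores_per_device * gpus_per_server:,}")   # ~160,000
print(f"Cores per GPU rack:   {gpu_cores_per_device * gpus_per_rack:,}")     # ~1,440,000
print(f"GPU-to-CPU core ratio (single device): {gpu_cores_per_device // cpu_cores_per_socket}x")
```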
In other words, the architectural challenge has moved from “traditional storage” to ephemeral and persistent memory. Successful organizations are the ones that build infrastructure enabling memory-like access to the data their AI models need, at a cost closer to that of legacy CPU-era storage.
Continuous Data Usage Requires Overcoming the Memory Wall
Acknowledging the importance of memory in AI infrastructure is the first step. The second is recognizing its limitations: current memory bandwidth (the resource behind memory bandwidth utilization, or MBU) cannot keep up with the processing speed of modern GPUs, now measured in petaFLOPS.
And with GPU compute power increasing at roughly double the rate of memory bandwidth, the gap between these two metrics will only get worse. But this isn’t abstract – it’s a challenge the AI industry is already encountering. Large models can run into the hundreds of billions of parameters, while those from the likes of OpenAI or Anthropic are widely reported to have surpassed the one-trillion-parameter milestone. There simply is not enough memory in the GPUs for the model weights and the tensors produced by user prompt processing to enable continuous data consumption at this scale.
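A rough sizing exercise makes the point. The numbers below are hypothetical assumptions (a one-trillion-parameter model in 16-bit precision, a GPU with 192 GB of HBM); what matters is the ratio, not the exact figures:

```python
# Rough sizing: model weights vs. a single GPU's high-bandwidth memory (HBM).
# All figures are illustrative assumptions, not measurements of any specific model or GPU.

params = 1_000_000_000_000       # a 1-trillion-parameter model
bytes_per_param = 2              # 16-bit (FP16/BF16) weights
hbm_per_gpu_gb = 192             # assumed HBM capacity of a current-generation GPU

weights_gb = params * bytes_per_param / 1e9
gpus_for_weights_alone = weights_gb / hbm_per_gpu_gb

print(f"Weights alone: {weights_gb:,.0f} GB")                                  # ~2,000 GB
print(f"GPUs needed just to hold the weights: {gpus_for_weights_alone:.1f}")   # ~10, before any KV cache
```

Stack the KV cache for long contexts and many concurrent users on top of that, and the gap only widens.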
The result is what the industry calls the “memory wall,” and its symptoms will be familiar to anyone who has hit it: slower time-to-first-token (TTFT), rate limits, premium token pricing, and increasing constraints on agentic workloads with concurrent users, sub-tasks, and multi-turn scenarios. In some cases, it may even become necessary to gate end users or agents in order to preserve optimal performance. For many, this can be enough to make them abandon their AI project, which is why overcoming this wall is now so critical.
Break Through the Memory Wall with WEKA’s Augmented Memory Grid
The gap between memory and compute may seem like an intrinsic physical limitation that can’t be overcome. That’s why the solution requires a first-principles rethinking of how memory in AI infrastructure is packaged and used – which is just what WEKA has done with Augmented Memory Grid™.
To understand how it works, it’s necessary to first review how memory functions during inference. When an input is processed in the prefill phase, it is turned into a sequence of tokens, and the attention keys and values computed for those tokens are stored in the GPU’s high-bandwidth memory (HBM). This is the Key-Value (KV) cache. During the decode phase, the model reuses those cached keys and values to produce output tokens at a latency low enough to feel fluid. With single-tenant usage and limited inputs, this process works smoothly.
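If it helps, here’s a minimal sketch of that prefill/decode split. It’s a toy illustration of the concept, not how any production inference engine is implemented:

```python
# Toy illustration of the prefill/decode split around a KV cache.
# Real engines cache per-layer attention keys and values in HBM; this sketch
# only models the control flow: prefill fills the cache once, decode reuses it.

def prefill(prompt_tokens):
    """Process the whole prompt up front and build the KV cache (compute-heavy)."""
    return [f"kv({tok})" for tok in prompt_tokens]  # stand-in for per-token keys/values

def decode(kv_cache, max_new_tokens=3):
    """Generate tokens one at a time, reusing and extending the cache (latency-sensitive)."""
    output = []
    for step in range(max_new_tokens):
        new_token = f"out{step}"             # stand-in for the sampled token
        kv_cache.append(f"kv({new_token})")  # each generated token also extends the cache
        output.append(new_token)
    return output

cache = prefill(["The", "memory", "wall"])
print(decode(cache))   # ['out0', 'out1', 'out2']
print(len(cache))      # 6: three prompt tokens plus three generated tokens
```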
But problems arise when multiple tenants and more complex, multi-part, concurrent, agent-style inputs exceed HBM’s limits. When this happens, older KV caches are evicted and spilled to higher-capacity but slower memory tiers like DRAM. Quite often, after as little as five minutes, when a sub-task within an agent swarm comes back to decode against that cache, it finds that its context is gone. The compute-heavy, energy-intensive prefill phase of inference then has to run again, redundantly and wastefully, just to rebuild the KV cache, considerably slowing the entire process.
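The cost asymmetry is easy to see in a toy model. In the sketch below, a cache miss (an evicted context) forces the expensive prefill to run all over again before a sub-task can decode; the cost numbers are arbitrary placeholders chosen only to make the asymmetry visible:

```python
# Toy model of KV-cache eviction forcing redundant prefill.
# Costs are arbitrary units chosen only to show the hit/miss asymmetry.

PREFILL_COST_PER_TOKEN = 10   # compute-heavy prompt processing
DECODE_COST_PER_TOKEN = 1     # comparatively cheap token generation

kv_cache = {}  # session_id -> cached context (evicted entries simply disappear)

def run_subtask(session_id, context_tokens, new_tokens):
    cost = 0
    if session_id not in kv_cache:                            # cache miss: context was evicted
        cost += PREFILL_COST_PER_TOKEN * len(context_tokens)  # rebuild the KV cache from scratch
        kv_cache[session_id] = list(context_tokens)
    cost += DECODE_COST_PER_TOKEN * new_tokens                # decode against the cache
    return cost

ctx = ["tok"] * 8_000                      # a long agent context
print(run_subtask("agent-1", ctx, 200))    # first call: prefill + decode
print(run_subtask("agent-1", ctx, 200))    # cache hit: decode only
kv_cache.pop("agent-1")                    # simulate eviction after idle time
print(run_subtask("agent-1", ctx, 200))    # miss: the full prefill runs again
```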

This is a graphic I saw during one of Red Hat’s vLLM Office Hours sessions that clearly demonstrates scheduling for disaggregated serving. The words in purple to the right are my addition. Disaggregated inference architecture reduces wasteful GPU prefill work by reusing the KV cache across decode workers, lowering energy consumption while enabling continuous, large-scale AI inference. If you are interested in talking about inference at this level, check out their YouTube playlist full of helpful sessions!
Augmented Memory Grid solves this problem by extending the KV cache effectively without limits. It does this by offloading the cache to a token warehouse™ stored on NVMe flash that spans entire fleets of GPU clusters. These data-center-scale caches can then be pulled back as needed for decoding, providing inference-grade latency alongside multi-petabyte-scale capacity.
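Conceptually, you can think of it as a tiered cache: look for a session’s KV entries in HBM first, then DRAM, then the flash-backed token warehouse, and only fall back to recomputing prefill when nothing is found anywhere. The sketch below is my own illustration of that lookup order, not WEKA’s implementation or API:

```python
# Conceptual tiered KV-cache lookup: HBM -> DRAM -> NVMe-backed token warehouse.
# Purely illustrative; not WEKA's implementation, API, or data layout.

TIERS = ["hbm", "dram", "nvme_warehouse"]   # fastest to slowest, smallest to largest

cache = {tier: {} for tier in TIERS}

def get_kv(session_id):
    """Return cached KV context from the fastest tier that has it, promoting it to HBM."""
    for tier in TIERS:
        if session_id in cache[tier]:
            entry = cache[tier][session_id]
            cache["hbm"][session_id] = entry   # promote hot context back to HBM
            return entry
    return None                                # miss on every tier: prefill must recompute

def put_kv(session_id, kv_entry):
    """Newly built KV context lands in HBM; eviction would demote it down the tiers."""
    cache["hbm"][session_id] = kv_entry

put_kv("agent-1", "kv-for-long-context")
cache["nvme_warehouse"]["agent-1"] = cache["hbm"].pop("agent-1")  # simulate demotion under memory pressure
print(get_kv("agent-1"))   # found in the warehouse, so no redundant prefill is needed
```

The point of the warehouse tier is that it is large enough to make “miss on every tier” rare, which is what keeps prefill from running redundantly.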
What does this mean for organizations?
- Massive energy reductions as GPU prefill work is cut by an order of magnitude (see the back-of-envelope sketch after this list)
- More efficient reallocation of scarce GPUs, memory (DRAM), and storage (SSD), shrinking the ‘inference cost center’ of GPU prefill (processing input tokens) while maximizing the ‘inference profit center’ of decode (generating output tokens).
- Prefill and decode can now execute optimally at unprecedented scale, enabling 24/7 continuous streaming of agents – without any new hardware components or additional energy consumption.
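To make the first bullet concrete, here’s a back-of-envelope sketch. The hit rate and token counts are hypothetical assumptions, chosen only to show how reusing cached context across repeated agent turns can cut prefill work (and the energy tied to it) by roughly an order of magnitude:

```python
# Back-of-envelope: how KV-cache reuse reduces prefill work across repeated agent turns.
# All numbers are hypothetical assumptions for illustration only.

context_tokens = 50_000     # long shared context per agent session
turns = 20                  # sub-tasks / turns that revisit that context
cache_hit_rate = 0.9        # assumed fraction of turns that find their KV cache intact

prefill_without_reuse = context_tokens * turns
prefill_with_reuse = context_tokens * turns * (1 - cache_hit_rate)

print(f"Prefill tokens without reuse: {prefill_without_reuse:,}")       # 1,000,000
print(f"Prefill tokens with reuse:    {prefill_with_reuse:,.0f}")       # 100,000
print(f"Reduction: {prefill_without_reuse / prefill_with_reuse:.0f}x")  # ~10x
```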
See What Lies Beyond the AI Memory Wall
This only scratches the surface of how Augmented Memory Grid works and what its implications mean for businesses. To get a full understanding of the challenges we face and how WEKA is solving them, take in my full conversation in our webinar: Breaking Down the Memory Wall in AI Infrastructure.