The Memory Shortage Is Not About Memory

AI did not just need more memory. It exposed the need for a different kind of memory — bandwidth-delivered, thermally survivable, and package-integrated.

[Hero graphic: One Shortage · Two Readings. The public sees a quantity problem; the engineer sees a form, location, and physics problem.]
Hero — Public Story vs Engineering Reality. The mainstream account describes a quantity problem: not enough memory. The engineering account describes a kind problem: the wrong kind of memory. What AI infrastructure requires is not more DRAM — it is memory that can deliver bandwidth at the package boundary, survive elevated thermal conditions, and integrate where the accelerator’s memory controller can reach it without crossing a latency-heavy path.
01 — The Public Story

The Story Everyone Knows

In 2026, global semiconductor revenue is projected to approach $1 trillion — $975 billion, a 26.3% increase over 2025’s record-setting $772 billion. Memory is the fastest-growing segment: $294.8 billion projected, up 39.4% from $211.6 billion a year earlier. These figures come from the World Semiconductor Trade Statistics autumn 2025 forecast.

The story as most people have encountered it runs like this. AI training and inference consume enormous amounts of memory. Demand has outpaced production capacity. Prices rise. Memory companies positioned in the high-bandwidth supply chain benefit. Governments invest in domestic manufacturing to reduce supply chain risk.

This account is accurate as far as it goes. It reflects a real demand curve driven by real infrastructure investment.

But it stops exactly where the engineering story begins.


02 — Engineering Reality

The Definition Changed

The mainstream account describes a quantity problem: not enough gigabytes at the price the market expects. The engineering account is different. It describes a kind problem — and the distinction matters.

In many autoregressive decode scenarios — the process by which a large language model generates each token sequentially — the GPU is not compute-bound. It is memory-bound. The arithmetic intensity of the operation is low: so few floating-point operations are performed per byte fetched that moving model weights from memory to the compute die dominates the time spent on actual arithmetic. The GPU waits. Cycles pass. The calculation is not too complex. The data cannot arrive fast enough.
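A back-of-the-envelope roofline check makes the memory-bound claim concrete. The sketch below is illustrative rather than a profile of any specific accelerator: the parameter count, bandwidth, and compute figures are stand-in values, and the comparison simply asks whether arithmetic or weight movement sets the floor on per-token latency.

```python
# Illustrative roofline-style floors for batch-1 autoregressive decode.
# Every figure below is a stand-in value, not a measurement of any real GPU.

def decode_token_floors(n_params, bytes_per_param, hbm_bandwidth_Bps, peak_flops):
    """Lower bounds on per-token time, assuming every weight is read from
    memory exactly once per generated token."""
    flops = 2 * n_params                               # one multiply-add per weight
    weight_bytes = n_params * bytes_per_param
    compute_floor = flops / peak_flops                 # if arithmetic were the limit
    memory_floor = weight_bytes / hbm_bandwidth_Bps    # if weight streaming is the limit
    return compute_floor, memory_floor

# Stand-in example: 13B parameters in 16-bit weights, ~3 TB/s of HBM bandwidth,
# ~1 PFLOP/s of dense compute.
compute_t, memory_t = decode_token_floors(13e9, 2, 3e12, 1e15)
print(f"compute floor: {compute_t * 1e6:.0f} us/token")
print(f"memory floor:  {memory_t * 1e3:.1f} ms/token")
# The memory floor is more than two orders of magnitude higher: the token rate
# is set by how fast the weights arrive, not by how fast the arithmetic runs.
```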

This changes what “useful memory” means at the system level. A petabyte of DDR4 distributed across a data center’s memory bus does not solve the problem. That memory is three hops away from where the computation is happening. By the time its contents traverse the system interconnect, the GPU has stalled and recovered multiple times over.

What AI infrastructure actually requires is memory that can deliver bandwidth at the package boundary — where the compute die lives — survive the elevated thermal conditions that a dense package generates under sustained workloads, and be present precisely where the accelerator’s memory controller can reach it without crossing a latency-heavy path.

AI did not merely ask for more memory. It exposed that only a very specific kind of memory now matters: memory that can deliver bandwidth, survive heat, fit inside a package, and move with the workload.

The problem is not volume; it is kind. And this definitional shift has a structural effect on how memory is manufactured — one that makes both types of shortage worse simultaneously.


03 — The Wafer Paradox

One Pool. Two Shortages.

High-bandwidth memory and conventional DRAM are both manufactured on silicon wafers. That sounds obvious. It is not.

HBM achieves its bandwidth through a wide bus — HBM4 uses a 2048-bit interface — implemented by stacking multiple DRAM dies vertically and connecting them through thousands of through-silicon vias. That geometry pays for its bandwidth in silicon: it consumes substantially more wafer area per unit of stored capacity than a conventional DDR5 die. Each gigabyte of HBM requires approximately three to four times the wafer area of a gigabyte of DDR5. The premium arises from TSV interconnect density, per-layer die overhead, and the smaller share of each stacked layer available to storage arrays.

The same DRAM wafer pool that supplies DDR5 for servers, laptops, and mobile devices also supplies HBM for AI accelerators. It is a finite pool. When allocation toward HBM increases, the allocation available for DDR5 and LPDDR5X contracts proportionally. Industry analysts estimate that AI-related workloads will consume approximately 20% of global DRAM wafer capacity in 2026. That share was negligible five years ago.

This is easier to see than to describe.

[Figure V2 graphic: One Pool. Two Shortages. DRAM wafer allocation and the double-shortage mechanism.]
V2 — Wafer Paradox. A finite DRAM wafer pool is increasingly split toward HBM, creating simultaneous HBM shortage and mainstream DRAM supply pressure — from the same wafer. As HBM allocation increases (HBM requires approximately 3–4× the wafer area per gigabyte), the pool available for DDR5 and LPDDR5X contracts proportionally. Both shortages share a single root cause.

The structural consequence is a double shortage with a single root cause. More wafer capacity pulled toward HBM means insufficient HBM for AI accelerator demand — because the absolute volume required keeps growing faster than allocation can shift — and insufficient mainstream DRAM for everything else, because the pool for DDR5 has contracted. Two shortages. One wafer pool. Not a demand spike in two separate markets; a single allocation constraint producing pressure in both directions at once.
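The allocation arithmetic behind the double shortage can be written down in a few lines. In the sketch below, only the roughly 3 to 4 times area premium comes from the discussion above; the wafer counts and gigabytes per wafer are made-up round numbers used purely to show the shape of the trade-off.

```python
# Illustrative wafer-allocation arithmetic. The 3.5x area premium is the midpoint
# of the roughly 3-4x figure above; every other number is a made-up round value.

TOTAL_WAFERS = 100_000        # hypothetical monthly DRAM wafer starts
GB_PER_DDR5_WAFER = 6_000     # hypothetical DDR5 gigabytes yielded per wafer
HBM_AREA_PREMIUM = 3.5        # wafer area per GB of HBM relative to DDR5

def split_pool(hbm_share):
    """Gigabyte output of each segment when `hbm_share` of wafers go to HBM."""
    hbm_wafers = TOTAL_WAFERS * hbm_share
    ddr_wafers = TOTAL_WAFERS - hbm_wafers
    hbm_gb = hbm_wafers * GB_PER_DDR5_WAFER / HBM_AREA_PREMIUM
    ddr_gb = ddr_wafers * GB_PER_DDR5_WAFER
    return hbm_gb, ddr_gb

for share in (0.05, 0.20):
    hbm_gb, ddr_gb = split_pool(share)
    print(f"HBM share {share:.0%}: HBM {hbm_gb / 1e6:.1f}M GB, "
          f"mainstream {ddr_gb / 1e6:.0f}M GB")
# Raising the HBM share from 5% to 20% quadruples HBM output in wafer terms,
# yet every HBM gigabyte still displaces roughly 3.5 gigabytes of mainstream DRAM.
```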

For contrast: NAND flash operates on a structurally separate production base — different fabs, different wafer pools, different process technology. In the first half of 2025, NAND was in oversupply even as DRAM tightened. That disconnection is not a coincidence. It is a structural consequence of the fact that NAND and DRAM manufacturing do not share the same production constraint.

The problem is not just that AI needs more memory. Getting AI the memory it needs takes capacity away from everything else. And even if that wafer paradox were resolved — even if more HBM capacity came online at scale — a second ceiling is already visible inside the package itself.


04 — The Architecture Ceiling

Specifications Do Not Close the Gap

HBM4, specified in JEDEC JESD270-4 published in April 2025, delivers up to 2 terabytes per second of bandwidth per stack across a 2048-bit bus, with a maximum capacity of 64 gigabytes per stack. This is a substantial advance over HBM3E. It is also not enough.
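It is worth unpacking where the headline figure comes from, because it shows which levers remain. Using only the numbers quoted above, the implied per-pin data rate falls out directly:

```python
# Sanity check on the per-stack figure quoted above (stack bandwidth and bus
# width only; no additional assumptions).
bus_width_bits = 2048          # HBM4 interface width
stack_bandwidth_Bps = 2e12     # up to 2 TB/s per stack

per_pin_gbps = stack_bandwidth_Bps * 8 / bus_width_bits / 1e9
print(f"implied per-pin data rate: {per_pin_gbps:.1f} Gb/s")   # ~7.8 Gb/s
# Stack bandwidth is bus width times per-pin rate; package bandwidth is then
# stack bandwidth times however many stacks the interposer, power delivery,
# and thermal budget allow.
```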

The pattern across successive AI accelerator generations has been consistent: GPU compute throughput has scaled faster than available memory bandwidth per GPU package. Within each hardware generation, the HBM upgrade partially closes the gap the previous GPU generation opened. When the next compute generation launches, the gap reopens. The absolute bandwidth numbers increase on both sides. The ratio does not converge.

This is not a supply chain lag that will resolve itself. It reflects a structural difference in how compute capability and memory bandwidth scale with process node improvements. Compute benefits directly from transistor density and clock scaling. Memory bandwidth per GPU package is constrained by how many HBM stacks fit in the package, how wide the bus can be, and how fast the interface can run without thermal or signal integrity problems.
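The divergence can be expressed as a single ratio: bytes of deliverable bandwidth per FLOP of compute. The sketch below uses entirely hypothetical gen-over-gen growth factors, not vendor figures (the planned chart below is where sourced per-package numbers belong); it only shows that the ratio cannot converge unless bandwidth scales as fast as compute.

```python
# Illustrative only: hypothetical gen-over-gen growth factors, not vendor data.
# Sourced per-GPU-package figures belong in the planned chart below.

def bytes_per_flop_trend(start_ratio, compute_growth, bandwidth_growth, generations):
    """Relative bytes-of-bandwidth-per-FLOP across successive generations when
    compute and memory bandwidth scale at different rates."""
    ratio, history = start_ratio, [start_ratio]
    for _ in range(generations):
        ratio *= bandwidth_growth / compute_growth
        history.append(ratio)
    return history

# Hypothetical: compute grows 2.5x per generation, package bandwidth 1.8x.
for gen, r in enumerate(bytes_per_flop_trend(1.0, 2.5, 1.8, 4)):
    print(f"gen {gen}: relative bytes/FLOP = {r:.2f}")
# The ratio converges only if bandwidth growth keeps pace with compute growth;
# any shortfall compounds, which is why each HBM upgrade narrows the gap
# without closing it.
```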

Planned: Memory–Compute Divergence Chart

GPU compute throughput and available memory bandwidth per GPU package have both increased across successive accelerator generations — but not at the same rate. The gap that narrows within a generation reopens when the next compute generation launches. Per-GPU-package bandwidth (not per-stack) is the operative metric, as the number of HBM stacks per package has also increased across generations.

To be added when per-GPU-package bandwidth figures across H100 / H200 / B200 are confirmed from sourced data.

The thermal constraint makes this concrete. imec research presented at IEDM 2025 found that, under the stated simulation conditions, 3D HBM-on-GPU integration — where memory stacks are placed directly atop the GPU die — reached GPU temperatures of 141.7°C without active thermal mitigation. The current 2.5D CoWoS architecture, where GPU and HBM stacks sit side by side on an interposer, reached approximately 69°C under equivalent conditions. These are simulation results under specific conditions, not production failure thresholds. But the magnitude of the differential is informative: the thermal gap between current 2.5D packaging and next-generation 3D integration is not a rounding error. It is a physics constraint on how quickly 3D integration can be deployed without co-designed thermal management.

Planned: Packaging Thermal Comparison

imec simulation data (IEDM 2025) under stated test conditions: 3D HBM-on-GPU integration reaches 141.7°C without active thermal mitigation; the current 2.5D CoWoS architecture reaches approximately 69°C under equivalent conditions. The thermal differential is a physics constraint on how fast 3D integration can be deployed, not a logistics or supply chain problem.

To be added when imec source conditions are confirmed as correctly characterized.

There is a quieter signal worth naming. HBM4’s base die — the bottom die in the stack that handles interface logic and control functions — is now manufactured on TSMC’s leading-edge logic process, not on a conventional memory process. That is not a packaging detail. It is evidence that the boundary between memory and compute logic is already dissolving at the die level. Memory is becoming a co-designed component.

Hardware upgrades are happening, but the architecture is hitting a ceiling that specifications alone do not resolve.


05 — Solution Landscape

Three Layers. None Sufficient.

Three layers of response are underway simultaneously. None is sufficient alone. Understanding what each layer does — and what it does not do — is the prerequisite for understanding why the co-design conclusion is not a preference but a structural requirement.

Layer 1 — Supply Expansion

Supply expansion is the necessary floor. SK Hynix has committed approximately $3.87 billion to an HBM packaging and R&D facility in Indiana, with production starting in the second half of 2028. Micron has announced plans to invest more than $100 billion in US memory manufacturing over 20 years. These investments address the wafer capacity constraint at its root. Without them, the shortage deepens structurally as AI accelerator demand continues to compound.

They are also too slow for the current window. Three-to-five-year lead times on new fab capacity mean these investments do not relieve 2025 through 2028 demand. More critically, expanded HBM output does not automatically close the bandwidth gap described in Section 04. Whether that HBM fits inside the package at the required thermal and power envelope is a separate co-design question. More square footage of fab floor does not answer it.

Layer 2 — Architecture Upgrades

Architecture upgrades buy time, generation by generation. CXL — Compute Express Link — enables memory pooling across servers, allowing compute nodes to access a shared pool of DRAM over a high-speed fabric. CXL memory pooling introduces latency of approximately 110 to 150 nanoseconds, compared to approximately 60 to 80 nanoseconds for local DDR5; these figures are configuration-dependent and should not be treated as universal constants. At inference workload timescales, that additional latency matters. CXL is a capacity extender, not a bandwidth substitute. It addresses the “not enough memory” problem at the capacity dimension. It does not address the “not enough bandwidth per compute cycle” problem. These are different problems, and conflating them produces architectures that are memory-rich but still bandwidth-starved.
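A minimal sketch of that distinction, assuming the latency ranges quoted above and an entirely hypothetical placement policy (the function, tier names, and thresholds are illustrative, not any real allocator or vendor API):

```python
# Hypothetical placement heuristic to illustrate "capacity extender, not
# bandwidth substitute." The policy and thresholds are illustrative, not a real
# allocator or vendor API. Approximate load-to-use latencies from the text:
# local DDR5 ~60-80 ns, CXL-attached pool ~110-150 ns; HBM serves the
# per-token bandwidth path.

def place(tensor_bytes, read_every_token, local_ddr_free_bytes):
    """Pick a memory tier for a tensor in an inference server."""
    if read_every_token:
        # Weights and KV cache are streamed on every decode step: this is a
        # bandwidth-per-cycle problem, so it belongs in HBM regardless of size.
        return "HBM"
    if tensor_bytes <= local_ddr_free_bytes:
        return "DDR5"       # capacity that fits locally stays at ~60-80 ns
    return "CXL_pool"       # overflow capacity accepts the extra ~50-70 ns

# Example: a cold embedding table overflows local DDR5 and lands in the CXL
# pool, while decoder weights stay pinned in HBM.
print(place(400 * 2**30, read_every_token=False, local_ddr_free_bytes=256 * 2**30))
print(place(30 * 2**30, read_every_token=True, local_ddr_free_bytes=256 * 2**30))
```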

Industry roadmaps point toward HBM4E and HBM5 in the late 2020s, with the usual expectation that each generation will push bandwidth higher. But the pattern from Section 04 continues: each bandwidth step narrows the gap within a generation, and the next compute generation reopens it.

Layer 3 — Algorithm and System Co-Design

Algorithm and system co-design is the decisive layer — and the least certain one. DeepSeek-V2’s Multi-head Latent Attention architecture reduces KV cache memory by 93.3% compared to standard multi-head attention. This is a result specific to DeepSeek-V2’s MLA architecture; it does not generalize as a claim about KV cache compression across all model architectures. But the specific result is real, and it illustrates the class of leverage available at the algorithm level: the same model capability at a fraction of the memory footprint. KV-cache compression and related memory-saving techniques are already moving from papers into production-style inference systems.
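The class of saving is easiest to see in the standard KV-cache sizing formula. The sketch below uses illustrative model dimensions and a generic compressed-latent variant; it is not DeepSeek-V2's actual configuration, only an indication of why shrinking what is cached per token moves the memory footprint so much.

```python
# Standard KV-cache sizing for multi-head attention, plus a generic compressed-
# latent variant. All dimensions are illustrative; this is not DeepSeek-V2's
# actual configuration, only the class of saving MLA-style designs target.

def kv_cache_bytes_mha(layers, kv_heads, head_dim, seq_len, batch, bytes_per_elem=2):
    # 2x for the separate key and value tensors cached per token, per layer.
    return 2 * layers * kv_heads * head_dim * seq_len * batch * bytes_per_elem

def kv_cache_bytes_latent(layers, latent_dim, seq_len, batch, bytes_per_elem=2):
    # A latent-attention-style cache stores one compressed vector per token,
    # per layer, instead of full per-head keys and values.
    return layers * latent_dim * seq_len * batch * bytes_per_elem

# Illustrative shapes: 60 layers, 32 KV heads of 128 dims, 32k context, batch 8.
mha = kv_cache_bytes_mha(60, 32, 128, 32_768, 8)
latent = kv_cache_bytes_latent(60, 512, 32_768, 8)
print(f"MHA KV cache:    {mha / 2**30:.0f} GiB")
print(f"latent KV cache: {latent / 2**30:.0f} GiB ({1 - latent / mha:.1%} smaller)")
```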

Whether these algorithm-level savings are actually reducing HBM purchase orders at hyperscale remains an open engineering question. The alternative is that freed memory capacity is immediately reinvested in larger batch sizes, longer context windows, and more complex model variants — savings that enable more demand rather than relieving existing demand. This question matters for whether Layer 3 acts as a relief valve or as an accelerant. The honest answer is: we do not yet know. Anyone who states a confident position here without citing specific hyperscaler procurement data should be read with skepticism.

No solution sits in the near-term, high-impact position. That absence is itself the finding.

[Figure V4 graphic: No Single Silver Bullet. Speed vs. shortage relief: how soon each layer can help plotted against how much it can help.]
V4 — Solution Landscape Map. No single solution closes the compute-memory gap within the critical 2025–2028 window. Supply expansion (SK Hynix Indiana, Micron US manufacturing) is necessary but too slow. Architecture upgrades (HBM4/4E, CoWoS expansion, CXL pooling) narrow the gap per generation without closing it. Algorithm co-design (KV cache compression, speculative decoding, memory hierarchy redesign) acts on both supply and demand simultaneously — but whether those savings reduce demand or simply enable larger models that consume the savings remains an open question. The absence of any solution in the near-term, high-impact position is itself the argument.

What this structure implies is not a list of solutions to be implemented in parallel. It implies a different organizing principle for how memory is thought about in the first place.


06 — The Conclusion

Co-Designed Memory Infrastructure

The shortage is not solved by adding more memory. This is the claim the mainstream narrative has not yet fully absorbed.

Adding more HBM helps — conditionally. It helps if that memory fits inside the package at the thermal envelope the accelerator generates under sustained inference load. It helps if the electrical delivery infrastructure can supply the power the additional stacks require without compromising signal integrity at the bus. It helps if the system software stack knows how to schedule workloads to the memory tier that serves them most efficiently. Each of those conditions is a co-design problem. None is a procurement problem.

Memory is already becoming an infrastructure layer. The evidence is not speculative — it is visible in engineering decisions already made. HBM4’s base die is manufactured on a leading-edge logic process, not a memory process; that die boundary is already a logic-memory hybrid. CXL is building a tiered memory hierarchy at the system level. KV-cache compression and related memory-saving techniques are already moving from papers into production-style inference systems. These are not predicted transitions. They are the ones already underway, already in silicon, already in production deployments.

The framing that memory is a commodity — purchased by the gigabyte, measured by cost per bit, treated as a pooled fungible resource — has not been wrong historically. It has become wrong now, at the layer where AI infrastructure meets the physical constraints of what a package can contain and what a bus can carry.

[Closing graphic: Negotiation with Physics. Useful memory is the balance point among bandwidth delivery, thermal path, package area, power delivery, and workload behavior.]
Closing — Negotiation with Physics. Memory in AI infrastructure is not a passive resource. It is under simultaneous pressure from five physical and system constraints — bandwidth delivery, thermal path, package area, power budget, and workload behavior. Co-design is not a preference. It is the only architecture that satisfies all five simultaneously.