Date: 2025-07-18

South Korea’s CXL memory-focused Panmnesia believes that AI clusters need both GPU node memory sharing and fast inter-GPU networking with a combined CXL and UALink/NVLink architecture.

Panmnesia has released a 56-page technical report titled “Compute Can’t Handle the Truth: Why Communication Tax Prioritizes Memory and Interconnects in Modern AI Infrastructure,” written by CEO Dr Myoungsoo Jung. The report outlines the trends in modern AI models, the limitations of current AI infrastructure in handling them, and how emerging memory and interconnect technologies – including Compute Express Link (CXL), NVLink, Ultra Accelerator Link (UALink), and High Bandwidth Memory (HBM) – can be used to overcome the limitations.

Dr Myoungsoo Jung

Jung stated: “This technical report was written to more clearly and accessibly share the ideas on AI infrastructure that we presented during a keynote last August. We aimed to explain AI and large language models (LLMs) in a way that even readers without deep technical backgrounds could understand. We also explored how AI infrastructure may evolve in the future, considering the unique characteristics of AI services.”

The technical report is divided into three main parts:

Trends in AI and Modern Data Center Architectures for AI Workloads

CXL Composable Architectures: Improving Data Center Architecture using CXL and Acceleration Case Studies

Beyond CXL: Optimizing AI Resource Connectivity in Data Center via Hybrid Link Architectures (CXL-over-XLink Supercluster)

The trends section looks at how AI applications based on sequence models – such as chatbots, image generation, and video processing – are now widely integrated into everyday life. It gives an overview of sequence models, their underlying mechanisms, and the evolution from recurrent neural networks (RNNs) to LLMs. It then explains how current AI infrastructures handle these models and identifies two limitations:

Communication overhead during synchronization

Low resource utilization resulting from rigid, GPU-centric architectures
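To give a feel for the first limitation, here is a minimal back-of-envelope sketch of synchronization overhead. It is not taken from the report: it assumes gradients are synchronized with a ring all-reduce, uses the standard bandwidth-only cost model for that collective, and assumes compute and communication do not overlap. The function names and every number are hypothetical.

```python
def ring_allreduce_seconds(payload_bytes: float, num_gpus: int, link_bytes_per_s: float) -> float:
    """Bandwidth-only estimate for a ring all-reduce (latency terms ignored)."""
    if num_gpus < 2:
        return 0.0
    # In a ring all-reduce each GPU sends 2 * (N - 1) / N of the payload in total.
    traffic = 2 * (num_gpus - 1) / num_gpus * payload_bytes
    return traffic / link_bytes_per_s


def communication_share(compute_s: float, comm_s: float) -> float:
    """Fraction of a step spent communicating, assuming no overlap with compute."""
    return comm_s / (compute_s + comm_s)


# Hypothetical numbers, chosen only for illustration.
grad_bytes = 14e9   # e.g. 7 billion parameters at 2 bytes each
link_bw = 50e9      # assumed per-GPU link bandwidth in bytes per second
compute_s = 1.0     # assumed compute time per training step in seconds

for n in (2, 8, 64):
    comm_s = ring_allreduce_seconds(grad_bytes, n, link_bw)
    print(n, round(comm_s, 3), round(communication_share(compute_s, comm_s), 3))
```

Under this simplified model the per-GPU traffic approaches twice the payload as the GPU count grows, so the communication share is set by link bandwidth rather than by adding more compute.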

Jung writes in the report that no single fixed architecture can fully satisfy all the compute, memory, and networking performance demands for LLM training, inference prefill and decode, and retrieval-augmented generation (RAG). He suggests the best way to address the limitations is to use CXL, and specifically CXL 3.0 with its multi-level switch cascading, advanced routing mechanisms, and comprehensive system-wide memory coherence capabilities.

Panmnesia has developed a CXL 3.0-compliant real-system prototype using its core technologies, including CXL intellectual property blocks and CXL switches. This prototype has been applied to accelerate real-world AI applications – such as RAG and deep learning recommendation models (DLRMs) – and, according to the company, has proven practical and effective.

Jung then proposes methods to build more advanced AI infrastructure through the integration of diverse interconnect technologies alongside CXL, including UALink, NVLink, and NVLink Fusion, collectively called XLink.

He says “CXL addresses critical memory-capacity expansion and coherent data-sharing challenges.” But there are “specific accelerator-centric workloads requiring efficient intra-accelerator communications,” and these are better served by “Ultra Accelerator Link (UALink) and Nvidia’s NVLink, collectively termed Accelerator-Centric Interconnect Link (XLink) in this technical report.”

Both CXL and XLink are needed to optimize AI superclusters: “XLink technologies provide direct, point-to-point connections explicitly optimized for accelerator-to-accelerator data exchanges, enhancing performance within tightly integrated accelerator clusters. In contrast to CXL, these XLink technologies do not support protocol-level cache coherence or memory pooling; instead, their focus is efficient, low-latency data transfers among accelerators with a single-hop Clos topology interconnect architecture.”

He notes: “UALink employs Ethernet-based communication optimized primarily for large-sized data transfers, whereas NVLink utilizes Nvidia’s proprietary electrical signaling, tailored for small-to-medium-sized data exchanges, such as tensor transfers and gradient synchronization between GPUs.”

Panmnesia Technical Report diagram

So “integrating CXL and XLink into a unified data center architecture, termed CXL over XLink, including CXL over NVLink and CXL over UALink, leverages their complementary strengths to optimize overall system performance. … this integration adopts two architectural proposals: i) ‘accelerator-centric clusters,’ optimized specifically for rapid intra-cluster accelerator communication, and ii) ‘tiered memory architectures,’ employing disaggregated memory pools to handle large-scale data.”

Jung then proposes “an extended, scalable architecture that integrates a tiered memory hierarchy within supercluster configurations, explicitly designed to address the diverse memory-performance demands of contemporary AI workloads. This structure comprises two distinct memory tiers: i) high-performance local memory managed via XLink and coherence-centric CXL, and ii) scalable, composable memory pools enabled through capacity-oriented CXL.” The report discusses how these would be deployed in an AI data center, with notes on hierarchical data placement and management.
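As a rough illustration of what hierarchical data placement across the two tiers could look like, here is a minimal sketch. It is not the report's method: the greedy policy, the `Region` type, the `place` function, the access-frequency metric, and all sizes are hypothetical and chosen only to make the idea concrete.

```python
from dataclasses import dataclass


@dataclass
class Region:
    name: str
    size_gb: float            # must be positive
    accesses_per_step: float  # hypothetical hotness metric


def place(regions, local_capacity_gb):
    """Greedy placement: regions with the most accesses per GB fill the local tier first."""
    placement = {}
    free = local_capacity_gb
    # Sort by access density so small, frequently touched regions win local space.
    ordered = sorted(regions, key=lambda r: r.accesses_per_step / r.size_gb, reverse=True)
    for r in ordered:
        if r.size_gb <= free:
            placement[r.name] = "local"  # tier i: XLink / coherence-centric CXL
            free -= r.size_gb
        else:
            placement[r.name] = "pool"   # tier ii: capacity-oriented CXL memory pool
    return placement


# Hypothetical working set for an inference node.
regions = [
    Region("kv_cache", 40, 400),
    Region("model_weights", 60, 300),
    Region("rag_vector_index", 500, 50),
    Region("embedding_tables", 200, 20),
]
print(place(regions, local_capacity_gb=128))
```

With these made-up figures the KV cache and model weights fit in the local tier, while the much larger vector index and embedding tables fall through to the composable pool. A real policy would also need to handle migration as access patterns change.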

Download Jung’s technical report here.