Intel FPGA used to hook non-x86 processors to Optane PMem

Intel Stratix FPGA

SMART Modular’s Kestral Optane Memory card connects to non-x86 processors with an Intel Stratix FPGA, which contains Optane controller functions.

Kestral is an add-in PCIe card with up to 2TB of Optane Persistent Memory (PMem), which can be used for memory acceleration or computational storage by Xeon, AMD, Arm or Nvidia processors. Until now, only second-generation or later Xeon Scalable CPUs could connect to Optane PMem, because they contain the Optane PMem controller functions without which the PMem cannot operate as additional memory alongside the host system's DRAM. The Stratix-10 DX FPGA used in the Kestral card has the controller functions programmed within it.

Blocks & Files asked SMART Modular's solutions architect, Pekon Gupta, some questions about how the Kestral card interfaces to hosts, and his answers revealed how the FPGA interfaces to the Optane PMem.

Re: memory expansion for Xeon, AMD, Arm, and Nvidia servers, what software would be needed on the servers for this?

Pekon Gupta: Kestral exposes the memory to the accelerator as a memory-mapped I/O (MMIO) region. Applications can benefit from this large pool of extended memory by mapping it into their application space. A standard PCIe driver can enumerate the device, and Intel has released specific PCIe drivers which can be built on any standard Linux distribution.
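To make this concrete, here is a minimal sketch of how an application could map such an MMIO-exposed region into its address space on Linux. The sysfs path, BAR index and mapping size below are hypothetical placeholders; the real values depend on the card's PCIe address and the drivers Intel supplies.

```c
/* Minimal sketch: mapping a PCIe-exposed memory region into an
 * application's address space on Linux. The sysfs path, BAR index and
 * size are hypothetical placeholders. */
#include <fcntl.h>
#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>
#include <sys/mman.h>
#include <unistd.h>

int main(void)
{
    /* Hypothetical BAR resource file for the add-in card. */
    const char *bar_path = "/sys/bus/pci/devices/0000:3b:00.0/resource2";
    size_t map_len = 1UL << 30;              /* map 1GiB of the region */

    int fd = open(bar_path, O_RDWR | O_SYNC);
    if (fd < 0) { perror("open"); return EXIT_FAILURE; }

    /* Map the MMIO region; the application then uses it like memory. */
    uint8_t *mem = mmap(NULL, map_len, PROT_READ | PROT_WRITE,
                        MAP_SHARED, fd, 0);
    if (mem == MAP_FAILED) { perror("mmap"); close(fd); return EXIT_FAILURE; }

    mem[0] = 0xA5;                 /* write a byte into the extended memory */
    printf("first byte: 0x%02x\n", mem[0]);

    munmap(mem, map_len);
    close(fd);
    return EXIT_SUCCESS;
}
```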

What mode of Optane PMem operation (DAX for example) is supported?

Pekon Gupta: The current version of Kestral supports Intel Optane PMem in Memory Mode only. There are plans to [add] App Direct mode in future revisions.
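Memory Mode needs no application changes, but for context on what App Direct would add: on platforms that already support it, App Direct is typically consumed by memory-mapping a file on a DAX-capable (fsdax) filesystem, so loads and stores bypass the page cache. Below is a minimal sketch, assuming a hypothetical fsdax mount at /mnt/pmem; it illustrates App Direct usage in general, not Kestral's planned implementation.

```c
/* Sketch of how App Direct mode is typically consumed on Linux:
 * mmap a file on an fsdax filesystem so loads and stores go straight
 * to persistent memory. The mount point is a hypothetical example. */
#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <sys/mman.h>
#include <unistd.h>

#ifndef MAP_SHARED_VALIDATE
#define MAP_SHARED_VALIDATE 0x03
#endif
#ifndef MAP_SYNC
#define MAP_SYNC 0x80000
#endif

int main(void)
{
    const char *path = "/mnt/pmem/example.dat";   /* hypothetical fsdax mount */
    size_t len = 4096;

    int fd = open(path, O_CREAT | O_RDWR, 0644);
    if (fd < 0) { perror("open"); return EXIT_FAILURE; }
    if (ftruncate(fd, (off_t)len) < 0) { perror("ftruncate"); close(fd); return EXIT_FAILURE; }

    /* MAP_SYNC keeps page-table mappings consistent with on-media
     * metadata, so stores become durable with a simple cache flush. */
    char *pmem = mmap(NULL, len, PROT_READ | PROT_WRITE,
                      MAP_SHARED_VALIDATE | MAP_SYNC, fd, 0);
    if (pmem == MAP_FAILED) { perror("mmap"); close(fd); return EXIT_FAILURE; }

    strcpy(pmem, "written directly to persistent memory, no page cache");

    munmap(pmem, len);
    close(fd);
    return EXIT_SUCCESS;
}
```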

How is the Optane memory connected to the memory subsystems on the non-x86 servers which don't support DDR-T? As I understand it, DDR-T is Intel's protocol for Xeon CPUs (with embedded Optane controller functions) to talk to Optane PMem. Arm, AMD, and Nvidia processors don't support it.

Pekon Gupta: The Intel FPGA controller on the Kestral add-in card converts the PCIe protocol to the DDR-T protocol. Therefore, the host platform only needs a 16-lane-wide PCIe bus; the host never interacts with Optane PMem directly using any proprietary protocol. This card also supports standard DDR4 RDIMMs, so if the end user needs better performance, they can replace the Intel Optane PMem DIMMs with 256GB DDR4 DIMMs.

Why is SMART specifying “Possibly CCIX coherent attached Optane”? What is needed for the possibility to become actual?

Pekon Gupta: This card was designed for a targeted customer when CCIX was still around. Therefore, there is a possibility of adding CCIX support using third-party IP on the FPGA. There are still a few Arm-based systems on the market which support a CCIX Home Agent, and these systems can benefit from the large pool of memory expansion available through this card.

Will PCIe gen 5 be supported?

Pekon Gupta: Not in the current generation of hardware. The current FPGA controller can only support PCIe Gen 4.0.

Intel Stratix-10 FPGA

Will CXL v1.1 be supported?

Pekon Gupta: The current generation of hardware does not support CXL, because the current FPGA controller does not support it. However, most of our customers are asking for CXL 2.0 support and beyond, so we are considering that for future revisions.

Will CXL v2.0 be supported?

Pekon Gupta: CXL is not supported in the current generation of Kestral. We are in discussion with multiple CXL controller suppliers to build a CXL-based add-in card, and we would like to hear from interested customers about which features they would like to see in future versions of Kestral or similar CXL-based accelerators.

Memory Acceleration or Storage Cache – how can Optane-based Kestral (with Optane slower than DRAM) be used for memory acceleration?

Pekon Gupta: Although Optane is slower than DRAM, offloading certain fixed functions onto the FPGA compensates for the latency by bringing compute next to the data.

Most architectures today support bringing (copying) the data near the compute engine (CPU or GPU), but if the data runs to hundreds of gigabytes it may be more efficient to bring "some" fixed compute functions near the data, to prefilter and re-arrange it. This is exactly the same concept used in "processing in memory" (PIM) technology such as Samsung's AXDIMM. Kestral is just an extension of the PIM concept.
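To make the trade-off concrete, here is a toy sketch of the near-data path: a fixed filter function runs against the card-resident data and only the matches cross the bus back to the host. The kestral_filter_offload() call is purely hypothetical, standing in for whatever fixed function the FPGA would run next to the PMem, and it is simulated on the host in this sketch.

```c
/* Toy illustration of near-data processing: instead of copying the full
 * dataset to the host and filtering there, a fixed filter runs next to
 * the data and only matching records cross the bus.
 * kestral_filter_offload() is a hypothetical stand-in for an FPGA fixed
 * function, simulated on the host for this sketch. */
#include <stddef.h>
#include <stdint.h>
#include <stdio.h>

#define N_RECORDS 1000000u

/* Hypothetical offload: scan records held in card-attached memory and
 * copy only those above `threshold` into the small host buffer. */
static size_t kestral_filter_offload(const uint32_t *card_mem, size_t n,
                                     uint32_t threshold,
                                     uint32_t *host_buf, size_t host_cap)
{
    size_t out = 0;
    for (size_t i = 0; i < n && out < host_cap; i++)
        if (card_mem[i] > threshold)
            host_buf[out++] = card_mem[i];
    return out;                 /* only the filtered subset moves over PCIe */
}

int main(void)
{
    /* Pretend this array lives in the card's Optane-backed memory pool. */
    static uint32_t card_mem[N_RECORDS];
    for (size_t i = 0; i < N_RECORDS; i++)
        card_mem[i] = (uint32_t)(i % 1000);

    uint32_t host_buf[64];
    size_t hits = kestral_filter_offload(card_mem, N_RECORDS, 997,
                                         host_buf, 64);

    printf("records crossing the bus: %zu of %u\n", hits, N_RECORDS);
    return 0;
}
```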

When modern x86 servers support Optane PMem, why would you need a Kestral Storage Cache? Is it for non-Optane PMem-supporting servers?

Pekon Gupta: There are two benefits, seen on both Intel and non-Intel platforms.

1) By using the Kestral add-in card you can attach the Optane PMem to PCIe slots, which frees up DDR4 or DDR5 DIMM slots for adding more direct-attached high-speed memory.

2) Kestral hides the nuances of the Optane DIMM's proprietary protocol and allows users to focus on using the memory, or on adding custom hardware engines to accelerate their workloads.

Host server offload functions such as compression, KV compaction and RAID 1 are mentioned as use cases. This, as I understand it, is sometimes called compute-in-storage and is also a function of some smart NICs. Putting these functions in an Optane PMem + Arm + FPGA card seems overkill. What advantage does Optane bring here? Is it speed?

Pekon Gupta: Compute-in-storage and processing-in-memory are two sides of the same coin. The idea is to offload a few fixed compute functions near to the large pool of data so that the host does not need to copy large [amounts of] data and then filter and discard most of it. Optane brings the advantage of a large pool of memory. A single Optane DIMM supports 512GB of capacity, and four 512GB DIMMs give a total of 2TB of capacity per Kestral card. PIM and computational storage become cost-effective only with a large density of memory or storage, and this high DIMM density keeps $/GB at acceptable levels from a cost point of view.

Can you provide numbers to justify Optane PMem use?

Pekon Gupta: We have a few benchmark data [points] which can be shared under NDA, as there are some proprietary implementations involved. But we were able to achieve similar performance to what an Optane DIMM gives when directly attached to the processor bus, so we concluded that PCIe is not the bottleneck here.