BaM

BaM – Big accelerator Memory – provides high-level abstractions for accelerators [GPUs] to make on-demand, fine-grained, high-throughput access to storage while enhancing storage access performance. To this end, BaM provisions storage I/O queues and buffers in GPU memory.
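
As a rough sketch of that provisioning step, the CUDA snippet below allocates submission/completion rings and an I/O buffer directly in GPU memory. The structures, field names, and sizes here are illustrative assumptions, not BaM's actual API; a real deployment would also register these GPU addresses with the NVMe controller (via GPUDirect-style peer mappings) so the SSD can DMA into them, a step omitted here.

```cuda
// Hypothetical sketch: provisioning NVMe-style queues and buffers in GPU
// memory with plain CUDA allocations. Structure and field names are
// illustrative assumptions, not BaM's actual API.
#include <cuda_runtime.h>
#include <cstdint>
#include <cstdio>

struct SubmissionEntry { uint64_t lba; uint32_t num_blocks; uint64_t dest_addr; };
struct CompletionEntry { uint32_t status; uint32_t cid; };

int main() {
    const int QUEUE_DEPTH = 1024;        // entries per queue (illustrative)
    SubmissionEntry* d_sq = nullptr;     // submission ring lives in GPU memory
    CompletionEntry* d_cq = nullptr;     // completion ring lives in GPU memory
    void* d_io_buffer = nullptr;         // destination buffer for SSD reads

    cudaMalloc(&d_sq, QUEUE_DEPTH * sizeof(SubmissionEntry));
    cudaMalloc(&d_cq, QUEUE_DEPTH * sizeof(CompletionEntry));
    cudaMalloc(&d_io_buffer, QUEUE_DEPTH * 4096ull);  // one 4 KiB block per entry

    printf("queues and buffers provisioned in GPU memory\n");

    cudaFree(d_sq); cudaFree(d_cq); cudaFree(d_io_buffer);
    return 0;
}
```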

It was developed by a collaboration of industry engineers and academic researchers at NVIDIA, IBM, the University of Illinois Urbana-Champaign, and the University at Buffalo. The approach maximizes the parallelism of GPU threads and uses a user-space NVMe driver, helping to ensure that data is delivered to GPUs on demand with minimal latency.

[Figure: Logical view of BaM design]

A research paper, GPU-Initiated On-Demand High-Throughput Storage Access in the BaM System Architecture, explains that BaM “features a fine-grained software cache to coalesce data storage requests while minimizing I/O amplification effects. This software cache communicates with the storage system through high-throughput queues that enable the massive number of concurrent threads in modern GPUs to generate I/O requests at a high-enough rate to fully utilize the storage devices, and the system interconnect.”

It goes on to say: “The GPUDirect Async family of technologies accelerate the control path when moving data directly into GPU memory from memory or storage. Each transaction involves initiation, where structures like work queues are created and entries within those structures are readied, and triggering, where transmissions are signaled to begin. To our knowledge, BaM is the first accelerator-centric approach where GPUs can create on-demand accesses to data where it is stored, be it memory or storage, without relying on the CPU to initiate or trigger the accesses. Thus, BaM marks the beginning of a new variant of this family that is GPU kernel initiated (KI): GPUDirect Async KI Storage.”

BaM maintains a software cache in the GPU’s memory. Describing how GPU threads access it, the research paper says: “If an access hits in the cache, the thread can directly access the data in GPU memory. If the access misses, the thread needs to fetch data from the backing memory [NVMe SSDs]. The BaM software cache is designed to optimize the bandwidth utilization to the backing memory in two ways: (1) by eliminating redundant requests to the backing memory and (2) by allowing users to configure the cache for their application’s needs.”
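
The miss path described above implies that concurrent threads wanting the same block must be deduplicated so only one storage request is issued. A minimal device-side sketch of that idea, assuming a hypothetical per-block cache slot with an atomically managed state word (none of these names come from BaM’s code), might look like:

```cuda
#include <cuda_runtime.h>
#include <cstdint>

enum SlotState : unsigned int { INVALID = 0, BUSY = 1, VALID = 2 };

struct CacheSlot {
    unsigned int state;   // INVALID / BUSY / VALID
    uint64_t     lba;     // which storage block the slot currently holds
    uint8_t*     data;    // pointer into a GPU-resident data buffer
};

// Stub: a real implementation would enqueue an NVMe read that DMAs the
// block into slot->data (see the I/O stack sketch further down).
__device__ void issue_read(CacheSlot* slot, uint64_t lba) {
    slot->lba = lba;      // placeholder only; no actual storage access
}

__device__ uint8_t* cache_access(CacheSlot* slot, uint64_t lba) {
    // The first thread to miss flips INVALID -> BUSY and alone issues the
    // I/O, so concurrent threads wanting the same block create no
    // redundant requests to the backing memory.
    if (atomicCAS(&slot->state, INVALID, BUSY) == INVALID) {
        issue_read(slot, lba);
        __threadfence();                // publish data before marking VALID
        atomicExch(&slot->state, VALID);
    } else {
        // Hit, or another thread is already fetching: wait until VALID.
        while (atomicAdd(&slot->state, 0u) != VALID) { /* spin */ }
    }
    return slot->data;                  // data is now readable in GPU memory
}
```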

To fetch data from the NVMe SSDs: “The GPU thread enters the BaM I/O stack to prepare a storage I/O request, enqueues it to a submission queue, and then waits for the storage controller to post the corresponding completion entry. The BaM I/O stack aims to amortize the software overhead associated with the storage submission/completion protocol by leveraging the GPU’s immense thread-level parallelism, and enabling low-latency batching of multiple submission/completion (SQ/CQ) queue entries to minimize the cost of expensive doorbell register updates and reducing the size of critical sections in the storage protocol.”
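
A hedged sketch of that submission/completion flow, using illustrative queue structures rather than real NVMe command formats: each thread claims a unique slot with a single atomic, fills its entry, and polls the matching completion slot. The one-atomic claim is what keeps the critical section short even with thousands of concurrent threads, which is the property the paper highlights.

```cuda
#include <cuda_runtime.h>
#include <cstdint>

struct SubmissionEntry { uint64_t lba; uint32_t num_blocks; uint64_t dest_addr; uint32_t cid; };
struct CompletionEntry { volatile uint32_t done; uint32_t cid; };

struct IoQueuePair {
    SubmissionEntry* sq;        // submission ring in GPU memory
    CompletionEntry* cq;        // completion ring in GPU memory
    unsigned int*    sq_tail;   // shared tail counter, claimed atomically
    unsigned int     depth;     // entries per ring
};

__device__ void read_block(IoQueuePair* q, uint64_t lba, uint64_t dest) {
    // One atomic claims a unique slot, so thousands of threads can share
    // the queue without a long critical section.
    unsigned int ticket = atomicAdd(q->sq_tail, 1u);
    unsigned int slot = ticket % q->depth;

    SubmissionEntry* e = &q->sq[slot];
    e->lba = lba; e->num_blocks = 1; e->dest_addr = dest; e->cid = ticket;
    __threadfence_system();     // order the entry before any doorbell write

    // Doorbell update and queue backpressure are omitted here; a batched
    // doorbell sketch follows below.

    // Poll the matching completion slot until the controller marks it done.
    while (q->cq[slot].done == 0u) { /* spin */ }
}
```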

A doorbell register is a memory-mapped hardware register used to signal a storage controller that new work is ready to be processed.
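
Because each doorbell write is a costly MMIO transaction, the batching the paper mentions can be approximated by electing one thread to ring the doorbell on behalf of a whole warp. The sketch below uses standard CUDA warp intrinsics; it illustrates the batching idea and is not BaM’s actual implementation (partial-warp and ring-wraparound handling are omitted).

```cuda
#include <cstdint>

__device__ void ring_doorbell_batched(volatile uint32_t* doorbell,
                                      unsigned int my_tail) {
    unsigned int mask = __activemask();   // threads submitting together
    // Warp-wide max of the tail values (assumes a full 32-lane mask for
    // brevity; production code would handle partial masks and wraparound).
    unsigned int tail = my_tail;
    for (int offset = 16; offset > 0; offset >>= 1)
        tail = max(tail, __shfl_down_sync(mask, tail, offset));

    int leader = __ffs(mask) - 1;         // lowest active lane rings the bell
    if ((threadIdx.x & 31) == leader) {
        *doorbell = tail;                 // one MMIO write covers 32 requests
    }
    __syncwarp(mask);
}
```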

The paper notes that BaM is implemented entirely in open source, and that both its hardware and software requirements are publicly accessible.