Impossible becomes possible: MemVerge supporting distributed HPC app checkpointing

Memory virtualizer MemVerge is supporting the checkpointing of distributed, multi-threaded high-performance computing jobs by allying with the DMTCP Project.

Checkpointing is the saving of an application’s state so that it can be restarted if the system fails. Saving the state of a single application is a well-understood technique, but saving the collective state of an application that is both multi-threaded and distributed across several compute nodes is vastly more difficult. MemVerge says that checkpointing has been almost impossible for complex distributed HPC apps with massive datasets. The open-source Distributed MultiThreaded Checkpointing Project (DMTCP) has nonetheless achieved it.
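To make the single-application case concrete, here is a minimal sketch in Python of application-level checkpoint and restart. It is illustrative only: the function names, the `fail_at` parameter and the pickle file format are assumptions made for this example, and the approach differs from DMTCP, which checkpoints transparently without changes to user code.

```python
import os
import pickle


def save_checkpoint(path, state):
    # Write to a temporary file, then rename it, so a crash mid-write
    # never leaves a half-written checkpoint behind.
    tmp = path + ".tmp"
    with open(tmp, "wb") as f:
        pickle.dump(state, f)
    os.replace(tmp, path)


def load_checkpoint(path):
    # Resume from the saved state if one exists, otherwise start fresh.
    if os.path.exists(path):
        with open(path, "rb") as f:
            return pickle.load(f)
    return {"step": 0, "total": 0}


def run(path, n_steps, every=10, fail_at=None):
    # Toy job: sum the integers 0..n_steps-1, checkpointing every `every` steps.
    # `fail_at` is a hypothetical knob used here to simulate a system failure.
    state = load_checkpoint(path)
    while state["step"] < n_steps:
        if fail_at is not None and state["step"] == fail_at:
            raise RuntimeError("simulated failure")
        state["total"] += state["step"]
        state["step"] += 1
        if state["step"] % every == 0:
            save_checkpoint(path, state)
    return state["total"]
```

If the job fails part-way through, calling `run` again with the same path resumes from the last saved step instead of from zero. The hard part in the distributed case is that every process and thread must be captured at a mutually consistent point, which a per-process scheme like this does not address.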

Mark Nossokoff, senior research analyst at Hyperion Research, provided a statement: “Bringing checkpointing capability to big memory architectures with pooled, distributed memory across multiple nodes operating on large datasets should further enable adoption of in-memory computing techniques within the HPC and AI communities. Kudos to MemVerge for stepping up to provide the industry stewardship to make DMTCP a commercial reality.”

DMTCP transparently checkpoints a single-host or distributed computation in user-space — with no modifications to user code or to the OS. It works on most Linux applications, including Python, Matlab, R, GUI desktops, MPI, etc. It’s usable for workloads including VLSI circuit simulators, circuit verification, formalisation of mathematics, bioinformatics, network simulators, high energy physics, cybersecurity, big data, middleware, mobile computing, cloud computing, virtualization of GPUs, and general high performance computing (HPC).

The MemVerge-DMTCP partnership will facilitate DMTCP’s move into the market and includes:

  • MemVerge developers joining the DMTCP Project and contributing to open-source development; 
  • MemVerge providing commercial support for the open-source DMTCP software; and 
  • MemVerge integrating the fully tested and supported version into application-specific Big Memory Solutions.

MemVerge has started collaborating with the National Energy Research Scientific Computing Center (NERSC) to optimise MPI-Agnostic Network-Agnostic (MANA), a plugin on top of DMTCP that has been used for transparent checkpointing of MPI (Message Passing Interface) on the Cori and Perlmutter supercomputers.

NERSC DMTCP diagram.

NERSC documentation states: “DMTCP implements a coordinated checkpointing, as shown in the figure below. There is one DMTCP coordinator for each job (computation) to checkpoint, which is started from one of the nodes allocated to the job, using the dmtcp_coordinator command. Application binaries are then started under the DMTCP control using the dmtcp_launch command, connecting them to the coordinator upon startup. For each user process, a checkpoint thread is spawned that executes commands from the coordinator (default port: 7779). Then, DMTCP starts transparent checkpointing, writing checkpoint files to the disk either periodically or as needed. The job can be restarted from the checkpoint files using the dmtcp_restart command later.”
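The sequence NERSC describes can be sketched as three command lines. The Python helper below only assembles them as argument lists and does not run anything. The three command names and the default port come from the quoted documentation; the `--daemon`, `-p` (port) and `-i` (checkpoint interval in seconds) options, the application `./a.out` and the checkpoint file name are assumptions for illustration and should be checked against each tool's `--help` output for the installed DMTCP version.

```python
import shlex

DEFAULT_PORT = 7779  # default coordinator port quoted in the NERSC documentation


def coordinator_cmd(port=DEFAULT_PORT):
    # One coordinator per job, started on one of the job's nodes.
    # "--daemon" and "-p" are assumed options; verify with --help.
    return ["dmtcp_coordinator", "--daemon", "-p", str(port)]


def launch_cmd(app_argv, port=DEFAULT_PORT, interval_s=None):
    # Start the application under DMTCP control, connected to the coordinator.
    # "-i" (periodic checkpoint interval in seconds) is an assumed option.
    cmd = ["dmtcp_launch", "-p", str(port)]
    if interval_s is not None:
        cmd += ["-i", str(interval_s)]
    return cmd + list(app_argv)


def restart_cmd(ckpt_files, port=DEFAULT_PORT):
    # Restart the job later from the checkpoint files written to disk.
    return ["dmtcp_restart", "-p", str(port)] + list(ckpt_files)


def plan(app_argv, ckpt_files, interval_s=300):
    # Return the three steps as shell-quoted strings, in the order they run.
    steps = [
        coordinator_cmd(),
        launch_cmd(app_argv, interval_s=interval_s),
        restart_cmd(ckpt_files),
    ]
    return [shlex.join(step) for step in steps]
```

For example, `plan(["./a.out"], ["ckpt_example.dmtcp"])` returns the coordinator, launch and restart commands as strings, where both the binary and the checkpoint file name are hypothetical placeholders.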

Charles Fan, CEO of MemVerge, said: “Distributed checkpointing is a perfect complement to ZeroIO In-Memory Snapshot technology that MemVerge has pioneered. We look forward to collaborating with the DMTCP community on future technology and market development.”

Gene Cooperman, a professor at Northeastern University, and leader of the DMTCP Project, said: “The collaboration among NERSC/LBNL, MemVerge, and the DMTCP open-source community will bring reliable and efficient transparent checkpointing to MPI (and later to CUDA) for the production market. While DMTCP and MANA will always remain free and open source, the use of MemVerge technology for rapid writing of memory to stable storage will bring an important enhancement to this technology.”

The MemVerge-DMTCP partnership should enable MemVerge to make more progress in selling its Big Memory technology to HPC customers — particularly ones using DMTCP code.