iCheck: Leveraging RDMA and Malleability for Application-Level Checkpointing in HPC Systems
Description:
Learn about iCheck, a novel checkpointing framework, in this technical presentation from researchers at the Technical University of Munich. Explore how RDMA and malleable, multilevel, application-level checkpointing can address the critical challenge of system failures in exascale supercomputers. Discover the implementation details of iCheck's RDMA-enabled, configurable, multi-agent-based checkpoint transfer mechanism, which minimizes application resource usage. Examine how the libfabric library provides RDMA support, allowing remote access to preregistered memory regions without CPU involvement, which improves throughput and reduces latency. Understand the two checkpoint and restart methods, based on RDMA read and RDMA write operations, along with the push and pull transfer techniques. See the performance improvements demonstrated through integration with ls1 mardyn, LULESH, a Jacobi 2D heat simulation, and synthetic applications, with up to 5000x better performance than traditional in-house checkpointing mechanisms.

OpenFabrics Alliance