Explore task migration at scale using CRIU in this Linux Plumbers Conference talk. Dive into Google's experience deploying Checkpoint/Restore in Userspace (CRIU) to migrate container workloads between machines without losing application state. Learn about the challenges of supporting production workloads, integrating with existing container infrastructure, and managing migratable containers at scale. Discover the impact on efficiency and utilization in Google's computing infrastructure, which manages millions of simultaneous jobs in data centers worldwide. Gain insights into the current state of the CRIU project, new requirements for large-scale deployment, and lessons learned in practice. Topics covered include networking, storage, task environment, performance, user experience, and adoption challenges. Discuss potential improvements to CRIU in performance, security, and time handling. Consider the future direction of CRIU and task migration in Linux as a whole, including migration time optimization, weight feed, scheduling, remote storage, and persistent disk implementation.