Notes from Checkpoint/Restart BOF at Linux Plumbers Conference, Sep 24, 2009.

(I am missing some details and a couple of names. They said they were on
the Containers mailing list though. If you remember any other topics that
we discussed or have any further details, please add to this mail.)

---

Attendees:

    Oren Laadan, Joeseph Ruscio, <One more person> (Librato)
    Pavel Emelyanov, <One more person ?> (OpenVZ)
    Ying Han, Salman Qazi (Google)
    Dan Smith, Matt Helsley, Sukadev Bhattiprolu (IBM)

1. Pavel: A few months ago there were discussions about making a "dry-run"
   to see if checkpoint of an application will succeed. What is the
   current status of that?

    The answer was that there is no dry-run; the user should just try the
    actual C/R. If the application is using an uncheckpointable resource,
    the C/R will fail cleanly without side-effects.

    The dry-run may not mean anything unless we freeze the application
    during the check and leave it frozen until the checkpoint is done.
    IOW, the dry-run does not guarantee that the application is
    checkpointable unless the application is frozen.

2. Pavel: Alexey Dobriyan had earlier submitted some code for
   leak-detection. Do we still have that?

    The answer was that most of the code was used and we also added
    reverse detection.

3. Do we have a config-option to make a process checkpointable?
   <Missed the context of this question>

    We have CONFIG_CHECKPOINT.

4. Checkpointing network connections:

    We quickly reviewed the status (AF_UNIX done, AF_INET done in a
    prototype and needs to be forward ported).

    Checkpoint of one end of a network connection can cause the
    connection to be reset.

5. Briefly discussed the distinction between live migration and static
   migration.

6. Do we need a pre-check during restart to ensure that the application
   can be restarted?

    Eg: if the application used a specific math co-processor or futex at
    checkpoint and that resource is not available at restart, the restart
    may encounter some undefined behavior.
    Should we encode the hardware/OS capabilities in the checkpoint image
    and check these capabilities during restart (before the actual
    restart)? The reason for this check is that the restart may not fail
    cleanly if the resource is missing.

    Conclusion was that there could be too many such capabilities that we
    would have to track, and even so there may be some unexpected
    difference between the checkpoint machine and the restart machine.
    For now, let the restart fail and/or deal with it in user-space.

7. Discussed briefly about clone2() aka clone_with_pids().

    Everyone seemed to agree that restoring the process tree even in
    user-space will work and can be used.

8. Oren: Error reporting during restart

    We currently fail the system call with an error code, and if we want
    more information on the failure, we have to add debug messages to the
    code.

    We discussed a couple of options for error reporting on restart:

    - log detailed message(s) to the console (risk wrapping the dmesg
      buf)

    - pass an extra buffer to the system call and have the kernel fill in
      a more detailed error message (would need two new parameters, one
      pointer to the buf, one size of the buf).

    - pass in an extra 'log_fd' parameter to the system call and have the
      kernel write detailed messages to that log_fd (unless log_fd is
      -1). This seemed more flexible than the other two.

    We agreed that the format of the log messages can be free-format and
    that there is no guarantee that the format of the log messages will
    not change.

    But it was not clear (at least to me) if the log file should contain
    all log messages relating to the C/R or just the last (few) error
    messages.

9. Any application to summarize the checkpoint?

    We have a 'ckptinfo' that could summarize the contents of a
    checkpoint.

10. Ying Han: Is there a performance difference between the original
    instance of the application and the restarted instance? (Eg: on NUMA,
    if the application was on one node at checkpoint and, after restart,
    ended up on another node.)

    Not sure if there was a conclusion to this point.

11.
    Discussed that devices like tty, /dev/rtc etc. must be virtualized
    before we can checkpoint them.

12. Oren: Checkpointing/Restoring mount namespaces

    Bind mounts are restored in the container.

    NFS: at least on OpenVZ, since the network is frozen, reopening files
    over NFS is not possible until restart is complete. OpenVZ creates
    fake dentries to allow the open to proceed.

    Loopback devices: cannot open them in a container since they can
    lock up the system with a huge memory footprint ??

    We should disable shared-mount propagation at least for now.

13. Oren: cradvise()

    Use a single system call to optimize the checkpoint/restart?

    Eg: If an fd refers to /dev/tty1 in the checkpoint image and that tty
    is not available on restart, user-space could open another tty and
    teach the kernel to use a different tty, /dev/tty2, during restart.

    Another example: if an application has several megs of "scratch"
    memory that does not need to be checkpointed, it could use the
    cradvise() system call to optimize the checkpoint or restart.

    The conclusion was that it would be hard to get acceptance from the
    community for a new variant of an ioctl/fcntl call. So we should
    instead try to add the necessary features to existing system calls
    like fcntl(), shmctl() or madvise().

14. Oren: Unlinked files/directories

    May need to copy the contents of the deleted file to the checkpoint
    image (only on ext4?).

    Create a fake hard link to the file so the file still exists in the
    filesystem snapshot, and remove the link during restart.

    There is a good paper discussing snapshot/restore of unlinked files
    on Xen. The same concept could be used in C/R too? (If you have links
    to the paper, please add.)

15. Network namespaces

    Restore namespaces in user-space, restore sockets in-kernel.

    Cannot create devices in user-space unless we know the index for the
    network device? (Missed details on this discussion)

16. Time

    Will need some policies on restart like:

    - use absolute time or relative time
    - do new children inherit the policy?
    - do we gradually adjust from relative to absolute time?

    If not cradvise(), maybe timectl() :-p

17. VDSO

    (Missed details on this discussion)

18. Async I/O

    Getting a lockdep report during checkpoint?

    OpenVZ flushes I/O, waits for pending I/O and then retries the
    checkpoint.

    We may need to do the same for mmap I/O?

19. Checkpoint data structures:

    - Try to keep extensions to existing data structures minimal
    - If necessary, add to the end of data structures
    - But do not get locked down to an ABI at this point, i.e. even after
      entering mainline, the format of the checkpoint image may change
      for a while before stabilizing.

20. Test suite:

    OpenVZ has some test cases that have various applications go to
    specific states and wait for a checkpoint. After that, and after
    restart, they check that nothing has changed unexpectedly.
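
A footnote to item 19, to illustrate why "add to the end of data
structures" helps. This is only a minimal sketch and NOT the actual
checkpoint image format: it assumes a hypothetical record that starts with
a total-length field, and the names (pack_record, read_v1, the pid/state/
nice fields) are made up for the example. Under that assumption, a reader
that only knows the older fields can parse the prefix it understands and
skip whatever a newer writer appended.

```python
import struct

# Hypothetical layout, NOT the real checkpoint image format:
#   u32 total_len | u32 pid | u32 state | [fields appended by later versions]
HDR = struct.Struct("<I")       # total record length, header included
V1_BODY = struct.Struct("<II")  # pid, state: the fields every version has
V2_TAIL = struct.Struct("<I")   # "nice": a field a later version appends


def pack_record(pid, state, nice=None):
    """Write a v1 record, or a v2 record when the appended field is given."""
    body = V1_BODY.pack(pid, state)
    if nice is not None:
        body += V2_TAIL.pack(nice)
    return HDR.pack(HDR.size + len(body)) + body


def read_v1(buf, offset=0):
    """Old reader: parse the known prefix, then skip any appended tail."""
    (total_len,) = HDR.unpack_from(buf, offset)
    if total_len < HDR.size + V1_BODY.size or offset + total_len > len(buf):
        raise ValueError("truncated or corrupt record")
    pid, state = V1_BODY.unpack_from(buf, offset + HDR.size)
    return (pid, state), offset + total_len


# A stream mixing a newer (v2) record and an older (v1) record.
stream = pack_record(100, 1, nice=5) + pack_record(101, 2)
first, pos = read_v1(stream)
second, pos = read_v1(stream, pos)
print(first, second, pos == len(stream))
```

Had the new field been inserted in the middle of the record instead, the
old reader would have misread "state". This does not make the format a
stable ABI; it only keeps older tools working across small extensions
while the format is still changing.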