On 07/23/2011 04:25 AM, Matt Helsley wrote:
> On Fri, Jul 15, 2011 at 05:45:10PM +0400, Pavel Emelyanov wrote:
>> Hi guys!
>>
>> There have already been made many attempts to have the checkpoint/restore functionality
>> in Linux, but as far as I can see there's still no final solutions that suits most of
>> the interested people. The main concern about the previous approaches as I see it was
>> about - all that stuff was supposed to sit in the kernel thus creating various problems.
>>
>> I'd like to bring this subject back again proposing the way of how to implement c/r
>> mostly in the userspace with the reasonable help of a kernel.
>>
>>
>> That said, I propose to start with very basic set of objects to c/r that can work with
>>
>> * x86_64 tasks (subtree) which includes
>>   - registers
>>   - TLS
>>   - memory of all kinds (file and anon both shared and private)
>
> Do mixes of 32 and 64-bit tasks present any problems with this
> method?

In theory, no. But in practice I haven't written the 32-bit support yet.

>> * open regular files
>> * pipes (with data in it)
>>
>> Core idea:
>>
>> The core idea of the restore process is to implement the binary handler that can execve-ute
>> image files recreating the register and the memory state of a task. Restoring the process
>
> I suspect this can be done with Oren's patches too using binfmt-misc -- without any binfmt
> kernel code.
>
>> tree and opening files is done completely in the user space, i.e. when restoring the subtree
>> of processes I first fork all the tasks in respective order, then open required files and
>
> OK. Oren's code also forked all the tasks in userspace prior to completing the restart.
>
>> then call execve() to restore registers and memory.
>
> That's kind of neat, but won't this interfere with restoring O_CLOEXEC
> flags?
> (I also asked this in a reply to the TOOLS email)
>
>>
>> The checkpointing process is quite simple - all we need about processes can be read from /proc
>> except for several things - registers and private memory. In current implementation to get
>
> I put this to Tejun as well: What about stuff like epoll sets? Sure, you
> can see the epoll fd in /proc/<pid>/fd, but you can't read it to tell
> which fds are in it. Worse, even if you got the fds from the epoll items
> via /proc, the way epoll holds onto them does not guarantee they'll refer
> to the files the set would actually wait on.
>
> As best I can tell you can't reliably checkpoint epoll sets from userspace.

With the existing interfaces - yes, that's true. My aim was to start a discussion
on whether we can extend the kernel APIs to make it possible.

> Then there's the matter of unlinked files. How do you plan to deal
> with those without kernel code?

You will have the same problem even with c/r in the kernel. Frankly, I don't see
much difference in where this one gets solved. Can you elaborate?

>> them I introduce the /proc/<pid>/dump file which produces the file that can be executed by the
>> described above binfmt. Additionally I introduce the /proc/<pid>/mfd/ dir with info about
>> mappings. It is populated with symbolic links with names equal to vma->vm_start and pointing to
>> mapped files (including anon shared which are tmpfs ones). Thus we can open some task's
>> /proc/<pid>/mfd/<address> link and find out the mapped file inode (to check for sharing) and
>> if required map one and read the contents of anon shared memory.
>
> Finally, I think there's substantial room here for quiet and subtle
> races to corrupt checkpoint images. If we add /proc interfaces only to
> find they're racy will we need to add yet more /proc interfaces to
> maintain backward compatibility yet fix the races?
> To get the locking that ensures a consistent subset of information with
> this /proc-based approach I think we'll frequently need to change the
> contents of existing /proc files.
>
> Imagine trusting the output of top to exactly represent the state of
> your system's cpu usage. That's the sort of thing a piecemeal /proc
> interface gets us. You're asking us to trust that frequent checkpoints
> (say once every five minutes) of large, multiprocess, month-long
> program runs won't quietly get corrupted and will leave plenty of
> performance to not interfere with the throughput of the work.
>
> A kernel syscall interface has a better chance of allowing us to fix
> races without changing the interface. We've fixed a few races with
> Oren's tree and none of them required us to change the output format.

If we all decide that we do want checkpoint/restart as an all-in-kernel
approach, then OK. But my impression is that the community is not happy with it.

> Cheers,
>     -Matt Helsley
> .
>

_______________________________________________
Containers mailing list
Containers@xxxxxxxxxxxxxxxxxxxxxxxxxx
https://lists.linux-foundation.org/mailman/listinfo/containers
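
For reference, Matt's O_CLOEXEC question above can be illustrated with a small userspace experiment. If the restorer opens a task's files first and only then calls execve() on the image, any descriptor that carries the close-on-exec flag is closed by that very execve(), so the flag cannot simply be set at open time. The sketch below is not code from this thread and says nothing about how the proposed binfmt handler would handle the case; it is written in Python for brevity, and the names `PROBE_FD`, `PROBE` and `fd_survives_exec` are made up for the illustration.

```python
import fcntl
import os
import sys

PROBE_FD = 100  # arbitrary high descriptor number, chosen for this sketch only

# Program run after execve(): exit 0 if the given fd is still open, 1 otherwise.
PROBE = (
    "import os, sys\n"
    "try:\n"
    "    os.fstat(int(sys.argv[1]))\n"
    "except OSError:\n"
    "    sys.exit(1)\n"
)


def fd_survives_exec(cloexec):
    """Return True if a descriptor opened before execve() is still open after it."""
    r, w = os.pipe()
    pid = os.fork()
    if pid == 0:
        try:
            # "Restorer" side: put the file at a known number, then exec.
            os.dup2(r, PROBE_FD, inheritable=True)
            if cloexec:
                # Mark the descriptor close-on-exec, as the original task had it.
                fcntl.fcntl(PROBE_FD, fcntl.F_SETFD, fcntl.FD_CLOEXEC)
            os.execv(sys.executable,
                     [sys.executable, "-c", PROBE, str(PROBE_FD)])
        finally:
            os._exit(2)  # only reached if something above failed
    os.close(r)
    os.close(w)
    _, status = os.waitpid(pid, 0)
    return os.WIFEXITED(status) and os.WEXITSTATUS(status) == 0


if __name__ == "__main__":
    print("without FD_CLOEXEC:", fd_survives_exec(False))
    print("with FD_CLOEXEC:   ", fd_survives_exec(True))
```

A descriptor without the flag survives the exec and one with the flag does not, which suggests the flag would have to be recorded in the image and re-applied after the execve() step rather than before it.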