Re: [RFC][PATCH 0/7 + tools] Checkpoint/restore mostly in the userspace

On Fri, Jul 15, 2011 at 05:45:10PM +0400, Pavel Emelyanov wrote:
> Hi guys!
> 
> There have already been made many attempts to have the checkpoint/restore functionality
> in Linux, but as far as I can see there's still no final solutions that suits most of
> the interested people. The main concern about the previous approaches as I see it was
> about - all that stuff was supposed to sit in the kernel thus creating various problems.
> 
> I'd like to bring this subject back again proposing the way of how to implement c/r
> mostly in the userspace with the reasonable help of a kernel.
> 
> 
> That said, I propose to start with very basic set of objects to c/r that can work with
> 
> * x86_64 tasks (subtree) which includes
>    - registers
>    - TLS
>    - memory of all kinds (file and anon both shared and private)

Do mixes of 32 and 64-bit tasks present any problems with this
method?

> * open regular files
> * pipes (with data in it)
> 
> Core idea:
> 
> The core idea of the restore process is to implement the binary handler that can execve-ute
> image files recreating the register and the memory state of a task. Restoring the process 

I suspect this can be done with Oren's patches too using binfmt_misc -- without any binfmt
kernel code.

> tree and opening files is done completely in the user space, i.e. when restoring the subtree
> of processes I first fork all the tasks in respective order, then open required files and 

OK. Oren's code also forked all the tasks in userspace prior to completing the restart.

> then call execve() to restore registers and memory.

That's kind of neat, but won't this interfere with restoring O_CLOEXEC
flags? (I also asked this in a reply to the TOOLS email)
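
To make the O_CLOEXEC concern concrete, here is a minimal sketch in Python
(not code from the patch set; the helper name fd_survives_exec is made up for
illustration). It opens a pipe, sets or clears close-on-exec on the read end,
and then execs a child that checks whether that fd number is still a pipe. A
restorer that opens files before calling execve() has to keep them open
across the exec, so any close-on-exec flag would presumably need to be
re-applied afterwards.

```python
import os
import subprocess
import sys


def fd_survives_exec(cloexec: bool) -> bool:
    """Return True if a pipe fd is still open in a child after exec."""
    r, w = os.pipe()
    try:
        # In Python, "non-inheritable" means FD_CLOEXEC is set on the fd.
        os.set_inheritable(r, not cloexec)
        # The child exits 0 only if the fd number still refers to a pipe.
        probe = (
            "import os, stat, sys\n"
            "mode = os.fstat(int(sys.argv[1])).st_mode\n"
            "sys.exit(0 if stat.S_ISFIFO(mode) else 1)\n"
        )
        proc = subprocess.run(
            [sys.executable, "-c", probe, str(r)],
            close_fds=False,          # let exec itself decide what gets closed
            stderr=subprocess.DEVNULL,
        )
        return proc.returncode == 0
    finally:
        os.close(r)
        os.close(w)


if __name__ == "__main__":
    print("without close-on-exec:", fd_survives_exec(False))
    print("with close-on-exec:   ", fd_survives_exec(True))
```

The fd with close-on-exec set is gone in the child, so the flag cannot simply
be restored before the final execve().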

> 
> The checkpointing process is quite simple - all we need about processes can be read from /proc
> except for several things - registers and private memory. In current implementation to get 

I put this to Tejun as well: What about stuff like epoll sets? Sure, you
can see the epoll fd in /proc/<pid>/fd, but you can't read it to tell
which fds are in it. Worse, even if you got the fds from the epoll items
via /proc, the way epoll holds onto them does not guarantee they'll refer
to the files the set would actually wait on.

As best I can tell you can't reliably checkpoint epoll sets from userspace.
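
A small sketch of that mismatch, assuming Linux and Python's select.epoll (the
function name epoll_reports_stale_fd is hypothetical). It registers a pipe's
read end, duplicates it, and closes the original fd number. The epoll set
still reports readiness under the old, now-closed fd number, because the
underlying open file is kept alive by the duplicate. The /proc/self/fd link
for the epoll fd only identifies it as an eventpoll object.

```python
import os
import select


def epoll_reports_stale_fd():
    """Show that an epoll set can report an fd number that is already closed."""
    r, w = os.pipe()
    ep = select.epoll()
    keep = -1
    try:
        ep.register(r, select.EPOLLIN)
        keep = os.dup(r)   # second fd for the same open file
        os.close(r)        # fd number r is no longer open in this process
        os.write(w, b"x")  # make the pipe readable
        events = ep.poll(1.0)
        # The symlink names the object type, not the set's members.
        link = os.readlink(f"/proc/self/fd/{ep.fileno()}")
        return r, events, link
    finally:
        ep.close()
        os.close(w)
        if keep >= 0:
            os.close(keep)


if __name__ == "__main__":
    closed_fd, events, link = epoll_reports_stale_fd()
    print("closed fd:", closed_fd)
    print("events:", events)
    print("epoll fd link:", link)
```

If that fd number were later reused for an unrelated file, a userspace tool
matching epoll items against /proc/<pid>/fd would pick the wrong file.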

Then there's the matter of unlinked files. How do you plan to deal
with those without kernel code?
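
For illustration, a minimal Python sketch of what userspace sees for an
unlinked file, assuming Linux (the function name unlinked_fd_target is
hypothetical). The fd stays usable, but the /proc/self/fd link carries a
" (deleted)" suffix and the recorded path no longer exists, so a checkpoint
that stores only the path cannot reopen the file at restart.

```python
import os
import tempfile


def unlinked_fd_target():
    """Return (original path, /proc link target, whether the path exists)."""
    fd, path = tempfile.mkstemp()
    try:
        os.write(fd, b"state")
        os.unlink(path)  # the file now has no name, but fd is still open
        link = os.readlink(f"/proc/self/fd/{fd}")
        return path, link, os.path.exists(path)
    finally:
        os.close(fd)


if __name__ == "__main__":
    path, link, exists = unlinked_fd_target()
    print("opened as:  ", path)
    print("/proc link: ", link)
    print("path exists:", exists)
```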

> them I introduce the /proc/<pid>/dump file which produces the file that can be executed by the
> described above binfmt. Additionally I introduce the /proc/<pid>/mfd/ dir with info about
> mappings. It is populated with symbolic links with names equal to vma->vm_start and pointing to
> mapped files (including anon shared which are tmpfs ones). Thus we can open some task's
> /proc/<pid>/mfd/<address> link and find out the mapped file inode (to check for sharing) and
> if required map one and read the contents of anon shared memory.

Finally, I think there's substantial room here for quiet and subtle
races to corrupt checkpoint images. If we add /proc interfaces only to
find they're racy, will we need to add yet more /proc interfaces to
fix the races while maintaining backward compatibility? To get the
locking that ensures a consistent subset of information with this
/proc-based approach, I think we'll frequently need to change the
contents of existing /proc files.

Imagine trusting the output of top to exactly represent the state of
your system's CPU usage. That's the sort of thing a piecemeal /proc
interface gets us. You're asking us to trust that frequent checkpoints
(say once every five minutes) of large, multiprocess, month-long
program runs won't quietly get corrupted and will leave enough
performance headroom that they don't interfere with the throughput of
the work.

A kernel syscall interface has a better chance of allowing us to fix
races without changing the interface. We've fixed a few races with
Oren's tree and none of them required us to change the output format.

Cheers,
	-Matt Helsley