Re: [RFC][PATCH 0/2] CR: save/restore a single, simple task

Daniel Lezcano <dlezcano@xxxxxxxxxx> · Thu, 31 Jul 2008 19:15:37 +0200

Oren Laadan wrote:
> 
> Daniel Lezcano wrote:
>> Oren Laadan wrote:
>>> Disclaimer: long reply :)
>>>
>>> Serge E. Hallyn wrote:
>>>> Quoting Oren Laadan (orenl@xxxxxxxxxxxxxxx):
>>>>> In the recent mini-summit at OLS 2008 and the following days it was
>>>>> agreed to tackle the checkpoint/restart (CR) by beginning with a very
>>>>> simple case: save and restore a single task, with simple memory
>>>>> layout, disregarding other task state such as files, signals etc.
>>>>>
>>>>> Following these discussions I coded a prototype that can do exactly
>>>>> that, as a starter. This code adds two system calls - sys_checkpoint
>>>>> and sys_restart - that a task can call to save and restore its state
>>>>> respectively. It also demonstrates how the checkpoint image file can
>>>>> be formatted, and shows its nested nature (e.g. cr_write_mm()
>>>>> -> cr_write_vma() nesting).
>>>>>
>>>>> The state that is saved/restored is the following:
>>>>> * some of the task_struct
>>>>> * some of the thread_struct and thread_info
>>>>> * the cpu state (including FPU)
>>>>> * the memory address space
>>>>>
>>>>> [The patch is against commit fb2e405fc1fc8b20d9c78eaa1c7fd5a297efde43
>>>>> of Linus's tree (uhhh.. don't ask why), but it applies against
>>>>> tonight's head too].
>>>>>
>>>>> In the current code, sys_checkpoint will checkpoint the current task,
>>>>> although the logic exists to checkpoint other tasks (not in the
>>>>> checkpointee's execution context). A simple loop will extend this to
>>>>> handle multiple processes. sys_restart restarts the current task, and
>>>>> with multiple tasks each task will call the syscall independently.
>>>> I assume that approach worked in Zap, so there must be a simple solution
>>>> to this, but I don't see how having each process in a container
>>>> independently call sys_restart works for sharing.  Oh, or is that where
>>> The main reason to do that (and I thought openvz works similarly ?) is
>>> that I want to re-use as much of the existing kernel functionality as
>>> possible.
>>> Restart differs from checkpoint in that you have to construct new
>>> resources as opposed to only inspecting existing resources. To inspect -
>>> you only need a reference to the object and then to obtain its state by
>>> accessing it. In contrast, to construct, you need to create a new resource.
>>>
>>> In almost all cases, creating a resource for a process is easiest if done
>>> by the process itself. For instance - to restore the memory map, you want
>>> the process that owns the target mm to call mmap() (in particular, the
>>> lower-level do_mmap_pgoff() function, which is more convenient for us). If
>>> the process that restores a given vma didn't own that mm, it would take
>>> much more pain to build the vma into a "foreign" mm.
>>>
>>> Thus, there is a huge advantage to doing everything in the context of the
>>> target process: we can re-use the existing kernel code (and spirit) to
>>> create the resources, instead of having to hand-craft them carefully with
>>> specialized code.
>>>
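
To make the in-context argument above a bit more concrete, here is a rough
userspace analogy. It is not the kernel path (which would go through
do_mmap_pgoff()), only a sketch of a process rebuilding one region of its own
address space. struct saved_vma and restore_vma() are made-up names, the
record layout is an assumption and not the patch's image format, and only
anonymous private regions are handled.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <stddef.h>
#include <string.h>
#include <sys/mman.h>

/* Hypothetical saved-region record; NOT the image format from the patch. */
struct saved_vma {
	void *start;		/* address the region had at checkpoint time */
	size_t len;		/* length in bytes */
	int prot;		/* final PROT_* flags */
	const void *data;	/* saved contents, len bytes */
};

/*
 * Recreate one anonymous private region in the caller's own address space.
 * MAP_FIXED replaces whatever is mapped at v->start, so the caller must
 * know that range is free. Returns 0 on success, -1 on failure.
 */
static int restore_vma(const struct saved_vma *v)
{
	void *p;

	assert(v && v->data && v->len > 0);

	/* map writable first so the saved contents can be copied in */
	p = mmap(v->start, v->len, PROT_READ | PROT_WRITE,
		 MAP_PRIVATE | MAP_ANONYMOUS | MAP_FIXED, -1, 0);
	if (p == MAP_FAILED)
		return -1;

	memcpy(p, v->data, v->len);

	/* then drop to the protection the region had when it was saved */
	if (mprotect(p, v->len, v->prot) < 0)
		return -1;
	return 0;
}
```

The point is only that the owner of the address space gets this with two
ordinary calls; doing the same to another process's mm has no such shortcut.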
>>>> a 'container restart context' comes in?  An nsproxy has a pointer to a
>>> More or less. At a first approximation, this is how I envision it:
>>>
>>> 0) in user space, a new (empty) container will be created with all the
>>> needed settings for the file system etc (mounts .. and the like)
>>>
>>> 1) the first task (container init) will call sys_restart with the
>>> checkpoint image file.
>>>
>>> 2) the code will verify the header, then read in the global section; it
>>> will create a restart-context which will be referenced from the
>>> container-object (one option we considered is to have the freezer-cgroup
>>> be that object).
>>>
>>> 3) using the info from that section, it will create the task tree (forest)
>>> to be restored. In particular, new tasks will be created and each will end
>>> up in do_restart_task() inside the kernel.
>>>
>>> [note that in Zap, step 3 is still done in user space...]
>>>
>>> Since all tasks live in the container, they will all have access to the
>>> restart-context, through which all coordination is done.
>>>
>>> At first, the restart will be performed _one task at a time_, in the order
>>> they were dumped. So while the init task restores itself, the remaining
>>> tasks sleep. When the init task finishes - it will wake the next in line
>>> and so on. The last one will wake the init task to finalize the work. So:
>>>
>>> 4) each task waits (sleeps) until it is prompted to restore its own state.
>>> When it completes, it wakes up the next task in line and goes to a freeze
>>> state.
>>>
>>> 5) the init task finalizes the restart, and either completes the freeze or
>>> unfreezes the container, depending on what the user requested.
>>>
>>> This scheme makes sense because we assume that the data is streamed. So it
>>> does not make much sense to try to restart the 5th job before the 2nd job
>>> because the data isn't there yet. Moreover, if they refer to the same
>>> shared object, job#5 will have to wait for job#2 to create the object,
>>> since its state was saved with that job.
>>>
>>> In the future, to speed up the process by restarting multiple tasks
>>> concurrently, we'll have to read in data from the stream into a buffer
>>> (read-ahead) and then restarting tasks could skip data that doesn't belong
>>> to them; while they may still need to wait for shared resources to be
>>> created, other work can be done in parallel in the meantime.
>>>
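
The one-task-at-a-time handoff in steps 4) and 5) above can be mimicked in
userspace with fork() and pipes. This is only a sketch of the ordering, not
of the proposed kernel code: restart_in_order(), task_body() and MAX_TASKS
are invented names, the parent stands in for the init task as coordinator,
and "restoring" a task is reduced to logging its turn.

```c
#include <assert.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

#define MAX_TASKS 16	/* arbitrary cap for this sketch */

static void close_fd(int *fd)
{
	if (*fd >= 0) {
		close(*fd);
		*fd = -1;
	}
}

/* Task i: sleep until prompted, "restore" itself, wake the next in line. */
static void task_body(int i, int wake_fd, int log_fd, int next_fd)
{
	char token;

	if (read(wake_fd, &token, 1) != 1)
		_exit(1);
	/* stand-in for restoring state: record that it was our turn */
	if (write(log_fd, &i, sizeof(i)) != (ssize_t)sizeof(i))
		_exit(1);
	if (write(next_fd, &token, 1) != 1)
		_exit(1);
	_exit(0);
}

/*
 * Run ntasks tasks strictly one after another. order[k] receives the index
 * of the task that ran k-th. Returns 0 on success, -1 on error.
 */
static int restart_in_order(int ntasks, int *order)
{
	int turn[MAX_TASKS + 1][2], logfd[2] = { -1, -1 };
	pid_t pids[MAX_TASKS];
	int i, j, nforked = 0, ret = -1;
	char token = 'x';

	if (ntasks < 1 || ntasks > MAX_TASKS)
		return -1;
	for (i = 0; i <= ntasks; i++)
		turn[i][0] = turn[i][1] = -1;
	if (pipe(logfd) < 0)
		goto out;
	for (i = 0; i <= ntasks; i++)
		if (pipe(turn[i]) < 0)
			goto out;

	for (i = 0; i < ntasks; i++) {
		pids[i] = fork();
		if (pids[i] < 0)
			goto out;
		if (pids[i] == 0) {
			/* keep only: own wake pipe, log, next wake pipe */
			for (j = 0; j <= ntasks; j++) {
				if (j != i)
					close(turn[j][0]);
				if (j != i + 1)
					close(turn[j][1]);
			}
			close(logfd[0]);
			task_body(i, turn[i][0], logfd[1], turn[i + 1][1]);
		}
		nforked++;
	}

	/* drop write ends we do not use, so a dead task shows up as EOF */
	for (j = 1; j <= ntasks; j++)
		close_fd(&turn[j][1]);
	close_fd(&logfd[1]);

	/* the coordinator wakes the first task ... */
	if (write(turn[0][1], &token, 1) != 1)
		goto out;
	/* ... and is woken by the last one to finalize */
	if (read(turn[ntasks][0], &token, 1) != 1)
		goto out;
	for (i = 0; i < ntasks; i++)
		if (read(logfd[0], &order[i], sizeof(order[i])) !=
		    (ssize_t)sizeof(order[i]))
			goto out;
	ret = 0;
out:
	for (j = 0; j <= ntasks; j++) {
		close_fd(&turn[j][0]);
		close_fd(&turn[j][1]);
	}
	close_fd(&logfd[0]);
	close_fd(&logfd[1]);
	for (i = 0; i < nforked; i++)
		waitpid(pids[i], NULL, 0);
	return ret;
}
```

Each task blocks on its own pipe, so nothing runs out of order even though
all tasks exist from the start. That matches the streamed-image assumption:
task N has nothing to read until task N-1 is done.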
>>>> checkpoint/restart context which the first task creates and all tasks
>>>> reference and update?  So task 5 created its mm_struct, task 6 is
>>>> supposed to use the same mm_struct, so it finds that out from the
>>>> context?  I wonder whether that would start to become complicated
>>>> when checkpointing nested containers.
>>> Yes, that's what I had in mind - the restart context holds a hash table
>>> that references all the shared objects that are created during the restart.
>>> (Like the checkpoint context that will hold references to objects that
>>> have been inspected).
>>>
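
For the shared-object table mentioned above, a minimal sketch of what I
understand by it, in plain C. All names here (struct restart_ctx, struct
obj_ref, ctx_lookup(), ctx_insert(), the "tag" that identifies an object in
the image) are assumptions for illustration; real kernel code would use the
kernel's own list/hash helpers and locking, which are left out.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

#define OBJ_HASH_BUCKETS 64	/* arbitrary size for this sketch */

/* One restored shared object, keyed by the tag it carried in the image. */
struct obj_ref {
	unsigned long tag;
	void *obj;
	struct obj_ref *next;
};

/* Hypothetical restart context: just the shared-object table. */
struct restart_ctx {
	struct obj_ref *buckets[OBJ_HASH_BUCKETS];
};

/* Return the object already restored under this tag, or NULL if none. */
static void *ctx_lookup(const struct restart_ctx *ctx, unsigned long tag)
{
	const struct obj_ref *r;

	for (r = ctx->buckets[tag % OBJ_HASH_BUCKETS]; r; r = r->next)
		if (r->tag == tag)
			return r->obj;
	return NULL;
}

/* Record a newly created object. Returns 0 on success, -1 on failure. */
static int ctx_insert(struct restart_ctx *ctx, unsigned long tag, void *obj)
{
	struct obj_ref *r;

	assert(obj != NULL);
	r = malloc(sizeof(*r));
	if (!r)
		return -1;
	r->tag = tag;
	r->obj = obj;
	r->next = ctx->buckets[tag % OBJ_HASH_BUCKETS];
	ctx->buckets[tag % OBJ_HASH_BUCKETS] = r;
	return 0;
}
```

In Serge's example, task 5 would create its mm and ctx_insert() it under the
tag found in the image; task 6 reads the same tag, gets a hit from
ctx_lookup(), and attaches to the existing object instead of creating one.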
>>> Checkpointing nested containers ???   Why ?
>>> I'm not sure why that would be a problem; but sure, we need to discuss
>>> that using a concrete use-case and identify the needs and difficulties.
>> In the current proposal, we talked about creating an empty container
>> and having the first process call sys_restart. With nested containers, we
>> have to CR the container itself, no ? I don't see how we can CR nested
>> containers otherwise :/
> 
> Probably so: with nested containers it is necessary to also save the state
> of the "container-tree" (which is sort of analogous to task-tree).
> In particular, because tasks in nested containers are essentially part
> of the outermost container that is being checkpointed. Is this issue
> specific to the proposed scheme, or a general issue of any scheme ?

I meant an issue with the proposed scheme. How do we call sys_restart
recursively on a pid 1 with nested containers, if we want to create the
container and have its first process call sys_restart ?

But anyway, let's checkpoint a single container first :)

> I think that to tackle this, we need to first agree and implement an
> object that represents a container (again, the freezer_cgroup ?).

Didn't we agree on creating a checkpoint/restart control group
subsystem in which the context is allocated ?

>>>> So I still prefer the idea that the init process calls restart, and that
>>>> creates all the tasks in the container and rebuilds them.  But you have
>>>> code, so you win :)
>>> I agree: the init task calls restart, and that creates all the tasks in
>>> the container. And then, make each of them call do_restart_task() in
>>> some way :)
>>>
>>>> Anyway I'm still reading through patch 2.  It looks great to me - the
>>>> only comments I have written so far are:
>>>>     1. why not just store LINUX_VERSION_CODE in the header instead
>>>>     of breaking it up
>>> hmph ... good question. Avoid 32/64 bit conversion complications ?
>>>
>>>>     2. the x86-specific code should of course go into arch-specific
>>>>     directories, but 
>>> of course. I left it there for simplicity right now.
>>>
>>>> neither of which really is worth the bother right now imo :)
>>>>
>>>>> (Actually, to checkpoint outside the context of a task, it is also
>>>>> necessary to handle restart-block logic when saving/restoring the
>>>>> thread data).
>>>>>
>>>>> It takes longer to describe what isn't implemented or supported by
>>>>> this prototype ... basically everything that isn't as simple as the
>>>>> above.
>>>>>
>>>>> As for containers - since we still don't have a representation for a
>>>>> container, this patch has no notion of a container. The tests for
>>>>> consistent namespaces (and isolation) are also omitted.
>>>>>
>>>>> Below are two example programs: one uses checkpoint (called ckpt) and
>>>>> one uses restart (called rstr). Execute like this (as a superuser):
>>>>>
>>>>> orenl:~/test$ ./ckpt > out.1
>>>>> hello, world!  (ret=1)        <-- sys_checkpoint returns positive id
>>>>>                  <-- ctrl-c
>>>>> orenl:~/test$ ./ckpt > out.2
>>>>> hello, world!  (ret=2)
>>>>>                  <-- ctrl-c
>>>>> orenl:~/test$ ./rstr < out.1
>>>>> hello, world!  (ret=0)        <-- sys_restart returns 0
>>>>>
>>>>> (if you check the output of ps, you'll see that "rstr" changed its
>>>>> name to "ckpt", as expected).
>>>>>
>>>>> Hoping this will accelerate the discussion. Comments are welcome.
>>>>> Let the fun begin :)
>>>>>
>>>>> Oren.
>>>>>
>>>>>
>>>>> ============================== ckpt.c ================================
>>>>>
>>>>> #define _GNU_SOURCE        /* or _BSD_SOURCE or _SVID_SOURCE */
>>>>>
>>>>> #include <stdio.h>
>>>>> #include <stdlib.h>
>>>>> #include <errno.h>
>>>>> #include <fcntl.h>
>>>>> #include <unistd.h>
>>>>> #include <asm/unistd_32.h>
>>>>> #include <sys/syscall.h>
>>>>>
>>>>> int main(int argc, char *argv[])
>>>>> {
>>>>>      pid_t pid = getpid();
>>>>>      int ret;
>>>>>
>>>>>      ret = syscall(__NR_checkpoint, pid, STDOUT_FILENO, 0);
>>>>>      if (ret < 0)
>>>>>          perror("checkpoint");
>>>>>
>>>>>      fprintf(stderr, "hello, world!  (ret=%d)\n", ret);
>>>>>
>>>>>      while (1)
>>>>>          ;
>>>>>
>>>>>      return 0;
>>>>> }
>>>>>
>>>>> ============================== rstr.c ================================
>>>>>
>>>>> #define _GNU_SOURCE        /* or _BSD_SOURCE or _SVID_SOURCE */
>>>>>
>>>>> #include <stdio.h>
>>>>> #include <stdlib.h>
>>>>> #include <errno.h>
>>>>> #include <fcntl.h>
>>>>> #include <unistd.h>
>>>>> #include <asm/unistd_32.h>
>>>>> #include <sys/syscall.h>
>>>>>
>>>>> int main(int argc, char *argv[])
>>>>> {
>>>>>      pid_t pid = getpid();
>>>>>      int ret;
>>>>>
>>>>>      ret = syscall(__NR_restart, pid, STDIN_FILENO, 0);
>>>>>      if (ret < 0)
>>>>>          perror("restart");
>>>>>
>>>>>      printf("should not reach here !\n");
>>>>>
>>>>>      return 0;
>>>>> }
_______________________________________________
Containers mailing list
Containers@xxxxxxxxxxxxxxxxxxxxxxxxxx
https://lists.linux-foundation.org/mailman/listinfo/containers