Re: [RFC][PATCH 0/2] CR: save/restore a single, simple task

Oren Laadan <orenl@xxxxxxxxxxxxxxx> · Thu, 31 Jul 2008 11:25:22 -0400

Daniel Lezcano wrote:
> Oren Laadan wrote:
>> Disclaimer: long reply :)
>>
>> Serge E. Hallyn wrote:
>>> Quoting Oren Laadan (orenl@xxxxxxxxxxxxxxx):
>>>> In the recent mini-summit at OLS 2008 and the following days it was
>>>> agreed to tackle the checkpoint/restart (CR) by beginning with a very
>>>> simple case: save and restore a single task, with simple memory
>>>> layout, disregarding other task state such as files, signals etc.
>>>>
>>>> Following these discussions I coded a prototype that can do exactly
>>>> that, as a starter. This code adds two system calls - sys_checkpoint
>>>> and sys_restart - that a task can call to save and restore its state
>>>> respectively. It also demonstrates how the checkpoint image file can
>>>> be formatted, as well as show its nested nature (e.g. cr_write_mm()
>>>> -> cr_write_vma() nesting).
>>>>
>>>> The state that is saved/restored is the following:
>>>> * some of the task_struct
>>>> * some of the thread_struct and thread_info
>>>> * the cpu state (including FPU)
>>>> * the memory address space
>>>>
>>>> [The patch is against commit fb2e405fc1fc8b20d9c78eaa1c7fd5a297efde43
>>>> of Linus's tree (uhhh.. don't ask why), but it also applies against
>>>> tonight's head].
>>>>
>>>> In the current code, sys_checkpoint will checkpoint the current task,
>>>> although the logic exists to checkpoint other tasks (not in the
>>>> checkpointee's execution context). A simple loop will extend this to
>>>> handle multiple processes. sys_restart restarts the current task, and
>>>> with multiple tasks each task will call the syscall independently.
>>> I assume that approach worked in Zap, so there must be a simple solution
>>> to this, but I don't see how having each process in a container
>>> independently call sys_restart works for sharing.  Oh, or is that where
>>
>> The main reason to do that (and I thought openvz works similarly ?) is
>> that I want to re-use as much as possible the existing kernel
>> functionality. Restart differs from checkpoint in that you have to
>> construct new resources as opposed to only inspecting existing ones.
>> To inspect, you only need a reference to the object and can then obtain
>> its state by accessing it. In contrast, to construct, you need to create
>> a new resource.
>>
>> In almost all cases, creating a resource for a process is easiest if
>> done by the process itself. For instance - to restore the memory map,
>> you want the process that owns the target mm to call mmap() (in
>> particular, the lower level and more convenient for us do_mmap_pgoff()
>> function). If the process that restores a given vma didn't own that mm,
>> it would take much more pain to build the vma into a "foreign" mm.
>>
>> Thus, there is a huge advantage to doing everything in the context of
>> the target process: we can re-use the existing kernel code (and spirit)
>> to create the resources, instead of having to hand-craft them carefully
>> with specialized code.
>>
>>> a 'container restart context' comes in?  An nsproxy has a pointer to a
>>
>> More or less. At a first approximation, this is how I envision it:
>>
>> 0) in user space, a new (empty) container will be created with all the
>> needed settings for the file system etc (mounts .. and the like)
>>
>> 1) the first task (container init) will call sys_restart with the
>> checkpoint image file.
>>
>> 2) the code will verify the header, then read in the global section; it
>> will create a restart-context which will be referenced from the
>> container-object (one option we considered is to have the
>> freezer-cgroup be that object).
>>
>> 3) using the info from that section, it will create the task tree
>> (forest) to be restored. In particular, new tasks will be created and
>> each will end up in do_restart_task() inside the kernel.
>>
>> [note that in Zap, step 3 is still done in user space...]
>>
>> Since all tasks live in the container, they will all have access to the
>> restart-context, through which all coordination is done.
>>
>> At first, the restart will be performed _one task at a time_, in the
>> order they were dumped. So while the init task restores itself, the
>> remaining tasks sleep. When the init task finishes - it will wake the
>> next in line and so on. The last one will wake the init task to finalize
>> the work. So:
>>
>> 4) each task waits (sleeps) until it is prompted to restore its own
>> state. When it completes, it wakes up the next task in line and goes to
>> a freeze state.
>>
>> 5) the init task finalizes the restart, and either completes the freeze
>> or unfreezes the container, depending on what the user requested.
>>
>> This scheme makes sense because we assume that the data is streamed. So
>> it does not make much sense to try to restart the 5th job before the 2nd
>> job because the data isn't there yet. Moreover, if they refer to the
>> same shared object, job#5 will have to wait for job#2 to create the
>> object, since its state was saved with that job.
>>
>> In the future, to speed up the process by restarting multiple tasks
>> concurrently, we'll have to read in data from the stream into a buffer
>> (read-ahead) and then restarting tasks could skip data that doesn't
>> belong to them; while they may still need to wait for shared resources
>> to be created, other work can be done in parallel in the meantime.
>>
>>> checkpoint/restart context which the first task creates and all tasks
>>> reference and update?  So task 5 created its mm_struct, task 6 is
>>> supposed to use the same mm_struct, so it finds that out from the
>>> context?  I wonder whether that would start to become complicated
>>> when checkpointing nested containers.
>>
>> Yes, that's what I had in mind - the restart context holds a hash table
>> that references all the shared objects that are created during the
>> restart (like the checkpoint context, which will hold references to
>> objects that have been inspected).
>>
>> Checkpointing nested containers ???   Why ?
>> I'm not sure why that would be a problem; but sure, we need to discuss
>> that using a concrete use-case and identify the needs and difficulties.
> 
> In the current proposal, we talked about creating an empty container
> and having the first process call sys_restart. With nested containers,
> we have to CR the container itself, no ? I don't see how we can CR
> nested containers otherwise :/

Probably so: with nested containers it is necessary to also save the state
of the "container-tree" (which is sort of analogous to the task-tree),
in particular because tasks in nested containers are essentially part
of the outermost container that is being checkpointed. Is this issue
specific to the proposed scheme, or a general issue of any scheme ?

I think that to tackle this, we first need to agree on and implement an
object that represents a container (again, the freezer_cgroup ?).

Oren.

> 
>>> So I still prefer the idea that the init process calls restart, and that
>>> creates all the tasks in the container and rebuilds them.  But you have
>>> code, so you win :)
>>
>> I agree: the init task calls restart, and that creates all the tasks in
>> the container. And then, make each of them call do_restart_task() in
>> some way :)
>>
>>> Anyway I'm still reading through patch 2.  It looks great to me - the
>>> only comments I have written so far are:
>>>     1. why not just store LINUX_VERSION_CODE in the header instead
>>>     of breaking it up
>>
>> hmph ... good question. Avoid 32/64 bit conversion complications ?
>>
>>>     2. the x86-specific code should of course go into arch-specific
>>>     directories, but 
>>
>> of course. I left it there for simplicity right now.
>>
>>> neither of which really is worth the bother right now imo :)
>>>
>>>> (Actually, to checkpoint outside the context of a task, it is also
>>>> necessary to handle restart-block logic when saving/restoring the
>>>> thread data).
>>>>
>>>> It takes longer to describe what isn't implemented or supported by
>>>> this prototype ... basically everything that isn't as simple as the
>>>> above.
>>>>
>>>> As for containers - since we still don't have a representation for a
>>>> container, this patch has no notion of a container. The tests for
>>>> consistent namespaces (and isolation) are also omitted.
>>>>
>>>> Below are two example programs: one uses checkpoint (called ckpt) and
>>>> one uses restart (called rstr). Execute like this (as a superuser):
>>>>
>>>> orenl:~/test$ ./ckpt > out.1
>>>> hello, world!  (ret=1)        <-- sys_checkpoint returns positive id
>>>>                  <-- ctrl-c
>>>> orenl:~/test$ ./ckpt > out.2
>>>> hello, world!  (ret=2)
>>>>                  <-- ctrl-c
>>>> orenl:~/test$ ./rstr < out.1
>>>> hello, world!  (ret=0)        <-- sys_restart returns 0
>>>>
>>>> (if you check the output of ps, you'll see that "rstr" changed its
>>>> name to "ckpt", as expected).
>>>>
>>>> Hoping this will accelerate the discussion. Comments are welcome.
>>>> Let the fun begin :)
>>>>
>>>> Oren.
>>>>
>>>>
>>>> ============================== ckpt.c ================================
>>>>
>>>> #define _GNU_SOURCE        /* or _BSD_SOURCE or _SVID_SOURCE */
>>>>
>>>> #include <stdio.h>
>>>> #include <stdlib.h>
>>>> #include <errno.h>
>>>> #include <fcntl.h>
>>>> #include <unistd.h>
>>>> #include <asm/unistd_32.h>
>>>> #include <sys/syscall.h>
>>>>
>>>> int main(int argc, char *argv[])
>>>> {
>>>>      pid_t pid = getpid();
>>>>      int ret;
>>>>
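>>>>      /* write the checkpoint image to stdout; the message below goes
>>>>       * to stderr so that it stays out of the image */
>>>>      ret = syscall(__NR_checkpoint, pid, STDOUT_FILENO, 0);
>>>>      if (ret < 0)
>>>>          perror("checkpoint");
>>>>
>>>>      fprintf(stderr, "hello, world!  (ret=%d)\n", ret);
>>>>
>>>>      /* spin until interrupted (ctrl-c) */
>>>>      while (1)
>>>>          ;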
>>>>
>>>>      return 0;
>>>> }
>>>>
>>>> ============================== rstr.c ================================
>>>>
>>>> #define _GNU_SOURCE        /* or _BSD_SOURCE or _SVID_SOURCE */
>>>>
>>>> #include <stdio.h>
>>>> #include <stdlib.h>
>>>> #include <errno.h>
>>>> #include <fcntl.h>
>>>> #include <unistd.h>
>>>> #include <asm/unistd_32.h>
>>>> #include <sys/syscall.h>
>>>>
>>>> int main(int argc, char *argv[])
>>>> {
>>>>      pid_t pid = getpid();
>>>>      int ret;
>>>>
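>>>>      /* read the checkpoint image from stdin; on success the task
>>>>       * resumes from the checkpointed state and never returns here */
>>>>      ret = syscall(__NR_restart, pid, STDIN_FILENO, 0);
>>>>      if (ret < 0)
>>>>          perror("restart");
>>>>
>>>>      printf("should not reach here !\n");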
>>>>
>>>>      return 0;
>>>> }
_______________________________________________
Containers mailing list
Containers@xxxxxxxxxxxxxxxxxxxxxxxxxx
https://lists.linux-foundation.org/mailman/listinfo/containers