Oren Laadan wrote:
> Disclaimer: long reply :)
>
> Serge E. Hallyn wrote:
>> Quoting Oren Laadan (orenl@xxxxxxxxxxxxxxx):
>>> In the recent mini-summit at OLS 2008 and the following days it was
>>> agreed to tackle the checkpoint/restart (CR) by beginning with a very
>>> simple case: save and restore a single task, with simple memory
>>> layout, disregarding other task state such as files, signals etc.
>>>
>>> Following these discussions I coded a prototype that can do exactly
>>> that, as a starter. This code adds two system calls - sys_checkpoint
>>> and sys_restart - that a task can call to save and restore its state
>>> respectively. It also demonstrates how the checkpoint image file can
>>> be formatted, as well as show its nested nature (e.g. cr_write_mm()
>>> -> cr_write_vma() nesting).
>>>
>>> The state that is saved/restored is the following:
>>> * some of the task_struct
>>> * some of the thread_struct and thread_info
>>> * the cpu state (including FPU)
>>> * the memory address space
>>>
>>> [The patch is against commit fb2e405fc1fc8b20d9c78eaa1c7fd5a297efde43
>>> of Linus's tree (uhhh.. don't ask why), but against tonight's head too].
>>>
>>> In the current code, sys_checkpoint will checkpoint the current task,
>>> although the logic exists to checkpoint other tasks (not in the
>>> checkpointee's execution context). A simple loop will extend this to
>>> handle multiple processes. sys_restart restarts the current tasks, and
>>> with multiple tasks each task will call the syscall independently.
>> I assume that approach worked in Zap, so there must be a simple solution
>> to this, but I don't see how having each process in a container
>> independently call sys_restart works for sharing. Oh, or is that where
>
> The main reason to do that (and I thought openvz works similarly ?) is
> that I want to re-use as much as possible the existing kernel functionality.
> Restart differs from checkpoint in that you have to construct new resources
> as opposed to only inspect existing resources. To inspect - you only need
> a reference to the object and then to obtain its state by accessing it. In
> contrast, to construct, you need to create a new resource.
>
> In almost all cases, creating a resource for a process is easiest if done by
> the process itself. For instance - to restore the memory map, you want the
> process that owns the target mm to call mmap() (in particular, the lower
> level and more convenient for us do_mmap_pgoff() function). If the process
> that restores a given vma didn't own that mm, it would take much more pain
> to build the vma into a "foreign" mm.
>
> Thus, there is a huge advantage of doing everything in-context of the target
> process, that is - we can re-use the existing kernel code (and spirit) to
> create the resources, instead of having to hand-craft them carefully with
> specialized code.
>
>> a 'container restart context' comes in? An nsproxy has a pointer to a
>
> More or less. At a first approximation, this is how I envision it:
>
> 0) in user space, a new (empty) container will be created with all the
> needed settings for the file system etc (mounts .. and the like)
>
> 1) the first task (container init) will call sys_restart with the checkpoint
> image file.
>
> 2) the code will verify the header, then read in the global section; it will
> create a restart-context which will be referenced from the container-object
> (one option we considered is to have the freezer-cgroup be that object).
>
> 3) using the info from that section, it will create the task tree (forest)
> to be restored. In particular, new tasks will be created and each will end
> up in do_restart_task() inside the kernel.
>
> [note that in Zap, step 3 is still done in user space...]
>
> Since all tasks live in the container, they will all have access to the
> restart-context, through which all coordination is done.
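As a user-space analogy for the in-context argument above (the process that
owns the mm is the one that calls mmap()), here is a minimal sketch in C. It
is not taken from the patch: struct saved_region and restore_region() are
hypothetical names, the record layout is an assumption, and the real code
would call do_mmap_pgoff() inside the kernel while reading the region from
the checkpoint image.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <stddef.h>
#include <string.h>
#include <sys/mman.h>

/* Hypothetical description of one saved memory region (not the patch's format). */
struct saved_region {
        void *start;       /* address the region occupied at checkpoint time */
        size_t len;        /* length in bytes, a multiple of the page size */
        int prot;          /* PROT_* bits the region should end up with */
        const void *data;  /* saved contents, len bytes */
};

/*
 * Recreate the region in the caller's own address space.
 * Returns 0 on success, -1 on failure (errno is set by mmap/mprotect).
 */
int restore_region(const struct saved_region *r)
{
        void *p;

        assert(r != NULL && r->len > 0 && r->data != NULL);

        /* Map writable first so the saved contents can be copied in. */
        p = mmap(r->start, r->len, PROT_READ | PROT_WRITE,
                 MAP_PRIVATE | MAP_ANONYMOUS | MAP_FIXED, -1, 0);
        if (p == MAP_FAILED)
                return -1;

        memcpy(p, r->data, r->len);

        /* Only now drop to the protection recorded at checkpoint time. */
        if (mprotect(p, r->len, r->prot) != 0)
                return -1;

        return 0;
}
```

The task being restored would call restore_region() once per saved region.
Note that MAP_FIXED silently replaces anything already mapped in that range,
so a real implementation would have to make sure the range is free first.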
>
> At first, the restart will be performed _one task at a time_, in the order
> they were dumped. So while the init task restores itself, the remaining
> tasks sleep. When the init task finishes - it will wake the next in line
> and so on. The last one will wake the init task to finalize the work. So:
>
> 4) each task waits (sleeps) until it is prompted to restore its own state.
> When it completes, it wakes up the next task in line and goes to a freeze
> state.
>
> 5) the init task finalizes the restart, and either completes the freeze or
> unfreezes the container, depending on what the user requested.
>
> This scheme makes sense because we assume that the data is streamed. So it
> does not make much sense to try to restart the 5th job before the 2nd job
> because the data isn't there yet. Moreover, if they refer to the same shared
> object, job#5 will have to wait for job#2 to create the object, since its
> state was saved with that job.
>
> In the future, to speed up the process by restarting multiple tasks
> concurrently, we'll have to read in data from the stream into a buffer
> (read-ahead) and then restarting tasks could skip data that doesn't belong
> to them; while they may still need to wait for shared resources to be
> created, other work can be done in parallel in the meanwhile.
>
>> checkpoint/restart context which the first task creates and all tasks
>> reference and update? So task 5 created its mm_struct, task 6 is
>> supposed to use the same mm_struct, so it finds that out from the
>> context? I wonder whether that would start to become complicated
>> when checkpointing nested containers.
>
> Yes, that's what I had in mind - the restart context holds a hash table
> that references all the shared objects that are created during the restart.
> (Like the checkpoint context that will hold references to objects that
> have been inspected).
>
> Checkpointing nested containers ??? Why ?
> I'm not sure why that would be a problem; but sure, we need to discuss
> that using a concrete use-case and identify the needs and difficulties.

In the current proposal, we talked about creating an empty container, with
the first process calling sys_restart. With nested containers, we have to
CR the container itself, no? I don't see how we can CR a nested container
otherwise :/

>> So I still prefer the idea that the init process calls restart, and that
>> creates all the tasks in the container and rebuilds them. But you have
>> code, so you win :)
>
> I agree: the init task calls restart, and that creates all the tasks in
> the container. And then, make each of them call do_restart_task() in
> some way :)
>
>> Anyway I'm still reading through patch 2. It looks great to me - the
>> only comments I have written so far are:
>> 1. why not just store LINUX_VERSION_CODE in the header instead
>> of breaking it up
>
> hmph ... good question. Avoid 32/64 bit conversion complications ?
>
>> 2. the x86-specific code should of course go into arch-specific
>> directories, but
>
> of course. I left it there for simplicity right now.
>
>> neither of which really is worth the bother right now imo :)
>>
>>> (Actually, to checkpoint outside the context of a task, it is also
>>> necessary to handle restart-block logic when saving/restoring the
>>> thread data).
>>>
>>> It takes longer to describe what isn't implemented or supported by
>>> this prototype ... basically everything that isn't as simple as the
>>> above.
>>>
>>> As for containers - since we still don't have a representation for a
>>> container, this patch has no notion of a container. The tests for
>>> consistent namespaces (and isolation) are also omitted.
>>>
>>> Below are two example programs: one uses checkpoint (called ckpt) and
>>> one uses restart (called rstr). Execute like this (as a superuser):
>>>
>>> orenl:~/test$ ./ckpt > out.1
>>> hello, world! (ret=1) <-- sys_checkpoint returns positive id
>>> <-- ctrl-c
>>> orenl:~/test$ ./ckpt > out.2
>>> hello, world! (ret=2)
>>> <-- ctrl-c
>>> orenl:~/test$ ./rstr < out.1
>>> hello, world! (ret=0) <-- sys_restart returns 0
>>>
>>> (if you check the output of ps, you'll see that "rstr" changed its
>>> name to "ckpt", as expected).
>>>
>>> Hoping this will accelerate the discussion. Comments are welcome.
>>> Let the fun begin :)
>>>
>>> Oren.
>>>
>>>
>>> ============================== ckpt.c ================================
>>>
>>> #define _GNU_SOURCE /* or _BSD_SOURCE or _SVID_SOURCE */
>>>
>>> #include <stdio.h>
>>> #include <stdlib.h>
>>> #include <errno.h>
>>> #include <fcntl.h>
>>> #include <unistd.h>
>>> #include <asm/unistd_32.h>
>>> #include <sys/syscall.h>
>>>
>>> int main(int argc, char *argv[])
>>> {
>>>         pid_t pid = getpid();
>>>         int ret;
>>>
>>>         ret = syscall(__NR_checkpoint, pid, STDOUT_FILENO, 0);
>>>         if (ret < 0)
>>>                 perror("checkpoint");
>>>
>>>         fprintf(stderr, "hello, world! (ret=%d)\n", ret);
>>>
>>>         while (1)
>>>                 ;
>>>
>>>         return 0;
>>> }
>>>
>>> ============================== rstr.c ================================
>>>
>>> #define _GNU_SOURCE /* or _BSD_SOURCE or _SVID_SOURCE */
>>>
>>> #include <stdio.h>
>>> #include <stdlib.h>
>>> #include <errno.h>
>>> #include <fcntl.h>
>>> #include <unistd.h>
>>> #include <asm/unistd_32.h>
>>> #include <sys/syscall.h>
>>>
>>> int main(int argc, char *argv[])
>>> {
>>>         pid_t pid = getpid();
>>>         int ret;
>>>
>>>         ret = syscall(__NR_restart, pid, STDIN_FILENO, 0);
>>>         if (ret < 0)
>>>                 perror("restart");
>>>
>>>         printf("should not reach here !\n");
>>>
>>>         return 0;
>>> }

_______________________________________________
Containers mailing list
Containers@xxxxxxxxxxxxxxxxxxxxxxxxxx
https://lists.linux-foundation.org/mailman/listinfo/containers
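
The thread above describes a restart context holding a hash table of the
shared objects created during restart (task 5 creates its mm_struct, task 6
finds the same one through the context). A minimal user-space sketch of that
lookup in C follows. Everything here is an assumption made for illustration:
the names restart_ctx, ctx_register() and ctx_lookup(), the integer object
ids, and the fixed-size table are hypothetical and do not come from the patch.

```c
#include <assert.h>
#include <stddef.h>

#define OBJ_TABLE_SIZE 64  /* hypothetical capacity, must be a power of two */

/* One slot: the id recorded in the image and the object built for it. */
struct obj_entry {
        int id;     /* 0 marks an empty slot, so valid ids are > 0 */
        void *ptr;  /* object created during restart */
};

/* Hypothetical restart context shared by all restarting tasks. */
struct restart_ctx {
        struct obj_entry table[OBJ_TABLE_SIZE];
};

/* Linear probing: return the slot holding id, or the first empty slot. */
static struct obj_entry *ctx_slot(struct restart_ctx *ctx, int id)
{
        unsigned int i, h = (unsigned int)id;

        assert(ctx != NULL && id > 0);
        for (i = 0; i < OBJ_TABLE_SIZE; i++) {
                struct obj_entry *e = &ctx->table[(h + i) & (OBJ_TABLE_SIZE - 1)];

                if (e->id == id || e->id == 0)
                        return e;
        }
        return NULL;  /* table is full */
}

/* Return the object already created for id, or NULL if none exists yet. */
void *ctx_lookup(struct restart_ctx *ctx, int id)
{
        struct obj_entry *e = ctx_slot(ctx, id);

        return (e != NULL && e->id == id) ? e->ptr : NULL;
}

/* Record a newly created object; -1 if id is already present or table is full. */
int ctx_register(struct restart_ctx *ctx, int id, void *ptr)
{
        struct obj_entry *e = ctx_slot(ctx, id);

        if (e == NULL || e->id == id)
                return -1;
        e->id = id;
        e->ptr = ptr;
        return 0;
}
```

In the one-task-at-a-time scheme no locking is needed, because only the task
whose turn it is touches the table: it calls ctx_lookup() first, and only
creates and registers the object when the lookup returns NULL. A concurrent
restart would need a lock around both calls.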