Daniel Lezcano wrote:
> Oren Laadan wrote:
>> Disclaimer: long reply :)
>>
>> Serge E. Hallyn wrote:
>>> Quoting Oren Laadan (orenl@xxxxxxxxxxxxxxx):
>>>> In the recent mini-summit at OLS 2008 and the following days it was
>>>> agreed to tackle the checkpoint/restart (CR) by beginning with a very
>>>> simple case: save and restore a single task, with simple memory
>>>> layout, disregarding other task state such as files, signals etc.
>>>>
>>>> Following these discussions I coded a prototype that can do exactly
>>>> that, as a starter. This code adds two system calls - sys_checkpoint
>>>> and sys_restart - that a task can call to save and restore its state
>>>> respectively. It also demonstrates how the checkpoint image file can
>>>> be formatted, as well as shows its nested nature (e.g. cr_write_mm()
>>>> -> cr_write_vma() nesting).
>>>>
>>>> The state that is saved/restored is the following:
>>>> * some of the task_struct
>>>> * some of the thread_struct and thread_info
>>>> * the cpu state (including FPU)
>>>> * the memory address space
>>>>
>>>> [The patch is against commit fb2e405fc1fc8b20d9c78eaa1c7fd5a297efde43
>>>> of Linus's tree (uhhh.. don't ask why), but against tonight's head
>>>> too].
>>>>
>>>> In the current code, sys_checkpoint will checkpoint the current task,
>>>> although the logic exists to checkpoint other tasks (not in the
>>>> checkpointee's execution context). A simple loop will extend this to
>>>> handle multiple processes. sys_restart restarts the current task, and
>>>> with multiple tasks each task will call the syscall independently.
>>> I assume that approach worked in Zap, so there must be a simple solution
>>> to this, but I don't see how having each process in a container
>>> independently call sys_restart works for sharing. Oh, or is that where
>>
>> The main reason to do that (and I thought openvz works similarly ?) is
>> that I want to re-use as much as possible the existing kernel
>> functionality.
>> Restart differs from checkpoint in that you have to construct new
>> resources as opposed to only inspecting existing resources. To inspect -
>> you only need a reference to the object and then to obtain its state by
>> accessing it. In contrast, to construct, you need to create a new
>> resource.
>>
>> In almost all cases, creating a resource for a process is easiest if
>> done by the process itself. For instance - to restore the memory map,
>> you want the process that owns the target mm to call mmap() (in
>> particular, the lower level and more convenient for us
>> do_mmap_pgoff() function). If the process that restores a given vma
>> didn't own that mm, it would take much more pain to build the vma into
>> a "foreign" mm.
>>
>> Thus, there is a huge advantage of doing everything in-context of the
>> target process, that is - we can re-use the existing kernel code (and
>> spirit) to create the resources, instead of having to hand-craft them
>> carefully with specialized code.
>>
>>> a 'container restart context' comes in? An nsproxy has a pointer to a
>>
>> More or less. At a first approximation, this is how I envision it:
>>
>> 0) in user space, a new (empty) container will be created with all the
>> needed settings for the file system etc (mounts .. and the like)
>>
>> 1) the first task (container init) will call sys_restart with the
>> checkpoint image file.
>>
>> 2) the code will verify the header, then read in the global section; it
>> will create a restart-context which will be referenced from the
>> container-object (one option we considered is to have the
>> freezer-cgroup be that object).
>>
>> 3) using the info from that section, it will create the task tree
>> (forest) to be restored. In particular, new tasks will be created and
>> each will end up in do_restart_task() inside the kernel.
>>
>> [note that in Zap, step 3 is still done in user space...]
>>
>> Since all tasks live in the container, they will all have access to the
>> restart-context, through which all coordination is done.
>>
>> At first, the restart will be performed _one task at a time_, in the
>> order they were dumped. So while the init task restores itself, the
>> remaining tasks sleep. When the init task finishes - it will wake the
>> next in line and so on. The last one will wake the init task to
>> finalize the work. So:
>>
>> 4) each task waits (sleeps) until it is prompted to restore its own
>> state. When it completes, it wakes up the next task in line and goes to
>> a freeze state.
>>
>> 5) the init task finalizes the restart, and either completes the freeze
>> or unfreezes the container, depending on what the user requested.
>>
>> This scheme makes sense because we assume that the data is streamed. So
>> it does not make much sense to try to restart the 5th job before the
>> 2nd job because the data isn't there yet. Moreover, if they refer to
>> the same shared object, job#5 will have to wait for job#2 to create the
>> object, since its state was saved with that job.
>>
>> In the future, to speed up the process by concurrently restarting
>> multiple tasks, we'll have to read in data from the stream into a
>> buffer (read-ahead) and then restarting tasks could skip data that
>> doesn't belong to them; while they may still need to wait for shared
>> resources to be created, other work can be done in parallel in the
>> meanwhile.
>>
>>> checkpoint/restart context which the first task creates and all tasks
>>> reference and update? So task 5 created its mm_struct, task 6 is
>>> supposed to use the same mm_struct, so it finds that out from the
>>> context? I wonder whether that would start to become complicated
>>> when checkpointing nested containers.
>>
>> Yes, that's what I had in mind - the restart context holds a hash table
>> that references all the shared objects that are created during the
>> restart.
>> (Like the checkpoint context that will hold references to objects that
>> have been inspected).
>>
>> Checkpointing nested containers ??? Why ?
>> I'm not sure why that would be a problem; but sure, we need to discuss
>> that using a concrete use-case and identify the needs and difficulties.
>
> In the current proposition, we talked about creating an empty container
> and the first process calls sys_restart. With nested container, we have
> to CR the container itself no ? I don't see how we can CR nested
> container otherwise :/

Probably so: with nested containers it is necessary to also save the
state of the "container-tree" (which is sort of analogous to the
task-tree). In particular, this is because tasks in nested containers
are essentially part of the outermost container that is being
checkpointed.

Is this issue specific to the proposed scheme, or a general issue of any
scheme ?

I think that to tackle this, we need to first agree on and implement an
object that represents a container (again, the freezer_cgroup ?).

Oren.

>
>>> So I still prefer the idea that the init process calls restart, and that
>>> creates all the tasks in the container and rebuilds them. But you have
>>> code, so you win :)
>>
>> I agree: the init task calls restart, and that creates all the tasks in
>> the container. And then, make each of them call do_restart_task() in
>> some way :)
>>
>>> Anyway I'm still reading through patch 2. It looks great to me - the
>>> only comments I have written so far are:
>>> 1. why not just store LINUX_VERSION_CODE in the header instead
>>> of breaking it up
>>
>> hmph ... good question. Avoid 32/64 bit conversion complications ?
>>
>>> 2. the x86-specific code should of course go into arch-specific
>>> directories, but
>>
>> of course. I left it there for simplicity right now.
>>
>>> neither of which really is worth the bother right now imo :)
>>>
>>>> (Actually, to checkpoint outside the context of a task, it is also
>>>> necessary to handle restart-block logic when saving/restoring the
>>>> thread data).
>>>>
>>>> It takes longer to describe what isn't implemented or supported by
>>>> this prototype ... basically everything that isn't as simple as the
>>>> above.
>>>>
>>>> As for containers - since we still don't have a representation for a
>>>> container, this patch has no notion of a container. The tests for
>>>> consistent namespaces (and isolation) are also omitted.
>>>>
>>>> Below are two example programs: one uses checkpoint (called ckpt) and
>>>> one uses restart (called rstr). Execute like this (as a superuser):
>>>>
>>>> orenl:~/test$ ./ckpt > out.1
>>>> hello, world! (ret=1)    <-- sys_checkpoint returns positive id
>>>>                          <-- ctrl-c
>>>> orenl:~/test$ ./ckpt > out.2
>>>> hello, world! (ret=2)
>>>>                          <-- ctrl-c
>>>> orenl:~/test$ ./rstr < out.1
>>>> hello, world! (ret=0)    <-- sys_restart returns 0
>>>>
>>>> (if you check the output of ps, you'll see that "rstr" changed its
>>>> name to "ckpt", as expected).
>>>>
>>>> Hoping this will accelerate the discussion. Comments are welcome.
>>>> Let the fun begin :)
>>>>
>>>> Oren.
>>>>
>>>>
>>>> ============================== ckpt.c ================================
>>>>
>>>> #define _GNU_SOURCE        /* or _BSD_SOURCE or _SVID_SOURCE */
>>>>
>>>> #include <stdio.h>
>>>> #include <stdlib.h>
>>>> #include <errno.h>
>>>> #include <fcntl.h>
>>>> #include <unistd.h>
>>>> #include <asm/unistd_32.h>
>>>> #include <sys/syscall.h>
>>>>
>>>> int main(int argc, char *argv[])
>>>> {
>>>> 	pid_t pid = getpid();
>>>> 	int ret;
>>>>
>>>> 	/* the checkpoint image is written to stdout */
>>>> 	ret = syscall(__NR_checkpoint, pid, STDOUT_FILENO, 0);
>>>> 	if (ret < 0)
>>>> 		perror("checkpoint");
>>>>
>>>> 	/* stdout holds the image, so report on stderr */
>>>> 	fprintf(stderr, "hello, world! (ret=%d)\n", ret);
>>>>
>>>> 	/* spin until interrupted (ctrl-c) */
>>>> 	while (1)
>>>> 		;
>>>>
>>>> 	return 0;
>>>> }
>>>>
>>>> ============================== rstr.c ================================
>>>>
>>>> #define _GNU_SOURCE        /* or _BSD_SOURCE or _SVID_SOURCE */
>>>>
>>>> #include <stdio.h>
>>>> #include <stdlib.h>
>>>> #include <errno.h>
>>>> #include <fcntl.h>
>>>> #include <unistd.h>
>>>> #include <asm/unistd_32.h>
>>>> #include <sys/syscall.h>
>>>>
>>>> int main(int argc, char *argv[])
>>>> {
>>>> 	pid_t pid = getpid();
>>>> 	int ret;
>>>>
>>>> 	/* the checkpoint image is read from stdin */
>>>> 	ret = syscall(__NR_restart, pid, STDIN_FILENO, 0);
>>>> 	if (ret < 0)
>>>> 		perror("restart");
>>>>
>>>> 	/* on success, execution resumes inside the restored ckpt image */
>>>> 	printf("should not reach here !\n");
>>>>
>>>> 	return 1;	/* getting here means the restart failed */
>>>> }
_______________________________________________
Containers mailing list
Containers@xxxxxxxxxxxxxxxxxxxxxxxxxx
https://lists.linux-foundation.org/mailman/listinfo/containers
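
To make the one-task-at-a-time ordering of steps 4 and 5 above concrete, here is a minimal user-space sketch. It is not taken from the patch: POSIX threads stand in for the restarting tasks, and the names restart_ctx, restart_task, restore_self and simulate_restart are hypothetical. The calling thread plays the container init: it creates the other "tasks", restores itself first, and is woken by the last task to finalize. Something along these lines might sit behind do_restart_task(), but that is an assumption.

```c
#include <pthread.h>

#define MAX_TASKS 16	/* arbitrary limit for this sketch */

/* Hypothetical restart context shared by all tasks of the container. */
struct restart_ctx {
	pthread_mutex_t lock;
	pthread_cond_t turn_changed;
	int turn;		/* index of the task allowed to restore now */
	int aborted;		/* set when the restart cannot proceed */
	int nrestored;		/* how many tasks have restored so far */
	int order[MAX_TASKS];	/* order in which tasks actually restored */
};

struct task_arg {
	struct restart_ctx *ctx;
	int index;		/* position of this task in the dump order */
};

/* "Restore" one task and wake the next in line; caller holds ctx->lock. */
static void restore_self(struct restart_ctx *ctx, int index)
{
	ctx->order[ctx->nrestored++] = index;
	ctx->turn++;
	pthread_cond_broadcast(&ctx->turn_changed);
}

/* Body of every non-init task: sleep until prompted, then restore. */
static void *restart_task(void *p)
{
	struct task_arg *arg = p;
	struct restart_ctx *ctx = arg->ctx;

	pthread_mutex_lock(&ctx->lock);
	while (!ctx->aborted && ctx->turn != arg->index)
		pthread_cond_wait(&ctx->turn_changed, &ctx->lock);
	if (!ctx->aborted)
		restore_self(ctx, arg->index);
	pthread_mutex_unlock(&ctx->lock);
	return NULL;
}

/* Returns 0 and fills order_out[0..ntasks-1] on success, -1 on error. */
int simulate_restart(int ntasks, int *order_out)
{
	struct restart_ctx ctx = { .turn = 0 };
	struct task_arg args[MAX_TASKS];
	pthread_t tid[MAX_TASKS];
	int i, nstarted = 0, ret = 0;

	if (ntasks < 1 || ntasks > MAX_TASKS)
		return -1;
	pthread_mutex_init(&ctx.lock, NULL);
	pthread_cond_init(&ctx.turn_changed, NULL);

	/* Create the tasks in reverse on purpose: the context, not the
	 * creation order, decides who restores when. */
	for (i = ntasks - 1; i >= 1; i--) {
		args[i].ctx = &ctx;
		args[i].index = i;
		if (pthread_create(&tid[nstarted], NULL, restart_task, &args[i])) {
			ret = -1;
			break;
		}
		nstarted++;
	}

	pthread_mutex_lock(&ctx.lock);
	if (ret < 0) {
		ctx.aborted = 1;		/* release any sleepers */
		pthread_cond_broadcast(&ctx.turn_changed);
	} else {
		restore_self(&ctx, 0);		/* init restores first */
		while (ctx.turn != ntasks)	/* last task wakes init */
			pthread_cond_wait(&ctx.turn_changed, &ctx.lock);
		for (i = 0; i < ntasks; i++)	/* "finalize" */
			order_out[i] = ctx.order[i];
	}
	pthread_mutex_unlock(&ctx.lock);

	for (i = 0; i < nstarted; i++)
		pthread_join(tid[i], NULL);
	pthread_cond_destroy(&ctx.turn_changed);
	pthread_mutex_destroy(&ctx.lock);
	return ret;
}
```

Built with -pthread, calling simulate_restart(5, order) fills order with 0, 1, 2, 3, 4 no matter how the threads get scheduled. The sketch leaves out the shared-object hash table and the freeze/unfreeze at the end.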