Thanks for reading through carefully and pointing out the glitches and
inconsistencies. I'll fix them for the next post.

Oren.

On 05/06/2010 04:27 PM, Randy Dunlap wrote:
> On Sat, 1 May 2010 10:15:02 -0400 Oren Laadan wrote:
>
>> Covers application checkpoint/restart, overall design, interfaces,
>> usage, shared objects, and and checkpoint image format.
>>
>> Signed-off-by: Oren Laadan <orenl@xxxxxxxxxxxxxxx>
>> Signed-off-by: Dave Hansen <dave@xxxxxxxxxxxxxxxxxx>
>> Acked-by: Serge E. Hallyn <serue@xxxxxxxxxx>
>> Tested-by: Serge E. Hallyn <serue@xxxxxxxxxx>
>> ---
>>  Documentation/checkpoint/checkpoint.c      |   38 +++
>>  Documentation/checkpoint/readme.txt        |  370 ++++++++++++++++++++++++++++
>>  Documentation/checkpoint/self_checkpoint.c |   69 +++++
>>  Documentation/checkpoint/self_restart.c    |   40 +++
>>  Documentation/checkpoint/usage.txt         |  247 +++++++++++++++++++
>>  5 files changed, 764 insertions(+), 0 deletions(-)
>>  create mode 100644 Documentation/checkpoint/checkpoint.c
>>  create mode 100644 Documentation/checkpoint/readme.txt
>>  create mode 100644 Documentation/checkpoint/self_checkpoint.c
>>  create mode 100644 Documentation/checkpoint/self_restart.c
>>  create mode 100644 Documentation/checkpoint/usage.txt
>
>> diff --git a/Documentation/checkpoint/readme.txt b/Documentation/checkpoint/readme.txt
>> new file mode 100644
>> index 0000000..4fa5560
>> --- /dev/null
>> +++ b/Documentation/checkpoint/readme.txt
>> @@ -0,0 +1,370 @@
>> +
> ...
>> +In contrast, when checkpointing a subtree of a container it is up to
>> +the user to ensure that dependencies either don't exist or can be
>> +safely ignored. This is useful, for instance, for HPC scenarios or
>> +even a user that would like to periodically checkpoint a long-running
>
> who
>
>> +batch job.
>> +
> ...
>
>> +
>> +Checkpoint image format
>> +=======================
>> +
> ...
>
>> +
>> +The container configuration section containers information that is
>
> contains
>
>> +global to the container.
>> +Security (LSM) configuration is one example.
>> +Network configuration and container-wide mounts may also go here, so
>> +that the userspace restart coordinator can re-create a suitable
>> +environment.
>> +
> ...
>
>> +
>> +Then the state of all tasks is saved, in the order that they appear in
>> +the tasks array above. For each state, we save data like task_struct,
>> +namespaces, open files, memory layout, memory contents, cpu state,
>
> CPU (throughout, please)
>
>> +signals and signal handlers, etc. For resources that are shared among
>> +multiple processes, we first checkpoint said resource (and only once),
>> +and in the task data we give a reference to it. More about shared
>> +resources below.
>> +
> ...
>
>> +
>> +Shared objects
>> +==============
>> +
>> +Many resources may be shared by multiple tasks (e.g. file descriptors,
>> +memory address space, etc), or even have multiple references from
>
> etc.),
>
>> +other resources (e.g. a single inode that represents two ends of a
>> +pipe).
>> +
> ...
>
>> +Memory contents format
>> +======================
>> +
>> +The memory contents of a given memory address space (->mm) is dumped
>
> are (I think)
>
>> +as a sequence of vma objects, represented by 'struct ckpt_hdr_vma'.
>> +This header details the vma properties, and a reference to a file
>> +(if file backed) or an inode (or shared memory) object.
>> +
>> +The vma header is followed by the actual contents - but only those
>> +pages that need to be saved, i.e. dirty pages. They are written in
>> +chunks of data, where each chunks contains a header that indicates
>
> chunk
>
>> +that number of pages in the chunk, followed by an array of virtual
>
> the
>
>> +addresses and then an array of actual page contents. The last chunk
>> +holds zero pages.
>> +
> ...
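
To make the chunk layout quoted above concrete, here is a minimal
user-space sketch in C of a reader that walks such a sequence: a header
with a page count, then that many virtual addresses, then that many
pages, ending at a chunk that holds zero pages. The header struct, the
4 KiB page size, the 64-bit address width and every sketch_* name are
assumptions made up for illustration; they are not the real image
format or the kernel's definitions.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define SKETCH_PAGE_SIZE 4096u	/* assumption: 4 KiB pages */

/* Hypothetical chunk header, NOT the kernel's actual structure. */
struct sketch_chunk_hdr {
	uint32_t nr_pages;	/* pages in this chunk; 0 marks the last chunk */
};

/*
 * Walk a buffer laid out as a sequence of chunks:
 *   header, nr_pages virtual addresses, nr_pages page contents.
 * Stops at the terminating chunk that holds zero pages.
 * Returns the total number of pages seen, or -1 if the buffer is
 * truncated before the terminating chunk.
 */
long sketch_count_pages(const unsigned char *buf, size_t len)
{
	size_t off = 0;
	long total = 0;

	assert(buf != NULL);

	for (;;) {
		struct sketch_chunk_hdr hdr;
		uint64_t need;

		/* Is there room for another chunk header? */
		if (len - off < sizeof(hdr))
			return -1;
		memcpy(&hdr, buf + off, sizeof(hdr));
		off += sizeof(hdr);

		/* The last chunk holds zero pages. */
		if (hdr.nr_pages == 0)
			return total;

		/* Address array followed by page contents (assumed widths). */
		need = (uint64_t)hdr.nr_pages *
		       (sizeof(uint64_t) + SKETCH_PAGE_SIZE);
		if ((uint64_t)(len - off) < need)
			return -1;
		off += (size_t)need;
		total += (long)hdr.nr_pages;
	}
}
```

The function only counts pages and checks bounds; an actual restore
path would also have to map each address and copy the page into place.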
>
>> +Kernel interfaces
>> +=================
>> +
>> +* To checkpoint a vma, the 'struct vm_operations_struct' needs to
>> +  provide a method ->checkpoint:
>> +    int checkpoint(struct ckpt_ctx *, struct vma_struct *)
>> +  Restart requires a matching (exported) restore:
>> +    int restore(struct ckpt_ctx *, struct mm_struct *, struct ckpt_hdr_vma *)
>> +
>> +* To checkpoint a file, the 'struct file_operations' needs to provide
>> +  the methods ->checkpoint and ->collect:
>> +    int checkpoint(struct ckpt_ctx *, struct file *)
>> +    int collect(struct ckpt_ctx *, struct file *)
>> +  Restart requires a matching (exported) restore:
>> +    int restore(struct ckpt_ctx *, struct ckpt_hdr_file *)
>> +  For most file systems, generic_file_{checkpoint,restore}() can be
>> +  used.
>> +
>> +* To checkpoint a socket, the 'struct proto_ops' needs to provide
>
> To checkpoint/restart a socket,
>
>> +  the methods ->checkpoint, ->collect and ->restore:
>> +    int checkpoint(struct ckpt_ctx *ctx, struct socket *sock);
>> +    int collect(struct ckpt_ctx *ctx, struct socket *sock);
>> +    int restore(struct ckpt_ctx *, struct socket *sock, struct ckpt_hdr_socket *h)
>
>
>> diff --git a/Documentation/checkpoint/usage.txt b/Documentation/checkpoint/usage.txt
>> new file mode 100644
>> index 0000000..c6fc045
>> --- /dev/null
>> +++ b/Documentation/checkpoint/usage.txt
>> @@ -0,0 +1,247 @@
>> +
>> +	      How to use Checkpoint-Restart
>> +	=========================================
>> +
>> +
>> +API
>> +===
>> +
>> +The API consists of three new system calls:
>> +
>> +* long checkpoint(pid_t pid, int fd, unsigned long flag, int logfd);
>
> flags,
>
>> +
>> +  Checkpoint a (sub-)container whose root task is identified by @pid,
>> +  to the open file indicated by @fd. If @logfd isn't -1, it indicates
>> +  an open file to which error and debug messages are written. @flags
>> +  may be one or more of:
>> +    - CHECKPOINT_SUBTREE : allow checkpoint of sub-container
>> +  (other value are not allowed).
>> +
>> +  Returns: a positive checkpoint identifier (ckptid) upon success, 0 if
>> +  it returns from a restart, and -1 if an error occurs. The ckptid will
>> +  uniquely identify a checkpoint image, for as long as the checkpoint
>> +  is kept in the kernel (e.g. if one wishes to keep a checkpoint, or a
>> +  partial checkpoint, residing in kernel memory).
>> +
>> +* long sys_restart(pid_t pid, int fd, unsigned long flags, int logfd);
>> +
>> +  Restart a process hierarchy from a checkpoint image that is read from
>> +  the blob stored in the file indicated by @fd. If @logfd isn't -1, it
>> +  indicates an open file to which error and debug messages are written.
>> +  @flags will have future meaning (must be 0 for now). @pid indicates
>> +  the root of the hierarchy as seen in the coordinator's pid-namespace,
>> +  and is expected to be a child of the coordinator. @flags may be one
>> +  or more of:
>> +    - RESTART_TASKSELF : (self) restart of a single process
>> +    - RESTART_FROEZN : processes remain frozen once restart completes
>
> FROZEN ?
>
>> +    - RESTART_GHOST : process is a ghost (placeholder for a pid)
>
> about @flags: Above says both of these:
> a) @flags will have future meaning (must be 0 for now)
> b) @flags may be one or more of:
>
> so please decide which one it is ;)
>
>> +  (Note that this argument may mean 'ckptid' to identify an in-kernel
>> +  checkpoint image, with some @flags in the future).
>> +
>> +  Returns: -1 if an error occurs, 0 on success when restarting from a
>> +  "self" checkpoint, and return value of system call at the time of the
>> +  checkpoint when restarting from an "external" checkpoint.
>> +
> ...
>> +
>> +Sysctl/proc
>> +===========
>> +
>> +/proc/sys/kernel/ckpt_unpriv_allowed [default = 1]
>> +  controls whether c/r operation is allowed for unprivileged users
>
> C/R
>
>> +
>> +
>> +Operation
>> +=========
>> +
>> +The granularity of a checkpoint usually is a process hierarchy.
>> +The 'pid' argument is interpreted in the caller's pid namespace. So to
>> +checkpoint a container whose init task (pid 1 in that pidns) appears
>> +as pid 3497 the caller's pidns, the caller must use pid 3497. Passing
>> +pid 1 will attempt to checkpoint the caller's container, and if the
>> +caller isn't privileged and init is owned by root, it will fail.
>> +
>> +Unless the CHECKPOINT_SUBTREE flag is set, if the caller passes a pid
>> +which does not refer to a container's init task, then sys_checkpoint()
>> +would return -EINVAL.
>
> returns -EINVAL.
>
> ...
>
>> +
>> +
>> +User tools
>> +==========
>> +
>> +* checkpoint(1): a tool to perform a checkpoint of a container/subtree
>> +* restart(1): a tool to restart a container/subtree
>> +* ckptinfo: a tool to examine a checkpoint image
>> +
>> +It is best to use the dedicated user tools for checkpoint and restart.
>> +
>> +If you insist, then here is a code snippet that illustrates how a
>> +checkpoint is initiated by a process inside a container - the logic is
>> +similar to fork():
>> +	...
>> +	ckptid = checkpoint(0, ...);
>> +	switch (crid) {
>
> (ckptid) ?
>
>> +	case -1:
>> +		perror("checkpoint failed");
>> +		break;
>> +	default:
>> +		fprintf(stderr, "checkpoint succeeded, CRID=%d\n", ret);
>
> s/ret/ckptid/ ?
>
>> +		/* proceed with execution after checkpoint */
>> +		...
>> +		break;
>> +	case 0:
>> +		fprintf(stderr, "returned after restart\n");
>> +		/* proceed with action required following a restart */
>> +		...
>> +		break;
>> +	}
>> +	...
>> +
>> +And to initiate a restart, the process in an empty container can use
>> +logic similar to execve():
>> +	...
>> +	if (restart(pid, ...) < 0)
>> +		perror("restart failed");
>> +	/* only get here if restart failed */
>> +	...
>> +
>> +Note, that the code also supports "self" checkpoint, where a process
>
> Note that
>
>> +can checkpoint itself. This mode does not capture the relationships of
>> +the task with other tasks, or any shared resources.
>> +It is useful for
>> +application that wish to be able to save and restore their state.
>
> applications
>
>> +They will either not use (or care about) shared resources, or they
>> +will be aware of the operations and adapt suitably after a restart.
>> +The code above can also be used for "self" checkpoint.
>> +
>> +
>> +You may find the following sample programs useful:
>> +
>> +* checkpoint.c: accepts a 'pid' and checkpoint that task to stdout
>
> checkpoints
>
>> +* self_checkpoint.c: a simple test program doing self-checkpoint
>> +* self_restart.c: restarts a (self-) checkpoint image from stdin
>> +
>> +See also the utilities 'checkpoint' and 'restart' (from user-cr).
>> +
>> +
>> +"External" checkpoint
>> +=====================
>> +
>> +To do "external" checkpoint, you need to first freeze that other task
>> +either using the freezer cgroup.
>
> eh? cannot parse that.
>
>> +
>> +Restart does not preserve the original PID yet, (because we haven't
>> +solved yet the fork-with-specific-pid issue). In a real scenario, you
>> +probably want to first create a new names space, and have the init
>
> namespace,
>
>> +task there call 'sys_restart()'.
>> +
>> +I tested it this way:
>
> ...
>
> ---
> ~Randy
> *** Remember to use Documentation/SubmitChecklist when testing your code ***