On Sat, 1 May 2010 10:15:02 -0400 Oren Laadan wrote:

> Covers application checkpoint/restart, overall design, interfaces,
> usage, shared objects, and and checkpoint image format.
>
> Signed-off-by: Oren Laadan <orenl@xxxxxxxxxxxxxxx>
> Signed-off-by: Dave Hansen <dave@xxxxxxxxxxxxxxxxxx>
> Acked-by: Serge E. Hallyn <serue@xxxxxxxxxx>
> Tested-by: Serge E. Hallyn <serue@xxxxxxxxxx>
> ---
>  Documentation/checkpoint/checkpoint.c      |   38 +++
>  Documentation/checkpoint/readme.txt        |  370 ++++++++++++++++++++++++++++
>  Documentation/checkpoint/self_checkpoint.c |   69 +++++
>  Documentation/checkpoint/self_restart.c    |   40 +++
>  Documentation/checkpoint/usage.txt         |  247 +++++++++++++++++++
>  5 files changed, 764 insertions(+), 0 deletions(-)
>  create mode 100644 Documentation/checkpoint/checkpoint.c
>  create mode 100644 Documentation/checkpoint/readme.txt
>  create mode 100644 Documentation/checkpoint/self_checkpoint.c
>  create mode 100644 Documentation/checkpoint/self_restart.c
>  create mode 100644 Documentation/checkpoint/usage.txt

> diff --git a/Documentation/checkpoint/readme.txt b/Documentation/checkpoint/readme.txt
> new file mode 100644
> index 0000000..4fa5560
> --- /dev/null
> +++ b/Documentation/checkpoint/readme.txt
> @@ -0,0 +1,370 @@
> +
...
> +In contrast, when checkpointing a subtree of a container it is up to
> +the user to ensure that dependencies either don't exist or can be
> +safely ignored. This is useful, for instance, for HPC scenarios or
> +even a user that would like to periodically checkpoint a long-running
               who
> +batch job.
> +
...
> +
> +Checkpoint image format
> +=======================
> +
...
> +
> +The container configuration section containers information that is
                                       contains
> +global to the container. Security (LSM) configuration is one example.
> +Network configuration and container-wide mounts may also go here, so
> +that the userspace restart coordinator can re-create a suitable
> +environment.
> +
...
> +
> +Then the state of all tasks is saved, in the order that they appear in
> +the tasks array above. For each state, we save data like task_struct,
> +namespaces, open files, memory layout, memory contents, cpu state,
                                                           CPU (throughout, please)
> +signals and signal handlers, etc. For resources that are shared among
> +multiple processes, we first checkpoint said resource (and only once),
> +and in the task data we give a reference to it. More about shared
> +resources below.
> +
...
> +
> +Shared objects
> +==============
> +
> +Many resources may be shared by multiple tasks (e.g. file descriptors,
> +memory address space, etc), or even have multiple references from
                         etc.),
> +other resources (e.g. a single inode that represents two ends of a
> +pipe).
> +
...
> +Memory contents format
> +======================
> +
> +The memory contents of a given memory address space (->mm) is dumped
                                                              are (I think)
> +as a sequence of vma objects, represented by 'struct ckpt_hdr_vma'.
> +This header details the vma properties, and a reference to a file
> +(if file backed) or an inode (or shared memory) object.
> +
> +The vma header is followed by the actual contents - but only those
> +pages that need to be saved, i.e. dirty pages. They are written in
> +chunks of data, where each chunks contains a header that indicates
                              chunk
> +that number of pages in the chunk, followed by an array of virtual
   the
> +addresses and then an array of actual page contents. The last chunk
> +holds zero pages.
> +
...
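
An aside on the chunk layout described in the quoted text above: it can
be pictured with a small userspace sketch. Everything named here is an
assumption made for illustration only: 'struct chunk_hdr', its
'nr_pages' field, the fixed 4096-byte page size, the 64-bit virtual
addresses, and both helper functions. None of it is taken from the
patch, which defines its own ckpt_hdr_* types.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

#define PAGE_SZ 4096u	/* assumed page size, illustration only */

/* Hypothetical chunk header; NOT the on-disk struct used by the patch. */
struct chunk_hdr {
	uint32_t nr_pages;	/* pages in this chunk; 0 marks the last chunk */
};

/* Append one chunk (header, vaddr array, page contents); return new offset. */
static size_t append_chunk(unsigned char *buf, size_t cap, size_t off,
			   const uint64_t *vaddrs, const unsigned char *pages,
			   uint32_t nr_pages)
{
	struct chunk_hdr h = { .nr_pages = nr_pages };
	size_t need = sizeof(h) + (size_t)nr_pages * (sizeof(uint64_t) + PAGE_SZ);

	assert(off <= cap && cap - off >= need);
	memcpy(buf + off, &h, sizeof(h));
	off += sizeof(h);
	if (nr_pages) {
		memcpy(buf + off, vaddrs, nr_pages * sizeof(uint64_t));
		off += nr_pages * sizeof(uint64_t);
		memcpy(buf + off, pages, (size_t)nr_pages * PAGE_SZ);
		off += (size_t)nr_pages * PAGE_SZ;
	}
	return off;
}

/*
 * Walk a buffer laid out as a sequence of chunks ending with a chunk
 * whose header says zero pages. Returns the total number of saved
 * pages, or -1 if the buffer ends before the terminating chunk.
 */
static long count_saved_pages(const unsigned char *buf, size_t len)
{
	size_t off = 0;
	long total = 0;

	for (;;) {
		struct chunk_hdr h;
		size_t need;

		if (len - off < sizeof(h))
			return -1;
		memcpy(&h, buf + off, sizeof(h));
		off += sizeof(h);
		if (h.nr_pages == 0)
			return total;	/* last chunk holds zero pages */

		need = (size_t)h.nr_pages * (sizeof(uint64_t) + PAGE_SZ);
		if (len - off < need)
			return -1;
		off += need;	/* skip the vaddr array, then the pages */
		total += h.nr_pages;
	}
}
```

The reader never needs a separate chunk count: it keeps consuming
chunks until it meets the zero-page header, which is why the text can
say simply that the last chunk holds zero pages.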
> +Kernel interfaces
> +=================
> +
> +* To checkpoint a vma, the 'struct vm_operations_struct' needs to
> +  provide a method ->checkpoint:
> +    int checkpoint(struct ckpt_ctx *, struct vma_struct *)
> +  Restart requires a matching (exported) restore:
> +    int restore(struct ckpt_ctx *, struct mm_struct *, struct ckpt_hdr_vma *)
> +
> +* To checkpoint a file, the 'struct file_operations' needs to provide
> +  the methods ->checkpoint and ->collect:
> +    int checkpoint(struct ckpt_ctx *, struct file *)
> +    int collect(struct ckpt_ctx *, struct file *)
> +  Restart requires a matching (exported) restore:
> +    int restore(struct ckpt_ctx *, struct ckpt_hdr_file *)
> +  For most file systems, generic_file_{checkpoint,restore}() can be
> +  used.
> +
> +* To checkpoint a socket, the 'struct proto_ops' needs to provide
     To checkpoint/restart a socket,
> +  the methods ->checkpoint, ->collect and ->restore:
> +    int checkpoint(struct ckpt_ctx *ctx, struct socket *sock);
> +    int collect(struct ckpt_ctx *ctx, struct socket *sock);
> +    int restore(struct ckpt_ctx *, struct socket *sock, struct ckpt_hdr_socket *h)

> diff --git a/Documentation/checkpoint/usage.txt b/Documentation/checkpoint/usage.txt
> new file mode 100644
> index 0000000..c6fc045
> --- /dev/null
> +++ b/Documentation/checkpoint/usage.txt
> @@ -0,0 +1,247 @@
> +
> +          How to use Checkpoint-Restart
> +    =========================================
> +
> +
> +API
> +===
> +
> +The API consists of three new system calls:
> +
> +* long checkpoint(pid_t pid, int fd, unsigned long flag, int logfd);
                                                      flags,
> +
> +  Checkpoint a (sub-)container whose root task is identified by @pid,
> +  to the open file indicated by @fd. If @logfd isn't -1, it indicates
> +  an open file to which error and debug messages are written. @flags
> +  may be one or more of:
> +    - CHECKPOINT_SUBTREE : allow checkpoint of sub-container
> +  (other value are not allowed).
> +
> +  Returns: a positive checkpoint identifier (ckptid) upon success, 0 if
> +  it returns from a restart, and -1 if an error occurs. The ckptid will
> +  uniquely identify a checkpoint image, for as long as the checkpoint
> +  is kept in the kernel (e.g. if one wishes to keep a checkpoint, or a
> +  partial checkpoint, residing in kernel memory).
> +
> +* long sys_restart(pid_t pid, int fd, unsigned long flags, int logfd);
> +
> +  Restart a process hierarchy from a checkpoint image that is read from
> +  the blob stored in the file indicated by @fd. If @logfd isn't -1, it
> +  indicates an open file to which error and debug messages are written.
> +  @flags will have future meaning (must be 0 for now). @pid indicates
> +  the root of the hierarchy as seen in the coordinator's pid-namespace,
> +  and is expected to be a child of the coordinator. @flags may be one
> +  or more of:
> +    - RESTART_TASKSELF : (self) restart of a single process
> +    - RESTART_FROEZN : processes remain frozen once restart completes
                 FROZEN ?
> +    - RESTART_GHOST : process is a ghost (placeholder for a pid)

about @flags:  Above says both of these:
a) @flags will have future meaning (must be 0 for now)
b) @flags may be one or more of:

so please decide which one it is ;)

> +  (Note that this argument may mean 'ckptid' to identify an in-kernel
> +  checkpoint image, with some @flags in the future).
> +
> +  Returns: -1 if an error occurs, 0 on success when restarting from a
> +  "self" checkpoint, and return value of system call at the time of the
> +  checkpoint when restarting from an "external" checkpoint.
> +
...
> +
> +Sysctl/proc
> +===========
> +
> +/proc/sys/kernel/ckpt_unpriv_allowed  [default = 1]
> +  controls whether c/r operation is allowed for unprivileged users
                      C/R
> +
> +
> +Operation
> +=========
> +
> +The granularity of a checkpoint usually is a process hierarchy. The
> +'pid' argument is interpreted in the caller's pid namespace. So to
> +checkpoint a container whose init task (pid 1 in that pidns) appears
> +as pid 3497 the caller's pidns, the caller must use pid 3497. Passing
> +pid 1 will attempt to checkpoint the caller's container, and if the
> +caller isn't privileged and init is owned by root, it will fail.
> +
> +Unless the CHECKPOINT_SUBTREE flag is set, if the caller passes a pid
> +which does not refer to a container's init task, then sys_checkpoint()
> +would return -EINVAL.
   returns -EINVAL.

...
> +
> +
> +User tools
> +==========
> +
> +* checkpoint(1): a tool to perform a checkpoint of a container/subtree
> +* restart(1): a tool to restart a container/subtree
> +* ckptinfo: a tool to examine a checkpoint image
> +
> +It is best to use the dedicated user tools for checkpoint and restart.
> +
> +If you insist, then here is a code snippet that illustrates how a
> +checkpoint is initiated by a process inside a container - the logic is
> +similar to fork():
> +    ...
> +    ckptid = checkpoint(0, ...);
> +    switch (crid) {
              (ckptid) ?
> +    case -1:
> +        perror("checkpoint failed");
> +        break;
> +    default:
> +        fprintf(stderr, "checkpoint succeeded, CRID=%d\n", ret);
                                                              s/ret/ckptid/ ?
> +        /* proceed with execution after checkpoint */
> +        ...
> +        break;
> +    case 0:
> +        fprintf(stderr, "returned after restart\n");
> +        /* proceed with action required following a restart */
> +        ...
> +        break;
> +    }
> +    ...
> +
> +And to initiate a restart, the process in an empty container can use
> +logic similar to execve():
> +    ...
> +    if (restart(pid, ...) < 0)
> +        perror("restart failed");
> +    /* only get here if restart failed */
> +    ...
> +
> +Note, that the code also supports "self" checkpoint, where a process
   Note that
> +can checkpoint itself. This mode does not capture the relationships of
> +the task with other tasks, or any shared resources. It is useful for
> +application that wish to be able to save and restore their state.
   applications
> +They will either not use (or care about) shared resources, or they
> +will be aware of the operations and adapt suitably after a restart.
> +The code above can also be used for "self" checkpoint.
> +
> +
> +You may find the following sample programs useful:
> +
> +* checkpoint.c: accepts a 'pid' and checkpoint that task to stdout
                                       checkpoints
> +* self_checkpoint.c: a simple test program doing self-checkpoint
> +* self_restart.c: restarts a (self-) checkpoint image from stdin
> +
> +See also the utilities 'checkpoint' and 'restart' (from user-cr).
> +
> +
> +"External" checkpoint
> +=====================
> +
> +To do "external" checkpoint, you need to first freeze that other task
> +either using the freezer cgroup.

eh?  cannot parse that.

> +
> +Restart does not preserve the original PID yet, (because we haven't
> +solved yet the fork-with-specific-pid issue). In a real scenario, you
> +probably want to first create a new names space, and have the init
                                       namespace,
> +task there call 'sys_restart()'.
> +
> +I tested it this way:
...


---
~Randy
*** Remember to use Documentation/SubmitChecklist when testing your code ***