Re: [RFC v14-rc2][PATCH 15/29] Restart multiple processes

Sukadev Bhattiprolu <sukadev@xxxxxxxxxxxxxxxxxx> · Mon, 6 Apr 2009 20:33:15 -0700

Couple of nits and couple of not-so minor comments 

Oren Laadan [orenl@xxxxxxxxxxxxxxx] wrote:
| From 7162fef93ee3d9fd30a457dd7b0c7ad0200d5bcb Mon Sep 17 00:00:00 2001
| From: Oren Laadan <orenl@xxxxxxxxxxxxxxx>
| Date: Mon, 30 Mar 2009 15:06:13 -0400
| Subject: [PATCH 15/29] Restart multiple processes
| 
| Restarting of multiple processes expects all restarting tasks to call
| sys_restart(). Once inside the system call, each task will restart
| itself at the same order that they were saved. The internals of the
| syscall will take care of in-kernel synchronization bewteen tasks.
| 
| This patch does _not_ create the task tree in the kernel. Instead it
| assumes that all tasks are created in some way and then invoke the
| restart syscall. You can use the userspace mktree.c program to do
| that.
| 
| The init task (*) has a special role: it allocates the restart context
| (ctx), and coordinates the operation. In particular, it first waits
| until all participating tasks enter the kernel, and provides them the
| common restart context. Once everyone in ready, it begins to restart
| itself.
| 
| In contrast, the other tasks enter the kernel, locate the init task (*)
| and grab its restart context, and then wait for their turn to restore.
| 
| When a task (init or not) completes its restart, it hands the control
| over to the next in line, by waking that task.
| 
| An array of pids (the one saved during the checkpoint) is used to
| synchronize the operation. The first task in the array is the init
| task (*). The restart context (ctx) maintain a "current position" in
| the array, which indicates which task is currently active. Once the
| currently active task completes its own restart, it increments that
| position and wakes up the next task.
| 
| Restart assumes that userspace provides meaningful data, otherwise
| it's garbage-in-garbage-out. In this case, the syscall may block
| indefinitely, but in TASK_INTERRUPTIBLE, so the user can ctrl-c or
| otherwise kill the stray restarting tasks.
| 
| In terms of security, restart runs as the user the invokes it, so it
| will not allow a user to do more than is otherwise permitted by the
| usual system semantics and policy.
| 
| Currently we ignore threads and zombies, as well as session ids.
| Add support for multiple processes
| 
| (*) For containers, restart should be called inside a fresh container
| by the init task of that container. However, it is also possible to
| restart applications not necessarily inside a container, and without
| restoring the original pids of the processes (that is, provided that
| the application can tolerate such behavior). This is useful to allow
| multi-process restart of tasks not isolated inside a container, and
| also for debugging.
| 
| Changelog[v14]:
|   - Revert change to pr_debug(), back to cr_debug()
|   - Discard field 'h.parent'
|   - Check whether calls to cr_hbuf_get() fail
| 
| Changelog[v13]:
|   - Clear root_task->checkpoint_ctx regardless of error condition
|   - Remove unused argument 'ctx' from do_restart_task() prototype
|   - Remove unused member 'pids_err' from 'struct cr_ctx'
| 
| Changelog[v12]:
|   - Replace obsolete cr_debug() with pr_debug()
| 
| Signed-off-by: Oren Laadan <orenl@xxxxxxxxxxxxxxx>
| Acked-by: Serge Hallyn <serue@xxxxxxxxxx>
| ---
|  checkpoint/restart.c       |  224 +++++++++++++++++++++++++++++++++++++++++++-
|  checkpoint/sys.c           |   34 ++++++--
|  include/linux/checkpoint.h |   24 ++++-
|  include/linux/sched.h      |    4 +
|  4 files changed, 272 insertions(+), 14 deletions(-)
| 
| diff --git a/checkpoint/restart.c b/checkpoint/restart.c
| index 96d4d45..adebc1c 100644
| --- a/checkpoint/restart.c
| +++ b/checkpoint/restart.c
| @@ -10,6 +10,7 @@
|  
|  #include <linux/version.h>
|  #include <linux/sched.h>
| +#include <linux/wait.h>
|  #include <linux/file.h>
|  #include <linux/magic.h>
|  #include <linux/checkpoint.h>
| @@ -301,30 +302,245 @@ static int cr_read_task(struct cr_ctx *ctx)
|  	return ret;
|  }
|  
| +/* cr_read_tree - read the tasks tree into the checkpoint context */
| +static int cr_read_tree(struct cr_ctx *ctx)
| +{
| +	struct cr_hdr_tree *hh;
| +	int size, ret;
| +
| +	hh = cr_hbuf_get(ctx, sizeof(*hh));
| +	if (!hh)
| +		return -ENOMEM;
| +
| +	ret = cr_read_obj_type(ctx, hh, sizeof(*hh), CR_HDR_TREE);
| +	if (ret < 0)
| +		goto out;
| +
| +	ret = -EINVAL;
| +	if (hh->tasks_nr < 0)
| +		goto out;
| +
| +	ctx->pids_nr = hh->tasks_nr;
| +	size = sizeof(*ctx->pids_arr) * ctx->pids_nr;
| +	if (size < 0)		/* overflow ? */
| +		goto out;
| +
| +	ctx->pids_arr = kmalloc(size, GFP_KERNEL);
| +	if (!ctx->pids_arr) {
| +		ret = -ENOMEM;
| +		goto out;
| +	}
| +	ret = cr_kread(ctx, ctx->pids_arr, size);
| + out:
| +	cr_hbuf_put(ctx, sizeof(*hh));
| +	return ret;
| +}
| +
| +static int cr_wait_task(struct cr_ctx *ctx)
| +{
| +	pid_t pid = task_pid_vnr(current);
| +
| +	cr_debug("pid %d waiting\n", pid);
| +	return wait_event_interruptible(ctx->waitq, ctx->pids_active == pid);
| +}
| +
| +static int cr_next_task(struct cr_ctx *ctx)
| +{
| +	struct task_struct *tsk;
| +
| +	ctx->pids_pos++;
| +
| +	cr_debug("pids_pos %d %d\n", ctx->pids_pos, ctx->pids_nr);
| +	if (ctx->pids_pos == ctx->pids_nr) {
| +		complete(&ctx->complete);
| +		return 0;
| +	}
| +
| +	ctx->pids_active = ctx->pids_arr[ctx->pids_pos].vpid;
| +
| +	cr_debug("pids_next %d\n", ctx->pids_active);
| +
| +	rcu_read_lock();
| +	tsk = find_task_by_pid_ns(ctx->pids_active, ctx->root_nsproxy->pid_ns);
| +	if (tsk)
| +		wake_up_process(tsk);
| +	rcu_read_unlock();
| +
| +	if (!tsk) {
| +		complete(&ctx->complete);
| +		return -ESRCH;
| +	}
| +
| +	return 0;
| +}
| +
| +/* FIXME: this should be per container */
| +DECLARE_WAIT_QUEUE_HEAD(cr_restart_waitq);
| +
| +static int do_restart_task(pid_t pid)
| +{
| +	struct task_struct *root_task;
| +	struct cr_ctx *ctx = NULL;
| +	int ret;
| +
| +	rcu_read_lock();
| +	root_task = find_task_by_pid_ns(pid, current->nsproxy->pid_ns);
| +	if (root_task)
| +		get_task_struct(root_task);
| +	rcu_read_unlock();
| +
| +	if (!root_task)
| +		return -EINVAL;
| +
| +	/*
| +	 * wait for container init to initialize the restart context, then
| +	 * grab a reference to that context, and if we're the last task to
| +	 * do it, notify the container init.
| +	 */
| +	ret = wait_event_interruptible(cr_restart_waitq,
| +				       root_task->checkpoint_ctx);
| +	if (ret < 0)
| +		goto out;
| +
| +	task_lock(root_task);
| +	ctx = root_task->checkpoint_ctx;
| +	if (ctx)
| +		cr_ctx_get(ctx);
| +	task_unlock(root_task);
| +
| +	if (!ctx) {
| +		ret = -EAGAIN;
| +		goto out;
| +	}
| +
| +	if (atomic_dec_and_test(&ctx->tasks_count))
| +		complete(&ctx->complete);
| +
| +	/* wait for our turn, do the restore, and tell next task in line */
| +	ret = cr_wait_task(ctx);
| +	if (ret < 0)
| +		goto out;
| +	ret = cr_read_task(ctx);
| +	if (ret < 0)
| +		goto out;
| +	ret = cr_next_task(ctx);
| +
| + out:
| +	cr_ctx_put(ctx);
| +	put_task_struct(root_task);
| +	return ret;
| +}
| +
| +/**
| + * cr_wait_all_tasks_start - wait for all tasks to enter sys_restart()
| + * @ctx: checkpoint context
| + *
| + * Called by the container root to wait until all restarting tasks
| + * are ready to restore their state. Temporarily advertises the 'ctx'
| + * on 'current->checkpoint_ctx' so that others can grab a reference
| + * to it, and clears it once synchronization completes. See also the
| + * related code in do_restart_task().
| + */
| +static int cr_wait_all_tasks_start(struct cr_ctx *ctx)
| +{
| +	int ret;
| +
| +	if (ctx->pids_nr == 1)
| +		return 0;
| +
| +	init_completion(&ctx->complete);
| +	current->checkpoint_ctx = ctx;
| +
| +	wake_up_all(&cr_restart_waitq);
| +
| +	ret = wait_for_completion_interruptible(&ctx->complete);
| +
| +	task_lock(current);
| +	current->checkpoint_ctx = NULL;
| +	task_unlock(current);
| +
| +	return ret;
| +}
| +
| +static int cr_wait_all_tasks_finish(struct cr_ctx *ctx)
| +{
| +	int ret;
| +
| +	if (ctx->pids_nr == 1)
| +		return 0;
| +
| +	init_completion(&ctx->complete);
| +
| +	ret = cr_next_task(ctx);
| +	if (ret < 0)
| +		return ret;
| +
| +	ret = wait_for_completion_interruptible(&ctx->complete);
| +	if (ret < 0)
| +		return ret;
| +
| +	return 0;
| +}
| +
|  /* setup restart-specific parts of ctx */
|  static int cr_ctx_restart(struct cr_ctx *ctx, pid_t pid)
|  {
| +	ctx->root_pid = pid;
| +	ctx->root_task = current;
| +	ctx->root_nsproxy = current->nsproxy;
| +
| +	get_task_struct(ctx->root_task);
| +	get_nsproxy(ctx->root_nsproxy);
| +
| +	atomic_set(&ctx->tasks_count, ctx->pids_nr - 1);
| +
|  	return 0;
|  }
|  
| -int do_restart(struct cr_ctx *ctx, pid_t pid)
| +static int do_restart_root(struct cr_ctx *ctx, pid_t pid)
|  {
|  	int ret;
|  
| +	ret = cr_read_head(ctx);
| +	if (ret < 0)
| +		goto out;
| +	ret = cr_read_tree(ctx);
| +	if (ret < 0)
| +		goto out;
| +
|  	ret = cr_ctx_restart(ctx, pid);
|  	if (ret < 0)
|  		goto out;
| -	ret = cr_read_head(ctx);
| +
| +	/* wait for all other tasks to enter do_restart_task() */
| +	ret = cr_wait_all_tasks_start(ctx);
|  	if (ret < 0)
|  		goto out;
| +
|  	ret = cr_read_task(ctx);
|  	if (ret < 0)
|  		goto out;
| -	ret = cr_read_tail(ctx);
| +
| +	/* wait for all other tasks to complete do_restart_task() */
| +	ret = cr_wait_all_tasks_finish(ctx);
|  	if (ret < 0)
|  		goto out;
|  
| -	/* on success, adjust the return value if needed [TODO] */
| +	ret = cr_read_tail(ctx);
| +
|   out:
|  	return ret;
|  }
| +
| +int do_restart(struct cr_ctx *ctx, pid_t pid)
| +{
| +	int ret;
| +
| +	if (ctx)
| +		ret = do_restart_root(ctx, pid);
| +	else
| +		ret = do_restart_task(pid);
| +
| +	/* on success, adjust the return value if needed [TODO] */
| +	return ret;
| +}
| diff --git a/checkpoint/sys.c b/checkpoint/sys.c
| index 8630144..3a925ae 100644
| --- a/checkpoint/sys.c
| +++ b/checkpoint/sys.c
| @@ -167,6 +167,8 @@ static void cr_task_arr_free(struct cr_ctx *ctx)
|  
|  static void cr_ctx_free(struct cr_ctx *ctx)
|  {
| +	BUG_ON(atomic_read(&ctx->refcount));
| +
|  	if (ctx->file)
|  		fput(ctx->file);
|  
| @@ -185,6 +187,8 @@ static void cr_ctx_free(struct cr_ctx *ctx)
|  	if (ctx->root_task)
|  		put_task_struct(ctx->root_task);
|  
| +	kfree(ctx->pids_arr);
| +
|  	kfree(ctx);
|  }
|  
| @@ -199,8 +203,10 @@ static struct cr_ctx *cr_ctx_alloc(int fd, unsigned long flags)
|  
|  	ctx->flags = flags;
|  
| +	atomic_set(&ctx->refcount, 0);
|  	INIT_LIST_HEAD(&ctx->pgarr_list);
|  	INIT_LIST_HEAD(&ctx->pgarr_pool);
| +	init_waitqueue_head(&ctx->waitq);
|  
|  	err = -EBADF;
|  	ctx->file = fget(fd);
| @@ -215,6 +221,7 @@ static struct cr_ctx *cr_ctx_alloc(int fd, unsigned long flags)
|  	if (cr_objhash_alloc(ctx) < 0)
|  		goto err;
|  
| +	atomic_inc(&ctx->refcount);
|  	return ctx;
|  
|   err:
| @@ -222,6 +229,17 @@ static struct cr_ctx *cr_ctx_alloc(int fd, unsigned long flags)
|  	return ERR_PTR(err);
|  }
|  
| +void cr_ctx_get(struct cr_ctx *ctx)
| +{
| +	atomic_inc(&ctx->refcount);
| +}
| +
| +void cr_ctx_put(struct cr_ctx *ctx)
| +{
| +	if (ctx && atomic_dec_and_test(&ctx->refcount))
| +		cr_ctx_free(ctx);
| +}
| +
|  /**
|   * sys_checkpoint - checkpoint a container
|   * @pid: pid of the container init(1) process
| @@ -251,7 +269,7 @@ asmlinkage long sys_checkpoint(pid_t pid, int fd, unsigned long flags)
|  	if (!ret)
|  		ret = ctx->crid;
|  
| -	cr_ctx_free(ctx);
| +	cr_ctx_put(ctx);
|  	return ret;
|  }
|  
| @@ -266,7 +284,7 @@ asmlinkage long sys_checkpoint(pid_t pid, int fd, unsigned long flags)
|   */
|  asmlinkage long sys_restart(int crid, int fd, unsigned long flags)
|  {
| -	struct cr_ctx *ctx;
| +	struct cr_ctx *ctx = NULL;
|  	pid_t pid;
|  	int ret;
|  
| @@ -274,15 +292,17 @@ asmlinkage long sys_restart(int crid, int fd, unsigned long flags)
|  	if (flags)
|  		return -EINVAL;
|  
| -	ctx = cr_ctx_alloc(fd, flags | CR_CTX_RSTR);
| -	if (IS_ERR(ctx))
| -		return PTR_ERR(ctx);
| -
|  	/* FIXME: for now, we use 'crid' as a pid */
|  	pid = (pid_t) crid;
|  
| +	if (pid == task_pid_vnr(current))
| +		ctx = cr_ctx_alloc(fd, flags | CR_CTX_RSTR);
| +
| +	if (IS_ERR(ctx))
| +		return PTR_ERR(ctx);
| +
|  	ret = do_restart(ctx, pid);
|  
| -	cr_ctx_free(ctx);
| +	cr_ctx_put(ctx);
|  	return ret;
|  }
| diff --git a/include/linux/checkpoint.h b/include/linux/checkpoint.h
| index c946320..cede30e 100644
| --- a/include/linux/checkpoint.h
| +++ b/include/linux/checkpoint.h
| @@ -12,8 +12,11 @@
|  
|  #include <linux/path.h>
|  #include <linux/fs.h>
| +#include <linux/path.h>
| +#include <linux/sched.h>
| +#include <asm/atomic.h>
|  
| -#define CR_VERSION  2
| +#define CR_VERSION  3
|  
|  struct cr_ctx {
|  	int crid;		/* unique checkpoint id */
| @@ -31,8 +34,7 @@ struct cr_ctx {
|  	void *hbuf;		/* temporary buffer for headers */
|  	int hpos;		/* position in headers buffer */
|  
| -	struct task_struct **tasks_arr;	/* array of all tasks in container */
| -	int tasks_nr;			/* size of tasks array */
| +	atomic_t refcount;
|  
|  	struct cr_objhash *objhash;	/* hash for shared objects */
|  
| @@ -40,6 +42,19 @@ struct cr_ctx {
|  	struct list_head pgarr_pool;	/* pool of empty page arrays chain */
|  
|  	struct path fs_mnt;	/* container root (FIXME) */
| +
| +	/* [multi-process checkpoint] */
| +	struct task_struct **tasks_arr; /* array of all tasks [checkpoint] */
| +	int tasks_nr;                   /* size of tasks array */
| +
| +	/* [multi-process restart] */
| +	struct cr_hdr_pids *pids_arr;	/* array of all pids [restart] */
| +	int pids_nr;			/* size of pids array */

Nit: Since we already have a pid_nr() that refers to something different,
can we call this 'nr_pids' (and nr_tasks above)  like mm_context->nr_threads ?
Of course, there is no convention, so its easy to argue the other way.

Secondly, isn't pids_nr same as tasks_nr ? If so do we need both ?

Or is this intended to address the issue of multiple pid_nr values that a
task in a nested container can have ? If so, pids_nr is > tasks_nr and that
brings up two comments :-)

First, mktree.c and cr_next_task() are using 'ctx->pids_nr' to determine how
many tasks to start. If we are talking about nested containers, pids_nr
will be greater than tasks_nr so, mktree and cr_next_task() should be
use 'ctx->tasks_nr' to determine how many tasks to create. Also if
checkpointing a nested container we should view the multiple nested pid
values a process as an attribute of the task and maybe save them in
cr_write_task() rather than in cr_write_tree().

My second comment is more an orthogonal question. Suppose init_pid_ns = level
 0 and we have a container that is nested at level 3.  If we checkpoint just
this container, we would want to be able to restore this container at any level
 > 0 right ?

| +	int pids_pos;			/* position pids array */
| +	pid_t pids_active;		/* pid of (next) active task */

Do we need both pids_pos and pids_active in the ctx ? Can pids_active
just be a local variable in cr_next_task() and cr_wait_task() ?
IOW, isn't this always true

	pids_arr[pids_pos] == pids_active

| +	atomic_t tasks_count;		/* sync of tasks: used to coordinate */

Name is a bit confusing with 'tasks_nr', but the comment helps and I can't
think of a better name.

| +	struct completion complete;	/* container root and other tasks on */
| +	wait_queue_head_t waitq;	/* start, end, and restart ordering */
|  };

Sukadev
_______________________________________________
Containers mailing list
Containers@xxxxxxxxxxxxxxxxxxxxxxxxxx
https://lists.linux-foundation.org/mailman/listinfo/containers