On 03/03/11 10:38 -0500, Oren Laadan wrote: > > > On 03/01/2011 10:31 AM, Louis Rilling wrote: > > On 28/02/11 17:10 -0500, Oren Laadan wrote: > >> > >> > >> On 02/28/2011 10:09 AM, Louis Rilling wrote: > >>> On 28/02/11 9:43 -0500, Oren Laadan wrote: > >>>> > >>>> > >>>> On 02/26/2011 08:54 AM, Louis Rilling wrote: > >>>>> On Fri, Feb 25, 2011 at 11:01:32AM -0800, Sukadev Bhattiprolu wrote: > >>>>>> Louis Rilling [Louis.Rilling@xxxxxxxxxxx] wrote: > >>>>>> | On 24/02/11 23:55 -0800, Sukadev Bhattiprolu wrote: > >>>>>> | > > >>>>>> | > diff --git a/kernel/checkpoint/restart.c b/kernel/checkpoint/restart.c > >>>>>> | > index b0ea8ec..8ecc052 100644 > >>>>>> | > --- a/kernel/checkpoint/restart.c > >>>>>> | > +++ b/kernel/checkpoint/restart.c > >>>>>> | > @@ -972,6 +972,7 @@ static int do_ghost_task(void) > >>>>>> | > if (ret < 0) > >>>>>> | > ckpt_err(ctx, ret, "ghost restart failed\n"); > >>>>>> | > > >>>>>> | > + current->exit_signal = -1; > >>>>>> | > >>>>>> | Setting ->exit_signal outside of tasklist_lock makes me nervous. All other > >>>>>> | places that change ->exit_signal hold write_lock_irq(&tasklist_lock), and > >>>>>> | eligibile_child() (for an instance of a reader being another task) holds > >>>>>> | read_lock(&tasklist_lock). But maybe this does not matter for ghost tasks. > >>>>>> | > >>>>>> > >>>>>> Yes, an earlier version had the write_lock(&tasklist_lock). Will add it > >>>>>> back. > >>>>>> > >>>>>> | > restore_debug_exit(ctx); > >>>>>> | > ckpt_ctx_put(ctx); > >>>>>> | > do_exit(0); > >>>>>> | > @@ -1465,7 +1466,22 @@ void exit_checkpoint(struct task_struct *tsk) > >>>>>> | > /* restarting zombies will activate next task in restart */ > >>>>>> | > if (tsk->flags & PF_RESTARTING) { > >>>>>> | > BUG_ON(ctx->active_pid == -1); > >>>>>> | > + > >>>>>> | > + /* > >>>>>> | > + * if we are a "ghost" task, that was terminated by the > >>>>>> | > + * container-init (from zap_pid_ns_processes()), we should > >>>>>> | > + * wake up the parent since we are now a detached process. > >>>>>> | > + */ > >>>>>> | > + read_lock_irq(&tasklist_lock); > >>>>>> | > >>>>>> | read_lock() is enough. tasklist_lock is never taken for write from IRQs or > >>>>>> | softIRQs. > >>>>>> | > >>>>>> | > + if (tsk->exit_state == EXIT_DEAD && !tsk->parent->exit_state) { > >>>>>> | > + ckpt_debug("[%d, %s]: exit_checkpoint(): notifying " > >>>>>> | > + "parent\n", tsk->pid, tsk->comm); > >>>>>> | > + __wake_up_parent(tsk, tsk->parent); > >>>>>> | > + } > >>>>>> | > + read_unlock_irq(&tasklist_lock); > >>>>>> | > >>>>>> | Looking at this closer, I wonder if this wakeup logic should be called from > >>>>>> | do_ghost_task(), right after setting ->exit_signal. This way there would be no > >>>>>> | need for a tricky condition to recognize ghost tasks, and (I think) this is > >>>>>> | closer to the other cases changing ->exit_signal (reparent_leader() and > >>>>>> | exit_notify()). > >>>>>> > >>>>>> Yes, we tried the following in the earlier version. > >>>>>> > >>>>>> void ghost_auto_reapable() > >>>>>> { > >>>>>> write_lock(&tasklist_lock); > >>>>>> current->exit_signal = -1; > >>>>>> __wake_up_parent(current, current->parent); > >>>>>> write_unlock(&tasklist_lock); > >>>>>> } > >>>>>> > >>>>>> And called this from do_ghost_task(). But with this, the parent could > >>>>>> wake up, find that it still has an eligible child (the ghost) to wait > >>>>>> for, and go back to waiting before the ghost enters the EXIT_DEAD state. > >>>>>> And so we would lose the wake up. > >>>>>> > >>>>>> (zap_pid_ns_processes() passes the __WALL so the ghost would be considered > >>>>>> an eligible child). > >>>>> > >>>>> I think I see now. The point is that ->exit_signal = -1 is only meant to work > >>>>> correctly for sub-threads, which the parent does not need to wait for. IOW, the > >>>>> notion of detached task is only implemented for sub-threads. > >>>>> > >>>>> IIUC, setting ->exit_signal to -1 is only used here to let exit_notify() set > >>>>> ->exit_state to EXIT_DEAD, right? Otherwise, setting ->exit_signal to 0 and > >>>>> letting do_notify_parent() proceed for ghost tasks would have be sufficient I > >>>>> guess (provided that the confusion between ghost tasks and zombies could be > >>>>> easily avoided in do_notify_parent()). > >>>>> > >>>>> Then I agree that the proposed patch looks like a reasonably simple approach. > >>>>> > >>>>> Thanks for the explanation, > >>>>> > >>>> > >>>> Louis, Suka: > >>>> > >>>> One subtlety with the method is that if a process get reparented > >>>> (for whatever reason) then the ->exit_signal field is reset to > >>>> SIGCHLD. Fortunately, that should not affect us because our ghost > >>>> tasks never become orphaned. > >>> > >>> AFAICS, this is not true for detached tasks: neither in reparent_leader(), nor > >>> in exit_notify(). Unless I missed another place? > >> > >> Doh.. you are totally right, I missed that. > >> Sometime it feels really good to be wrong :) > > > > ;) > > > >> > >>> > >>>> > >>>> However, I can't avoid thinking that maybe there is a better way > >>>> to do this altogether ? > >>>> > >>>> The requirement is simple: ghost tasks are temporary tasks whose > >>>> role is to keep certain pid's alive for the duration of the restart > >>>> and then exit without a trace before the restarted tasks resume > >>>> execution. > >>>> > >>>> The reason I opted for the ->exit_signal = -1 is because it makes > >>>> sure that the parent need not explicitly collect the child ghost. > >>>> > >>>> If I had set it to 0, then it would not send a signal, but still > >>>> would require a wait() to collect it (right ?). > >>> > >>> Right. > >>> > >>>> > >>>> Can you think of a way to achieve this functionality without the > >>>> subtleties that we have observed so far ? even at the cost of a > >>>> minor change to, say, wait() logic or what not ? > >>> > >>> I wonder if things would be easier by providing a mean to distinguish ghost > >>> tasks from truely restarted tasks in do_notify_parent(). > >>> > >>> Assuming this, there might be a kernel-patch-free way, although I can't say if > >>> this fits userspace restart constraints: > >>> > >>> Setting the signal handler of ghost tasks' parent to SIG_IGN, and ->exit_signal > >> > >> Yes, I thought about it. But those parent are part of the restart, > >> and changing their signal handlers will is tricky and will require > >> complex sync logic at the end of restart to ensure that all tasks > >> restore their handlers after the ghosts are gone and released, but > >> before any task gets to userspace ... I don't want to go there. > > > > Yes, I can imagine well the added complexity in userspace. > > > >> > >>> of ghost tasks to SIGCHLD, will make them autoreap. This requires that the > >>> parent does not need to synchronize with other children until all ghost tasks > >>> have exited, and that the parent remains alive too. > >>> > >>> So, why not adding some flag PF_RESTART_GHOST or TIF_RESTART_GHOST? > >> > >> Actually, we already have a way to distinguish them: > >> > >> if ((tsk->flags & PF_RESTARTING) && task_detached(tsk)) > >> > >> One problem with this is that we only set exit_signal to -1 after > >> the ghost was born, so if the parent is already waiting I think > >> that will be racy, as per suka's comment above. > > > > Yes. > > > >> > >> So looking at the code again, we could add one condition in exit.c > >> at wait_consider_task(), after the test of p->exit_state == EXIT_DEAD, > >> to also test: > >> > >> inline static bool is_ghost_task(p) > >> { > >> return (p->flags & (PF_EXITING|PF_RESTARTING) == > >> PF_EXITING|PF_RESTARTING) && task_detached(p) > >> } > >> > >> if (p->flags & is_ghost_task(p)) > >> return 0; > >> > >> Or something along the lines (e.g. used EXIT_ZOMBIE comparison instead > >> of PF_EXITING). While requiring a kernel patch, it is relatively short, > >> clean and easy to review. EXIT_ZOMBIE comparison would not optimize much imho, since p->flags must be checked anyway. Nit1: I don't think that checking p->flags saves anything before calling is_ghost_task(). Nit2: why would you like to check that PF_EXITING and PF_RESTARTING come together? Is it to make sure that no "real" restarted thread will be skipped this way? > > > > Adding the check in do_wait() is racy IMHO. What does prevent do_ghost_task() > > from setting ->exit_signal = -1 after the parent sleeps in do_wait()? > > If we do it per your suggestion (and I also had suggested the > same originally): > > >>>>>> void ghost_auto_reapable() > >>>>>> { > >>>>>> write_lock(&tasklist_lock); > >>>>>> current->exit_signal = -1; > >>>>>> __wake_up_parent(current, current->parent); > >>>>>> write_unlock(&tasklist_lock); > >>>>>> } > > then it should be safe, no ? s/write_lock/write_lock_irq/ and I'll call it safe :) > > > > > Otherwise, why not making ghost tasks simply sub-threads of the "parent"? They > > would autoreap and nobody would wait or have to wait for them. I would feel so > > much better if we had not to change ->exit_signal in do_ghost_task()... > > There are two types of ghosts: > > 1) A placeholder needed to ensure that some processes are orphaned > correctly (while retaining their sid) > > 2) A placeholder needed to ensure that a certain pid exists during > the restart (and can be, e.g. used by other processes as pgid). > > in both cases, a thread of the parent won't do what I need: in the > first case, the child will be re-parented to another thread (leader?). any other thread, possibly the leader. Right. > And in the second, the pgid will be be usable as a pgid. You probably meant that the pid would not be usable as a pgid (because not group leader pid, possible session mismatch, etc.). > > But please keep throwing suggestions, we're making progress here... > (Just had a closer look to how pgids are restored) Well, given that it's useful to have a consistent pgid/session for ghost tasks, and that it looks too difficult to rework the synchronization logic before reaping ghost tasks (referring to the ->exit_signal = 0 + parent sighandler = SIG_IGN suggestion), I'm falling short of cleaner suggestions :) The ghost_auto_reapable() thing above + wait_consider_task() change looks like the cleanest solution so far. My only fear is that people could be reluctant to adding this (small?) overhead to do_wait() "only" for cr. Thanks, Louis -- Dr Louis Rilling Kerlabs Skype: louis.rilling Batiment Germanium Phone: (+33|0) 6 80 89 08 23 80 avenue des Buttes de Coesmes http://www.kerlabs.com/ 35700 Rennes
Attachment:
signature.asc
Description: Digital signature
_______________________________________________ Containers mailing list Containers@xxxxxxxxxxxxxxxxxxxxxxxxxx https://lists.linux-foundation.org/mailman/listinfo/containers