On 03/01/2011 10:31 AM, Louis Rilling wrote: > On 28/02/11 17:10 -0500, Oren Laadan wrote: >> >> >> On 02/28/2011 10:09 AM, Louis Rilling wrote: >>> On 28/02/11 9:43 -0500, Oren Laadan wrote: >>>> >>>> >>>> On 02/26/2011 08:54 AM, Louis Rilling wrote: >>>>> On Fri, Feb 25, 2011 at 11:01:32AM -0800, Sukadev Bhattiprolu wrote: >>>>>> Louis Rilling [Louis.Rilling@xxxxxxxxxxx] wrote: >>>>>> | On 24/02/11 23:55 -0800, Sukadev Bhattiprolu wrote: >>>>>> | > >>>>>> | > diff --git a/kernel/checkpoint/restart.c b/kernel/checkpoint/restart.c >>>>>> | > index b0ea8ec..8ecc052 100644 >>>>>> | > --- a/kernel/checkpoint/restart.c >>>>>> | > +++ b/kernel/checkpoint/restart.c >>>>>> | > @@ -972,6 +972,7 @@ static int do_ghost_task(void) >>>>>> | > if (ret < 0) >>>>>> | > ckpt_err(ctx, ret, "ghost restart failed\n"); >>>>>> | > >>>>>> | > + current->exit_signal = -1; >>>>>> | >>>>>> | Setting ->exit_signal outside of tasklist_lock makes me nervous. All other >>>>>> | places that change ->exit_signal hold write_lock_irq(&tasklist_lock), and >>>>>> | eligibile_child() (for an instance of a reader being another task) holds >>>>>> | read_lock(&tasklist_lock). But maybe this does not matter for ghost tasks. >>>>>> | >>>>>> >>>>>> Yes, an earlier version had the write_lock(&tasklist_lock). Will add it >>>>>> back. >>>>>> >>>>>> | > restore_debug_exit(ctx); >>>>>> | > ckpt_ctx_put(ctx); >>>>>> | > do_exit(0); >>>>>> | > @@ -1465,7 +1466,22 @@ void exit_checkpoint(struct task_struct *tsk) >>>>>> | > /* restarting zombies will activate next task in restart */ >>>>>> | > if (tsk->flags & PF_RESTARTING) { >>>>>> | > BUG_ON(ctx->active_pid == -1); >>>>>> | > + >>>>>> | > + /* >>>>>> | > + * if we are a "ghost" task, that was terminated by the >>>>>> | > + * container-init (from zap_pid_ns_processes()), we should >>>>>> | > + * wake up the parent since we are now a detached process. >>>>>> | > + */ >>>>>> | > + read_lock_irq(&tasklist_lock); >>>>>> | >>>>>> | read_lock() is enough. tasklist_lock is never taken for write from IRQs or >>>>>> | softIRQs. >>>>>> | >>>>>> | > + if (tsk->exit_state == EXIT_DEAD && !tsk->parent->exit_state) { >>>>>> | > + ckpt_debug("[%d, %s]: exit_checkpoint(): notifying " >>>>>> | > + "parent\n", tsk->pid, tsk->comm); >>>>>> | > + __wake_up_parent(tsk, tsk->parent); >>>>>> | > + } >>>>>> | > + read_unlock_irq(&tasklist_lock); >>>>>> | >>>>>> | Looking at this closer, I wonder if this wakeup logic should be called from >>>>>> | do_ghost_task(), right after setting ->exit_signal. This way there would be no >>>>>> | need for a tricky condition to recognize ghost tasks, and (I think) this is >>>>>> | closer to the other cases changing ->exit_signal (reparent_leader() and >>>>>> | exit_notify()). >>>>>> >>>>>> Yes, we tried the following in the earlier version. >>>>>> >>>>>> void ghost_auto_reapable() >>>>>> { >>>>>> write_lock(&tasklist_lock); >>>>>> current->exit_signal = -1; >>>>>> __wake_up_parent(current, current->parent); >>>>>> write_unlock(&tasklist_lock); >>>>>> } >>>>>> >>>>>> And called this from do_ghost_task(). But with this, the parent could >>>>>> wake up, find that it still has an eligible child (the ghost) to wait >>>>>> for, and go back to waiting before the ghost enters the EXIT_DEAD state. >>>>>> And so we would lose the wake up. >>>>>> >>>>>> (zap_pid_ns_processes() passes the __WALL so the ghost would be considered >>>>>> an eligible child). >>>>> >>>>> I think I see now. The point is that ->exit_signal = -1 is only meant to work >>>>> correctly for sub-threads, which the parent does not need to wait for. IOW, the >>>>> notion of detached task is only implemented for sub-threads. >>>>> >>>>> IIUC, setting ->exit_signal to -1 is only used here to let exit_notify() set >>>>> ->exit_state to EXIT_DEAD, right? Otherwise, setting ->exit_signal to 0 and >>>>> letting do_notify_parent() proceed for ghost tasks would have be sufficient I >>>>> guess (provided that the confusion between ghost tasks and zombies could be >>>>> easily avoided in do_notify_parent()). >>>>> >>>>> Then I agree that the proposed patch looks like a reasonably simple approach. >>>>> >>>>> Thanks for the explanation, >>>>> >>>> >>>> Louis, Suka: >>>> >>>> One subtlety with the method is that if a process get reparented >>>> (for whatever reason) then the ->exit_signal field is reset to >>>> SIGCHLD. Fortunately, that should not affect us because our ghost >>>> tasks never become orphaned. >>> >>> AFAICS, this is not true for detached tasks: neither in reparent_leader(), nor >>> in exit_notify(). Unless I missed another place? >> >> Doh.. you are totally right, I missed that. >> Sometime it feels really good to be wrong :) > > ;) > >> >>> >>>> >>>> However, I can't avoid thinking that maybe there is a better way >>>> to do this altogether ? >>>> >>>> The requirement is simple: ghost tasks are temporary tasks whose >>>> role is to keep certain pid's alive for the duration of the restart >>>> and then exit without a trace before the restarted tasks resume >>>> execution. >>>> >>>> The reason I opted for the ->exit_signal = -1 is because it makes >>>> sure that the parent need not explicitly collect the child ghost. >>>> >>>> If I had set it to 0, then it would not send a signal, but still >>>> would require a wait() to collect it (right ?). >>> >>> Right. >>> >>>> >>>> Can you think of a way to achieve this functionality without the >>>> subtleties that we have observed so far ? even at the cost of a >>>> minor change to, say, wait() logic or what not ? >>> >>> I wonder if things would be easier by providing a mean to distinguish ghost >>> tasks from truely restarted tasks in do_notify_parent(). >>> >>> Assuming this, there might be a kernel-patch-free way, although I can't say if >>> this fits userspace restart constraints: >>> >>> Setting the signal handler of ghost tasks' parent to SIG_IGN, and ->exit_signal >> >> Yes, I thought about it. But those parent are part of the restart, >> and changing their signal handlers will is tricky and will require >> complex sync logic at the end of restart to ensure that all tasks >> restore their handlers after the ghosts are gone and released, but >> before any task gets to userspace ... I don't want to go there. > > Yes, I can imagine well the added complexity in userspace. > >> >>> of ghost tasks to SIGCHLD, will make them autoreap. This requires that the >>> parent does not need to synchronize with other children until all ghost tasks >>> have exited, and that the parent remains alive too. >>> >>> So, why not adding some flag PF_RESTART_GHOST or TIF_RESTART_GHOST? >> >> Actually, we already have a way to distinguish them: >> >> if ((tsk->flags & PF_RESTARTING) && task_detached(tsk)) >> >> One problem with this is that we only set exit_signal to -1 after >> the ghost was born, so if the parent is already waiting I think >> that will be racy, as per suka's comment above. > > Yes. > >> >> So looking at the code again, we could add one condition in exit.c >> at wait_consider_task(), after the test of p->exit_state == EXIT_DEAD, >> to also test: >> >> inline static bool is_ghost_task(p) >> { >> return (p->flags & (PF_EXITING|PF_RESTARTING) == >> PF_EXITING|PF_RESTARTING) && task_detached(p) >> } >> >> if (p->flags & is_ghost_task(p)) >> return 0; >> >> Or something along the lines (e.g. used EXIT_ZOMBIE comparison instead >> of PF_EXITING). While requiring a kernel patch, it is relatively short, >> clean and easy to review. > > Adding the check in do_wait() is racy IMHO. What does prevent do_ghost_task() > from setting ->exit_signal = -1 after the parent sleeps in do_wait()? If we do it per your suggestion (and I also had suggested the same originally): >>>>>> void ghost_auto_reapable() >>>>>> { >>>>>> write_lock(&tasklist_lock); >>>>>> current->exit_signal = -1; >>>>>> __wake_up_parent(current, current->parent); >>>>>> write_unlock(&tasklist_lock); >>>>>> } then it should be safe, no ? > > Otherwise, why not making ghost tasks simply sub-threads of the "parent"? They > would autoreap and nobody would wait or have to wait for them. I would feel so > much better if we had not to change ->exit_signal in do_ghost_task()... There are two types of ghosts: 1) A placeholder needed to ensure that some processes are orphaned correctly (while retaining their sid) 2) A placeholder needed to ensure that a certain pid exists during the restart (and can be, e.g. used by other processes as pgid). in both cases, a thread of the parent won't do what I need: in the first case, the child will be re-parented to another thread (leader?). And in the second, the pgid will be be usable as a pgid. But please keep throwing suggestions, we're making progress here... Oren. _______________________________________________ Containers mailing list Containers@xxxxxxxxxxxxxxxxxxxxxxxxxx https://lists.linux-foundation.org/mailman/listinfo/containers