On 11/04, Oleg Nesterov wrote: > > On 11/04, Oleg Nesterov wrote: > > > > On 11/04, Eric W. Biederman wrote: > > > > > > The following mostly correct patch modifies zap_other_threads in > > > the case of a de_thread to not wait for zombies to be reaped. The only > > > case that cares is ptrace (as threads are self reaping). So I don't > > > think this will cause any problems except removing the strace -f race. > > > > From my previous email: > > > > So the only plan I currently have is change de_thread() to wait until > > other threads pass exit_notify() or even exit_signals(), but I don't > > like this. > > > > And yes, I don't like this, but perhaps this is what we should do. > > > > The patch is incomplete and racy (afaics), and the SIGNAL_GROUP_EXIT > > checks doesn't look right, but off course technically this change should > > be simple enough. > > > > But not that simple. Just for example, the exiting sub-threads should > > not run with ->group_leader pointing to nowhere, in case it was reaped > > by de_thread. > > Not to mention other potential problems outside of ptrace/exec. For example > userns_install() can fail after mt-exec even without ptrace, simply because > thread_group_empty() can be false. Sure, easy to fix, and probably _install() > should use signal->live anyway, but still. > > And I didn't mention the fun with sighand unsharing. We simply can't do this > until all sub-threads go away. IOW, your patch breaks the usage of ->siglock. > The execing thread and the zombie threads will use different locks to, say, > remove the task from thread-group. Again, this is fixable, but not that > simple. > > > And we have another problem with PTRACE_EVENT_EXIT which can lead to the > > same deadlock. Unfortunately, the semantics of PTRACE_EVENT_EXIT was never > > defined. But this change will add the user-visible change. > > > > And if we add the user-visible changes, then perhaps we could simply untrace > > the traced sub-threads on exec. This change is simple, we do not even need > > to touch exec/de_thread, we could just change exit_notify() to ignore ->ptrace > > if exec is in progress. But I'm afraid we can't do this. So I was thinking about something like below. Untested, probably buggy/incomplete too, but hopefully can work. flush_old_exec() calls the new kill_sub_threads() helper which waits until all the sub-threads pass exit_notify(). de_thread() is called after install_exec_creds(), it is simplified and waits for thread_group_empty() without cred_guard_mutex. But again, I do not really like this, and we need to do something with PTRACE_EVENT_EXIT anyway, this needs another/separate change. User-visible. And I disagree that this has nothing to do with cred_guard_mutex. And in any case we should narrow its scope in do_execve() path. Why do we take it so early? Why do we need to do, say, copy_strings() with this lock held? The original motivation for this has gone, acct_arg_size() can work just fine even if multiple threads call sys_execve(). I'll try to discuss the possible changes in LSM hooks with Jann, I still think that this is what we actually need to do. At least try to do, possibly this is too complicated. Oleg. fs/binfmt_elf.c | 6 ++- fs/exec.c | 108 ++++++++++++++++++++++++------------------------ include/linux/binfmts.h | 1 + kernel/exit.c | 5 ++- kernel/signal.c | 3 +- 5 files changed, 65 insertions(+), 58 deletions(-) --- a/fs/binfmt_elf.c +++ b/fs/binfmt_elf.c @@ -855,13 +855,17 @@ static int load_elf_binary(struct linux_binprm *bprm) setup_new_exec(bprm); install_exec_creds(bprm); + retval = de_thread(current); + if (retval) + goto out_free_dentry; + /* Do this so that we can load the interpreter, if need be. We will change some of these later */ retval = setup_arg_pages(bprm, randomize_stack_top(STACK_TOP), executable_stack); if (retval < 0) goto out_free_dentry; - + current->mm->start_stack = bprm->p; /* Now we do a little grungy work by mmapping the ELF image into diff --git a/fs/exec.c b/fs/exec.c index 4e497b9..7246b9f 100644 --- a/fs/exec.c +++ b/fs/exec.c @@ -1036,13 +1036,59 @@ static int exec_mmap(struct mm_struct *mm) return 0; } +static int wait_for_notify_count(struct task_struct *tsk, struct signal_struct *sig) +{ + for (;;) { + if (unlikely(__fatal_signal_pending(tsk))) + goto killed; + set_current_state(TASK_KILLABLE); + if (!sig->notify_count) + break; + schedule(); + } + __set_current_state(TASK_RUNNING); + return 0; + +killed: + /* protects against exit_notify() and __exit_signal() */ + read_lock(&tasklist_lock); + sig->group_exit_task = NULL; + sig->notify_count = 0; + read_unlock(&tasklist_lock); + return -EINTR; +} + +static int kill_sub_threads(struct task_struct *tsk) +{ + struct signal_struct *sig = tsk->signal; + int err = -EINTR; + + if (thread_group_empty(tsk)) + return 0; + + read_lock(&tasklist_lock); + spin_lock_irq(&tsk->sighand->siglock); + if (!signal_group_exit(sig)) { + sig->group_exit_task = tsk; + sig->notify_count = -zap_other_threads(tsk); + err = 0; + } + spin_unlock_irq(&tsk->sighand->siglock); + read_unlock(&tasklist_lock); + + if (!err) + err = wait_for_notify_count(tsk, sig); + return err; + +} + /* * This function makes sure the current process has its own signal table, * so that flush_signal_handlers can later reset the handlers without * disturbing other processes. (Other processes might share the signal * table via the CLONE_SIGHAND option to clone().) */ -static int de_thread(struct task_struct *tsk) +int de_thread(struct task_struct *tsk) { struct signal_struct *sig = tsk->signal; struct sighand_struct *oldsighand = tsk->sighand; @@ -1051,34 +1097,15 @@ static int de_thread(struct task_struct *tsk) if (thread_group_empty(tsk)) goto no_thread_group; - /* - * Kill all other threads in the thread group. - */ spin_lock_irq(lock); - if (signal_group_exit(sig)) { - /* - * Another group action in progress, just - * return so that the signal is processed. - */ - spin_unlock_irq(lock); - return -EAGAIN; - } - - sig->group_exit_task = tsk; - sig->notify_count = zap_other_threads(tsk); + sig->notify_count = sig->nr_threads; if (!thread_group_leader(tsk)) sig->notify_count--; - - while (sig->notify_count) { - __set_current_state(TASK_KILLABLE); - spin_unlock_irq(lock); - schedule(); - if (unlikely(__fatal_signal_pending(tsk))) - goto killed; - spin_lock_irq(lock); - } spin_unlock_irq(lock); + if (wait_for_notify_count(tsk, sig)) + return -EINTR; + /* * At this point all other threads have exited, all we have to * do is to wait for the thread group leader to become inactive, @@ -1087,24 +1114,8 @@ static int de_thread(struct task_struct *tsk) if (!thread_group_leader(tsk)) { struct task_struct *leader = tsk->group_leader; - for (;;) { - threadgroup_change_begin(tsk); - write_lock_irq(&tasklist_lock); - /* - * Do this under tasklist_lock to ensure that - * exit_notify() can't miss ->group_exit_task - */ - sig->notify_count = -1; - if (likely(leader->exit_state)) - break; - __set_current_state(TASK_KILLABLE); - write_unlock_irq(&tasklist_lock); - threadgroup_change_end(tsk); - schedule(); - if (unlikely(__fatal_signal_pending(tsk))) - goto killed; - } - + threadgroup_change_begin(tsk); + write_lock_irq(&tasklist_lock); /* * The only record we have of the real-time age of a * process, regardless of execs it's done, is start_time. @@ -1162,10 +1173,9 @@ static int de_thread(struct task_struct *tsk) release_task(leader); } +no_thread_group: sig->group_exit_task = NULL; sig->notify_count = 0; - -no_thread_group: /* we have changed execution domain */ tsk->exit_signal = SIGCHLD; @@ -1197,14 +1207,6 @@ static int de_thread(struct task_struct *tsk) BUG_ON(!thread_group_leader(tsk)); return 0; - -killed: - /* protects against exit_notify() and __exit_signal() */ - read_lock(&tasklist_lock); - sig->group_exit_task = NULL; - sig->notify_count = 0; - read_unlock(&tasklist_lock); - return -EAGAIN; } char *get_task_comm(char *buf, struct task_struct *tsk) @@ -1239,7 +1241,7 @@ int flush_old_exec(struct linux_binprm * bprm) * Make sure we have a private signal table and that * we are unassociated from the previous thread group. */ - retval = de_thread(current); + retval = kill_sub_threads(current); if (retval) goto out; --- a/include/linux/binfmts.h +++ b/include/linux/binfmts.h @@ -101,6 +101,7 @@ extern int __must_check remove_arg_zero(struct linux_binprm *); extern int search_binary_handler(struct linux_binprm *); extern int flush_old_exec(struct linux_binprm * bprm); extern void setup_new_exec(struct linux_binprm * bprm); +extern int de_thread(struct task_struct *tsk); extern void would_dump(struct linux_binprm *, struct file *); extern int suid_dumpable; diff --git a/kernel/exit.c b/kernel/exit.c index 9d68c45..f3dd46d 100644 --- a/kernel/exit.c +++ b/kernel/exit.c @@ -690,8 +690,9 @@ static void exit_notify(struct task_struct *tsk, int group_dead) if (tsk->exit_state == EXIT_DEAD) list_add(&tsk->ptrace_entry, &dead); - /* mt-exec, de_thread() is waiting for group leader */ - if (unlikely(tsk->signal->notify_count < 0)) + /* mt-exec, kill_sub_threads() is waiting for group exit */ + if (unlikely(tsk->signal->notify_count < 0) && + !++tsk->signal->notify_count) wake_up_process(tsk->signal->group_exit_task); write_unlock_irq(&tasklist_lock); --- a/kernel/signal.c +++ b/kernel/signal.c @@ -1194,13 +1194,12 @@ int zap_other_threads(struct task_struct *p) while_each_thread(p, t) { task_clear_jobctl_pending(t, JOBCTL_PENDING_MASK); - count++; - /* Don't bother with already dead threads */ if (t->exit_state) continue; sigaddset(&t->pending.signal, SIGKILL); signal_wake_up(t, 1); + count++; } return count; -- To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in the body of a message to majordomo@xxxxxxxxxxxxxxx More majordomo info at http://vger.kernel.org/majordomo-info.html