On 1/17/24 17:38, Oleg Nesterov wrote: > On 01/17, Bernd Edlinger wrote: >> Yes. but the tracer has to do its job, and that is ptrace_attach the >> remaining treads, it does not know that it would avoid a dead-lock >> when it calls wait(), instead of ptrace_attach. It does not know >> that the tracee has just called execve in one of the not yet traced >> threads. > > Hmm. I don't understand you. Certainly I am willing to rephrase this until it is understandable. Probably I have just not yet found the proper way to describe the issue here, and your help in resolving that documentation issue is very important to me. > > I agree we have a problem which should be fixed. Just the changelog > looks confusing to me, imo it doesn't explain the race/problem clearly. > I am trying here to summarize what the test case "attach" in ./tools/testing/selftests/ptrace/vmaccess.c does. I think it models the use case of a tracer that is trying to attach to a multi-threaded process that is executing execve in a not-yet traced thread while a different sub-thread is already traced, it is not relevant that the test case uses PTRACE_TRACEME, to make the sub-thead traced, the same would happen if the tracer uses some out-of-band mechanism like /proc/pid/task to learn the thread_id of the sub-threads and uses ptrace_attach to each of them. The test case hits the dead-lock because there is a race condition between before the PTRACE_ATTACH, and it cannot know that the exit event from the sub-thread is already pending before the PTRACE_ATTACH. Of course a real tracer will not sleep a whole second before a PTRACE_ATTACH, but even if it does a waitpid(-1, &s, WNOHANG) immediately before the PTRACE_ATTACH there is a tiny chance that the execve is entered just immediately after waitpid has indicated that there is currently not event pending. >>>> + if (unlikely(t->ptrace) >>>> + && (t != tsk->group_leader || !t->exit_state)) >>>> + unsafe_execve_in_progress = true; >>> >>> The !t->exit_state is not right... This sub-thread can already be a zombie >>> with ->exit_state != 0 but see above, it won't be reaped until the debugger >>> does wait(). >>> >> >> I dont think so. >> de_thread() handles the group_leader different than normal threads. > > I don't follow... > > I didn't say that t is a group leader. I said it can be a zombie sub-thread > with ->exit_state != 0. > the condition here is (t != tsk->group_leader || !t->exit_state) so in other words, if t is a sub-thread, i.e. t != tsk->group_leader then the t->exit_state does not count, and the deadlock is possible. But if t it is a group leader, then t == tsk->group_leader, but a deadlock is only possible when t->exit_state == 0 at this time. The most likely reason for this is PTRACE_O_TRACEEXIT. I will add a new test case that demonstrates this in the next iteration of this patch. Here is a preview of what I have right now: /* * Same test as previous, except that * the group leader is ptraced first, * but this time with PTRACE_O_TRACEEXIT, * and the thread that does execve is * not yet ptraced. This exercises the * code block in de_thread where the * if (!thread_group_leader(tsk)) { * is executed and enters a wait state. */ static long thread2_tid; static void *thread2(void *arg) { thread2_tid = syscall(__NR_gettid); sleep(2); execlp("false", "false", NULL); return NULL; } TEST(attach2) { int s, k, pid = fork(); if (!pid) { pthread_t pt; pthread_create(&pt, NULL, thread2, NULL); pthread_join(pt, NULL); return; } sleep(1); k = ptrace(PTRACE_ATTACH, pid, 0L, 0L); ASSERT_EQ(k, 0); k = waitpid(-1, &s, 0); ASSERT_EQ(k, pid); ASSERT_EQ(WIFSTOPPED(s), 1); ASSERT_EQ(WSTOPSIG(s), SIGSTOP); k = ptrace(PTRACE_SETOPTIONS, pid, 0L, PTRACE_O_TRACEEXIT); ASSERT_EQ(k, 0); thread2_tid = ptrace(PTRACE_PEEKDATA, pid, &thread2_tid, 0L); ASSERT_NE(thread2_tid, -1); ASSERT_NE(thread2_tid, 0); ASSERT_NE(thread2_tid, pid); k = waitpid(-1, &s, WNOHANG); ASSERT_EQ(k, 0); sleep(2); /* deadlock may happen here */ k = ptrace(PTRACE_ATTACH, thread2_tid, 0L, 0L); ASSERT_EQ(k, 0); k = waitpid(-1, &s, WNOHANG); ASSERT_EQ(k, pid); ASSERT_EQ(WIFSTOPPED(s), 1); ASSERT_EQ(WSTOPSIG(s), SIGTRAP); k = waitpid(-1, &s, WNOHANG); ASSERT_EQ(k, 0); k = ptrace(PTRACE_CONT, pid, 0L, 0L); ASSERT_EQ(k, 0); k = waitpid(-1, &s, 0); ASSERT_EQ(k, pid); ASSERT_EQ(WIFSTOPPED(s), 1); ASSERT_EQ(WSTOPSIG(s), SIGTRAP); k = waitpid(-1, &s, WNOHANG); ASSERT_EQ(k, 0); k = ptrace(PTRACE_CONT, pid, 0L, 0L); ASSERT_EQ(k, 0); k = waitpid(-1, &s, 0); ASSERT_EQ(k, pid); ASSERT_EQ(WIFSTOPPED(s), 1); ASSERT_EQ(WSTOPSIG(s), SIGSTOP); k = waitpid(-1, &s, WNOHANG); ASSERT_EQ(k, 0); k = ptrace(PTRACE_CONT, pid, 0L, 0L); ASSERT_EQ(k, 0); k = waitpid(-1, &s, 0); ASSERT_EQ(k, pid); ASSERT_EQ(WIFEXITED(s), 1); ASSERT_EQ(WEXITSTATUS(s), 1); k = waitpid(-1, NULL, 0); ASSERT_EQ(k, -1); ASSERT_EQ(errno, ECHILD); } So the traced process does the execve in the sub-thread, and the tracer attaches the thread leader first, and when the next PTRACE_ATTACH happens, the thread leader is stopped because of the PTRACE_O_TACEEXIT. So at that time, the t->exit_state == 0, and we receive the following: k = waitpid(-1, &s, WNOHANG); ASSERT_EQ(k, pid); ASSERT_EQ(WIFSTOPPED(s), 1); ASSERT_EQ(WSTOPSIG(s), SIGTRAP); yet the de_thread is not finished now, but only when k = ptrace(PTRACE_CONT, pid, 0L, 0L); ASSERT_EQ(k, 0); happens. The only thing, that is admittedly pretty confusing here, is the fact that the thread2_tid morphs into group leader's pid, at this time, and is therefore never heard of again. So pid refers to the former thread2_tid from now on, and the former group leader does not enter the usual zombie state here, because the sub-thread takes over it's role. >>>> + if (unlikely(unsafe_execve_in_progress)) { >>>> + spin_unlock_irq(lock); >>>> + sig->exec_bprm = bprm; >>>> + mutex_unlock(&sig->cred_guard_mutex); >>>> + spin_lock_irq(lock); >>> >>> I don't understand why do we need to unlock and lock siglock here... >> >> That is just a precaution because I did want to release the >> mutexes exactly in the reverse order as they were acquired. > > To me this adds the unnecessary complication. > Well, the proposed change creates this sequence mutex_lock(&sig_cred_guard_mutex); spin_lock_irq(lock); mutex_unlock(&sig_cred_guard_mutex); spin_unlock_irq(lock); I wanted to avoid that, because in a usual real-time os, I'd expect the mutex_unlock to schedule another waiting task, regardless of the spin lock state. Are you saying that doing this is safe to do in linux, because the scheduling does not happen except when explicitly asked for e.g. by calling schedule() ? And would that also be safe for real time linux port ? >>> But my main question is why do we need the unsafe_execve_in_progress boolean. >>> If this patch is correct and de_thread() can drop and re-acquire cread_guard_mutex >>> when one of the threads is traced, then why can't we do this unconditionally ? >>> >> >> I just wanted to keep the impact of the change as small as possible, > > But the unsafe_execve_in_progress logic increases the impact and complicates > the patch. > > I think the fix should be as simple as possible. (to be honest, right now > I don't think this is a right approach). > The main concern was when a set-suid program is executed by execve. Then it makes a difference if the current thread is traced before the execve or not. That means if the current thread is already traced, the decision, which credentials will be used is different than otherwise. So currently there are two possbilities, either the trace happens before the execve, and the suid-bit will be ignored, or the trace happens after the execve, but it is checked that the now potentially more privileged credentials allow the tracer to proceed. With this patch we will have a third prossibility, that is in order to avoid the possible dead-lock we allow the suid-bit to take effect, but only if the tracer's privileges allow both to attach the current credentials and the new credentials. But I would only do that as a last resort, to avoid the possible dead-lock, and not unless a dead-lock is really expected to happen. Thanks Bernd.