Hi! There is that old deadlock of cred_guard_mutex that I originally heard about from Oleg, and that has come up on LKML sometimes since then; to recap, essentially, the problem is demonstrated by the following testcase: ======== #include <pthread.h> #include <stdlib.h> #include <sys/ptrace.h> #include <unistd.h> #include <signal.h> void *thread(void *arg) { ptrace(PTRACE_TRACEME, 0, 0, 0); return NULL; } int main(void) { int pid = fork(); if (!pid) { pthread_t pt; pthread_create(&pt, NULL, thread, NULL); pthread_join(pt, NULL); execlp("echo", "echo", "passed", NULL); } sleep(1); ptrace(PTRACE_ATTACH, pid, 0, 0); kill(pid, SIGCONT); return 0; } ======== The parent gets stuck in ptrace(), waiting for the cred_guard_mutex: ======== [<0>] ptrace_attach+0x97/0x250 [<0>] __x64_sys_ptrace+0xa2/0x100 [<0>] do_syscall_64+0x55/0x110 [<0>] entry_SYSCALL_64_after_hwframe+0x44/0xa9 ======== The child gets stuck in execve(), waiting in de_thread() for its other thread: ======== ? __schedule+0x2b7/0x880 schedule+0x28/0x80 flush_old_exec+0xc8/0x6c0 load_elf_binary+0x291/0x15d0 ? get_user_pages_remote+0x137/0x1f0 ? get_acl+0x1a/0x100 search_binary_handler+0x90/0x1c0 __do_execve_file.isra.37+0x6b8/0x880 __x64_sys_execve+0x34/0x40 do_syscall_64+0x55/0x110 entry_SYSCALL_64_after_hwframe+0x44/0xa9 ======== But the other thread can't exit until the ptrace parent handles it; and the ptrace parent is stuck in another syscall, and therefore can't wait() for that thread. This can cause debuggers to lock up, and also gets in the way of taking cred_guard_mutex in more places (since that'd make the deadlock worse). IMO at the core of the problem is that the cred_guard_mutex is held across pretty much all of sys_execve(), and in particular across do_thread(). I think that actually, we could drop the cred_guard_mutex before waiting for sibling threads if we make things like the ptrace check a bit more complicated. At the moment, cred_guard_mutex is held across de_thread() and beyond to protect against things like seccomp TSYNC from a sibling thread or ptrace attach. For TSYNC, this is necessary because another thread could try to TSYNC to us as long as we're not done with de_thread(); for ptrace attach, this is necessary because we only switch credentials after de_thread(). But just doing the credential switch earlier would also be dangerous. So here's my idea: When we're done calculating the post-execve creds (which happens before de_thread()), we stash a pointer to the post-exec creds in current->signal (or in current?). Then, in de_thread(), once we know that all our siblings have been zapped, we drop the cred_guard_mutex *before* the loop that waits for the siblings to exit. At that point, sibling threads that are mid-syscall and other processes can try to interact with us again. This means: For accesses that are permitted based on whether the access is same-thread (in particular TSYNC and probably also the same_thread_group() check in ptrace), we'll have to check whether exec creds are stashed in current->signal; if so, we have to bail out. Any error code we return there shouldn't be visible to userspace, since this can only happen to a thread that has already been zapped. For accesses that go through ptrace_may_access(), we'll have to expand the check such that, if an attempt is made to access a task that is currently going through execve, the access check is performed against *both* its old and its new credentials. This also means that the target credentials will have to be plumbed into the LSMs in addition to the target task pointer. Opinions? If nobody thinks that this is an incredibly stupid idea, I'll try to come up with some patches. _______________________________________________ Selinux mailing list Selinux@xxxxxxxxxxxxx To unsubscribe, send email to Selinux-leave@xxxxxxxxxxxxx. To get help, send an email containing "help" to Selinux-request@xxxxxxxxxxxxx.