On 3/3/20 9:58 AM, Christian Brauner wrote:
> On Mon, Mar 02, 2020 at 06:26:47PM -0800, Kees Cook wrote:
>> On Mon, Mar 02, 2020 at 10:18:07PM +0000, Bernd Edlinger wrote:
>>> This fixes a deadlock in the tracer when tracing a multi-threaded
>>> application that calls execve while more than one thread is running.
>>>
>>> I observed that when running strace on the gcc test suite, it always
>>> blocks after a while, when expect calls execve, because other threads
>>> have to be terminated. They send ptrace events, but the strace is no
>>> longer able to respond, since it is blocked in mm_access.
>>>
>>> The deadlock always happens when strace needs to access the
>>> tracee's process mmap, while another thread in the tracee starts to
>>> execve a child process, but that cannot continue until the
>>> PTRACE_EVENT_EXIT is handled and the WIFEXITED event is received:
>>>
>>> strace D 0 30614 30584 0x00000000
>>> Call Trace:
>>> __schedule+0x3ce/0x6e0
>>> schedule+0x5c/0xd0
>>> schedule_preempt_disabled+0x15/0x20
>>> __mutex_lock.isra.13+0x1ec/0x520
>>> __mutex_lock_killable_slowpath+0x13/0x20
>>> mutex_lock_killable+0x28/0x30
>>> mm_access+0x27/0xa0
>>> process_vm_rw_core.isra.3+0xff/0x550
>>> process_vm_rw+0xdd/0xf0
>>> __x64_sys_process_vm_readv+0x31/0x40
>>> do_syscall_64+0x64/0x220
>>> entry_SYSCALL_64_after_hwframe+0x44/0xa9
>>>
>>> expect D 0 31933 30876 0x80004003
>>> Call Trace:
>>> __schedule+0x3ce/0x6e0
>>> schedule+0x5c/0xd0
>>> flush_old_exec+0xc4/0x770
>>> load_elf_binary+0x35a/0x16c0
>>> search_binary_handler+0x97/0x1d0
>>> __do_execve_file.isra.40+0x5d4/0x8a0
>>> __x64_sys_execve+0x49/0x60
>>> do_syscall_64+0x64/0x220
>>> entry_SYSCALL_64_after_hwframe+0x44/0xa9
>>>
>>> The proposed solution is to take the cred_guard_mutex only
>>> in a critical section at the beginning, and at the end of the
>>> execve function, and let PTRACE_ATTACH fail with EAGAIN while
>>> execve is not complete, but other functions like mm_access are
>>> allowed to complete normally.
>>
>> Sorry to be a bummer, but I don't think this will work. A few more things
>> during the exec process depend on cred_guard_mutex being held.
>>
>> If I'm reading this patch correctly, this changes the lifetime of the
>> cred_guard_mutex lock to be:
>> - during prepare_bprm_creds()
>> - from flush_old_exec() through install_exec_creds()
>> Before, cred_guard_mutex was held from prepare_bprm_creds() through
>> install_exec_creds().
>>
>> That means, for example, that check_unsafe_exec()'s documented invariant
>> is violated:
>> /*
>>  * determine how safe it is to execute the proposed program
>>  * - the caller must hold ->cred_guard_mutex to protect against
>>  *   PTRACE_ATTACH or seccomp thread-sync
>>  */
>> static void check_unsafe_exec(struct linux_binprm *bprm) ...
>> which is looking at no_new_privs as well as other details, and making
>> decisions about the bprm state from the current state.
>>
>> I think it also means that the potentially multiple invocations
>> of bprm_fill_uid() (via prepare_binprm() via binfmt_script.c and
>> binfmt_misc.c) would be changing bprm->cred details (uid, gid) without
>> a lock (another place where current's no_new_privs is evaluated).
>>
>> Related, it also means that cred_guard_mutex is unheld for every
>> invocation of search_binary_handler() (which can loop via the previously
>> mentioned binfmt_script.c and binfmt_misc.c), if any of them have hidden
>> dependencies on cred_guard_mutex. (Though I only see bprm_fill_uid()
>> currently.)
>
> So one issue I see with having to reacquire the cred_guard_mutex might
> be that this would allow tasks holding the cred_guard_mutex to block a
> killed exec'ing task from exiting, right?
>

Yes, maybe, but I think it will not be worse than it is now.
The second time the mutex is acquired, it is taken with
mutex_lock_killable, so at least kill -9 should get the task terminated.


Bernd.