On Thu, Mar 20, 2025 at 01:09:38PM -0700, Kees Cook wrote:
> Hey look another threaded exec bug. :|
>
> On Thu, Mar 20, 2025 at 12:09:36PM -0700, syzbot wrote:
> > ==================================================================
> > BUG: KCSAN: data-race in bprm_execve / copy_fs
> >
> > write to 0xffff8881044f8250 of 4 bytes by task 13692 on cpu 0:
> >  bprm_execve+0x748/0x9c0 fs/exec.c:1884
>
> This is:
>
>         current->fs->in_exec = 0;
>
> And is part of the execve failure path:
>
> out:
> ...
>         if (bprm->point_of_no_return && !fatal_signal_pending(current))
>                 force_fatal_sig(SIGSEGV);
>
>         sched_mm_cid_after_execve(current);
>         current->fs->in_exec = 0;
>         current->in_execve = 0;
>
>         return retval;
> }
>
> >  do_execveat_common+0x769/0x7e0 fs/exec.c:1966
> >  do_execveat fs/exec.c:2051 [inline]
> >  __do_sys_execveat fs/exec.c:2125 [inline]
> >  __se_sys_execveat fs/exec.c:2119 [inline]
> >  __x64_sys_execveat+0x75/0x90 fs/exec.c:2119
> >  x64_sys_call+0x291e/0x2dc0 arch/x86/include/generated/asm/syscalls_64.h:323
> >  do_syscall_x64 arch/x86/entry/common.c:52 [inline]
> >  do_syscall_64+0xc9/0x1c0 arch/x86/entry/common.c:83
> >  entry_SYSCALL_64_after_hwframe+0x77/0x7f
> >
> > read to 0xffff8881044f8250 of 4 bytes by task 13686 on cpu 1:
> >  copy_fs+0x95/0xf0 kernel/fork.c:1770
>
> This is:
>
>         if (fs->in_exec) {
>
> Which is under lock:
>
>         struct fs_struct *fs = current->fs;
>         if (clone_flags & CLONE_FS) {
>                 /* tsk->fs is already what we want */
>                 spin_lock(&fs->lock);
>                 /* "users" and "in_exec" locked for check_unsafe_exec() */
>                 if (fs->in_exec) {
>                         spin_unlock(&fs->lock);
>                         return -EAGAIN;
>                 }
>                 fs->users++;
>                 spin_unlock(&fs->lock);
>
> Does execve need to be taking this lock? The other thing touching it is
> check_unsafe_exec(), which takes the lock. It looks like the
> bprm_execve() lock was removed in commit 8c652f96d385 ("do_execve()
> must not clear fs->in_exec if it was set by another thread") which used
> the return value from check_unsafe_exec():
>
>     When do_execve() succeeds, it is safe to clear ->in_exec
>     unconditionally. It can be set only if we don't share ->fs with
>     another process, and since we already killed all sub-threads either
>     ->in_exec == 0 or we are the only user of this ->fs.
>
>     Also, we do not need fs->lock to clear fs->in_exec.
>
> This logic was updated in commit 9e00cdb091b0 ("exec:check_unsafe_exec:
> kill the dead -EAGAIN and clear_in_exec logic"), which includes this
> rationale:
>
>     2. "out_unmark:" in do_execve_common() is either called
>        under ->cred_guard_mutex, or after de_thread() which
>        kills other threads, so we can't race with sub-thread
>        which could set ->in_exec. And if ->fs is shared with
>        another process ->in_exec should be false anyway.
>
> The de_thread() is part of the "point of no return" in
> begin_new_exec(), reached via exec_binprm(). But the bprm_execve()
> error path is reachable from many paths prior to the point of no
> return.
>
> What I can imagine here is two failing execs racing a fork:
>
> A start execve
> B fork with CLONE_FS
> C start execve, reach check_unsafe_exec(), set fs->in_exec
> A bprm_execve() failure, clear fs->in_exec
> B copy_fs() increment fs->users
> C bprm_execve() failure, clear fs->in_exec
>
> I don't think this is a "real" flaw, though, since the locking is to
> protect a _successful_ execve from a fork (i.e. getting the user count
> right). A successful execve will de_thread(), and I don't see any wrong
> counting of fs->users with regard to thread lifetime.
>
> Did I miss something in the analysis? Should we perform locking anyway,
> or add data race annotations, or something else?
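Before getting to the question: the window sketched above should be easy
to poke at from userspace on a KCSAN-enabled kernel. Completely untested
sketch, and the file name is made up; the A/B/C labels just mirror the
scenario above. Two threads spin on an execve() that fails late (with
ENOEXEC, i.e. after check_unsafe_exec() has run), so bprm_execve() keeps
taking its failure path and clearing fs->in_exec, while a third thread
spins on clone(CLONE_FS), whose copy_fs() reads fs->in_exec under
fs->lock:

#define _GNU_SOURCE
#include <fcntl.h>
#include <pthread.h>
#include <sched.h>
#include <signal.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

static int child_fn(void *arg)
{
        return 0;       /* child exits immediately */
}

static void *failing_exec(void *arg)    /* threads A and C */
{
        char *argv[] = { "not-a-binary", NULL };

        for (;;)        /* fails with ENOEXEC every time */
                execve("./not-a-binary", argv, NULL);
        return NULL;
}

static void *forker(void *arg)          /* thread B */
{
        static char stack[64 * 1024];

        for (;;) {
                /* CLONE_FS makes copy_fs() read fs->in_exec */
                pid_t pid = clone(child_fn, stack + sizeof(stack),
                                  CLONE_FS | SIGCHLD, NULL);
                if (pid > 0)
                        waitpid(pid, NULL, 0);
        }
        return NULL;
}

int main(void)
{
        pthread_t a, b, c;
        int fd;

        /* executable file with no valid binfmt, so execve() fails late */
        fd = open("./not-a-binary", O_CREAT | O_WRONLY | O_TRUNC, 0755);
        write(fd, "garbage", 7);
        close(fd);

        pthread_create(&a, NULL, failing_exec, NULL);
        pthread_create(&b, NULL, forker, NULL);
        pthread_create(&c, NULL, failing_exec, NULL);
        pthread_join(a, NULL);  /* spins forever; interrupt to stop */
        return 0;
}

Build with gcc -pthread and let it run for a while; the report above
should eventually reproduce.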
Afaict, the only way this data race can happen is if we jump to the
cleanup label and then reset current->fs->in_exec. If the execve was
successful, there's obviously no one to race us with CLONE_FS because we
took down all other threads.

I think the logic in commit 9e00cdb091b0 ("exec:check_unsafe_exec: kill
the dead -EAGAIN and clear_in_exec logic") is sound. This is a harmless
data race that can only happen if the execve fails. The worst that can
happen is that a subthread does clone(CLONE_FS) and gets a spurious
-EAGAIN because it raced with the exec'ing subthread resetting
fs->in_exec.

So I think all we need is:

diff --git a/fs/exec.c b/fs/exec.c
index 506cd411f4ac..177acaf196a9 100644
--- a/fs/exec.c
+++ b/fs/exec.c
@@ -1881,7 +1881,13 @@ static int bprm_execve(struct linux_binprm *bprm)
 		force_fatal_sig(SIGSEGV);
 
 	sched_mm_cid_after_execve(current);
-	current->fs->in_exec = 0;
+	/*
+	 * If this execve failed before de_thread() and another
+	 * subthread is concurrently forking with CLONE_FS they race
+	 * with us resetting current->fs->in_exec. This is fine,
+	 * annotate it.
+	 */
+	data_race(current->fs->in_exec = 0);
 	current->in_execve = 0;
 
 	return retval;
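For completeness: data_race() tells KCSAN to ignore races on the
annotated access entirely, which matches the "known benign" reasoning
above. If we'd rather keep the accesses visible to KCSAN while also
ruling out load/store tearing, marking both sides with the ONCE helpers
would be the usual alternative. Illustrative only, not a claim that it's
the better fix:

In bprm_execve():

-	current->fs->in_exec = 0;
+	WRITE_ONCE(current->fs->in_exec, 0);

and in copy_fs(), where the read at least happens under fs->lock:

-	if (fs->in_exec) {
+	if (READ_ONCE(fs->in_exec)) {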