Re: [syzbot] [fs?] [mm?] KCSAN: data-race in bprm_execve / copy_fs (4)

Christian Brauner <brauner@xxxxxxxxxx> · Sat, 22 Mar 2025 11:28:37 +0100

On Sat, Mar 22, 2025 at 11:15:44AM +0100, Mateusz Guzik wrote:
> On Fri, Mar 21, 2025 at 11:26:03PM -0700, Kees Cook wrote:
> > On Sat, Mar 22, 2025 at 01:00:08AM +0000, Al Viro wrote:
> > > On Fri, Mar 21, 2025 at 09:45:39AM +0100, Christian Brauner wrote:
> > > 
> > > > Afaict, the only way this data race can happen is if we jump to the
> > > > cleanup label and then reset current->fs->in_exec. If the execve was
> > > > successful there's no one to race us with CLONE_FS obviously because we
> > > > took down all other threads.
> > > 
> > > Not really.
> > 
> > Yeah, you found it. Thank you!
> > 
> > > 1) A enters check_unsafe_execve(), sets ->in_exec to 1
> > > 2) B enters check_unsafe_execve(), sets ->in_exec to 1
> > 
> > With 3 threads A, B, and C already running, fs->users == 3, so steps (1)
> > and (2) happily pass.
> > 
> > > 3) A calls exec_binprm(), fails (bad binary)
> > > 4) A clears ->in_exec
> > > 5) C calls clone(2) with CLONE_FS and spawns D - ->in_exec is 0
> > 
> > D's creation bumps fs->users == 4.
> > 
> > > 6) B gets through exec_binprm(), kills A and C, but not D.
> > > 7) B clears ->in_exec, returns
> > > 
> > > Result: B and D share ->fs, B runs suid binary.
> > > 
> > > Had (5) happened prior to (2), (2) wouldn't have set ->in_exec;
> > > had (5) happened prior to (4), clone() would've failed; had
> > > (5) been delayed past (6), there wouldn't have been a thread
> > > to call clone().
> > > 
> > > But in the window between (4) and (6), clone() doesn't see
> > > execve() in progress and check_unsafe_execve() has already
> > > been done, so it hadn't seen the extra thread.
> > > 
> > > IOW, it really is racy.  It's a counter, not a flag.
> > 
> > Yeah, I would agree. Totally untested patch:
> > 
> > diff --git a/fs/exec.c b/fs/exec.c
> > index 506cd411f4ac..988b8621c079 100644
> > --- a/fs/exec.c
> > +++ b/fs/exec.c
> > @@ -1632,7 +1632,7 @@ static void check_unsafe_exec(struct linux_binprm *bprm)
> >  	if (p->fs->users > n_fs)
> >  		bprm->unsafe |= LSM_UNSAFE_SHARE;
> >  	else
> > -		p->fs->in_exec = 1;
> > +		refcount_inc(&p->fs->in_exec);
> >  	spin_unlock(&p->fs->lock);
> >  }
> >  
> > @@ -1862,7 +1862,7 @@ static int bprm_execve(struct linux_binprm *bprm)
> >  
> >  	sched_mm_cid_after_execve(current);
> >  	/* execve succeeded */
> > -	current->fs->in_exec = 0;
> > +	refcount_dec(&current->fs->in_exec);
> >  	current->in_execve = 0;
> >  	rseq_execve(current);
> >  	user_events_execve(current);
> > @@ -1881,7 +1881,7 @@ static int bprm_execve(struct linux_binprm *bprm)
> >  		force_fatal_sig(SIGSEGV);
> >  
> >  	sched_mm_cid_after_execve(current);
> > -	current->fs->in_exec = 0;
> > +	refcount_dec(&current->fs->in_exec);
> >  	current->in_execve = 0;
> >  
> >  	return retval;
> 
> The bump is conditional and with this patch you may be issuing
> refcount_dec when you declined to refcount_inc.
> 
> A special case where there are others to worry about and which proceeds
> with an exec without leaving in any indicators is imo sketchy.
> 
> I would argue it would make the most sense to serialize these execs.

Yeah, I tend to agree.

And fwiw, I had proposed somewhere else already that we should start
restricting thread-group exec to the thread-group leader because
subthread exec is about as ugly as it can get.

I'd like to add a sysctl for this with the goal of removing this
completely in the future. I think there are very few if any legitimate
cases for subthread aka non-thread-group leader exec. It not just
complicates things it also means two task switch TIDs which is super
confusing from userspace (It also complicates pidfd polling but that's
an aside.). I'll wip up a patch for this once I get back from travel.

> 
> Vast majority of programs are single-threaded when they exec with an
> unshared ->fs, so they don't need to bear any overhead nor complexity
> modulo a branch.
> 
> For any fucky case you can park yourself waiting for any pending exec to
> finish.