Re: [PATCHv4] exec: Fix a deadlock in ptrace

Christian Brauner <christian.brauner@xxxxxxxxxx> · Tue, 3 Mar 2020 09:43:46 +0100

On Tue, Mar 03, 2020 at 09:34:26AM +0100, Christian Brauner wrote:
> On Tue, Mar 03, 2020 at 08:08:26AM +0000, Bernd Edlinger wrote:
> > On 3/3/20 6:29 AM, Kees Cook wrote:
> > > On Tue, Mar 03, 2020 at 04:54:34AM +0000, Bernd Edlinger wrote:
> > >> On 3/3/20 3:26 AM, Kees Cook wrote:
> > >>> On Mon, Mar 02, 2020 at 10:18:07PM +0000, Bernd Edlinger wrote:
> > >>>> [...]
> > >>>
> > >>> If I'm reading this patch correctly, this changes the lifetime of the
> > >>> cred_guard_mutex lock to be:
> > >>> 	- during prepare_bprm_creds()
> > >>> 	- from flush_old_exec() through install_exec_creds()
> > >>> Before, cred_guard_mutex was held from prepare_bprm_creds() through
> > >>> install_exec_creds().
> > > 
> > > BTW, I think the effect of this change (i.e. my paragraph above) should
> > > be distinctly called out in the commit log if this solution moves
> > > forward.
> > > 
> > 
> > Okay, will do.
> > 
> > >>> That means, for example, that check_unsafe_exec()'s documented invariant
> > >>> is violated:
> > >>>     /*
> > >>>      * determine how safe it is to execute the proposed program
> > >>>      * - the caller must hold ->cred_guard_mutex to protect against
> > >>>      *   PTRACE_ATTACH or seccomp thread-sync
> > >>>      */
> > >>
> > >> Oh, right, I haven't understood that hint...
> > > 
> > > I know no_new_privs is checked there, but I haven't studied the
> > > PTRACE_ATTACH part of that comment. If that is handled with the new
> > > check, this comment should be updated.
> > > 
> > 
> > Okay, I change that comment to:
> > 
> > /*
> >  * determine how safe it is to execute the proposed program
> >  * - the caller must have set ->cred_locked_in_execve to protect against
> >  *   PTRACE_ATTACH or seccomp thread-sync
> >  */
> > 
> > >>> I think it also means that the potentially multiple invocations
> > >>> of bprm_fill_uid() (via prepare_binprm() via binfmt_script.c and
> > >>> binfmt_misc.c) would be changing bprm->cred details (uid, gid) without
> > >>> a lock (another place where current's no_new_privs is evaluated).
> > >>
> > >> So no_new_privs can change from 0->1, but should not
> > >> when execve is running.
> > >>
> > >> As long as the calling thread is in execve it won't do this,
> > >> and the only other place, where it may set for other threads
> > >> is in seccomp_sync_threads, but that can easily be avoided see below.
> > > 
> > > Yeah, everything was fine until I had to go complicate things with
> > > TSYNC. ;) The real goal is making sure an exec cannot gain privs while
> > > later gaining a seccomp filter from an unpriv process. The no_new_privs
> > > flag was used to control this, but it required that the filter not get
> > > applied during exec.
> > > 
> > >>> Related, it also means that cred_guard_mutex is unheld for every
> > >>> invocation of search_binary_handler() (which can loop via the previously
> > >>> mentioned binfmt_script.c and binfmt_misc.c), if any of them have hidden
> > >>> dependencies on cred_guard_mutex. (Thought I only see bprm_fill_uid()
> > >>> currently.)
> > >>>
> > >>> For seccomp, the expectations about existing thread states risks races
> > >>> too. There are two locks held for TSYNC:
> > >>> - current->sighand->siglock is held to keep new threads from
> > >>>   appearing/disappearing, which would destroy filter refcounting and
> > >>>   lead to memory corruption.
> > >>
> > >> I don't understand what you mean here.
> > >> How can this lead to memory corruption?
> > > 
> > > Mainly this is a matter of how seccomp manages its filter hierarchy
> > > (since the filters are shared through process ancestry), so if a thread
> > > appears in the middle of TSYNC it may be racing another TSYNC and break
> > > ancestry, leading to bad reference counting on process death, etc.
> > > (Though, yes, with refcount_t now, things should never corrupt, just
> > > waste memory.)
> > > 
> > 
> > I assume for now, that the current->sighand->siglock held while iterating all
> > threads is sufficient here.
> > 
> > >>> - cred_guard_mutex is held to keep no_new_privs in sync with filters to
> > >>>   avoid no_new_privs and filter confusion during exec, which could
> > >>>   lead to exploitable setuid conditions (see below).
> > >>>
> > >>> Just racing a malicious thread during TSYNC is not a very strong
> > >>> example (a malicious thread could do lots of fun things to "current"
> > >>> before it ever got near calling TSYNC), but I think there is the risk
> > >>> of mismatched/confused states that we don't want to allow. One is a
> > >>> particularly bad state that could lead to privilege escalations (in the
> > >>> form of the old "sendmail doesn't check setuid" flaw; if a setuid process
> > >>> has a filter attached that silently fails a priv-dropping setuid call
> > >>> and continues execution with elevated privs, it can be tricked into
> > >>> doing bad things on behalf of the unprivileged parent, which was the
> > >>> primary goal of the original use of cred_guard_mutex with TSYNC[1]):
> > >>>
> > >>> thread A clones thread B
> > >>> thread B starts setuid exec
> > >>> thread A sets no_new_privs
> > >>> thread A calls seccomp with TSYNC
> > >>> thread A in seccomp_sync_threads() sets seccomp filter on self and thread B
> > >>> thread B passes check_unsafe_exec() with no_new_privs unset
> > >>> thread B reaches bprm_fill_uid() with no_new_privs unset and gains privs
> > >>> thread A still in seccomp_sync_threads() sets no_new_privs on thread B
> > >>> thread B finishes exec, now running with elevated privs, a filter chosen
> > >>>          by thread A, _and_ nnp set (which doesn't matter)
> > >>>
> > >>> With the original locking, thread B will fail check_unsafe_exec()
> > >>> because filter and nnp state are changed together, with "atomicity"
> > >>> protected by the cred_guard_mutex.
> > >>>
> > >>
> > >> Ah, good point, thanks!
> > >>
> > >> This can be fixed by checking current->signal->cred_locked_for_ptrace
> > >> while the cred_guard_mutex is locked, like this for instance:
> > >>
> > >> diff --git a/kernel/seccomp.c b/kernel/seccomp.c
> > >> index b6ea3dc..377abf0 100644
> > >> --- a/kernel/seccomp.c
> > >> +++ b/kernel/seccomp.c
> > >> @@ -342,6 +342,9 @@ static inline pid_t seccomp_can_sync_threads(void)
> > >>         BUG_ON(!mutex_is_locked(&current->signal->cred_guard_mutex));
> > >>         assert_spin_locked(&current->sighand->siglock);
> > >>  
> > >> +       if (current->signal->cred_locked_for_ptrace)
> > >> +               return -EAGAIN;
> > >> +
> > > 
> > > Hmm. I guess something like that could work. TSYNC expects to be able to
> > > report _which_ thread wrecked the call, though... I wonder if in_execve
> > > could be used to figure out the offending thread. Hm, nope, that would
> > > be outside of lock too (and all users are "current" right now, so the
> > > lock wasn't needed before).
> > > 
> > 
> > I could move that in_execve = 1 to prepare_bprm_creds, if it really matters,
> > but the caller will die quickly and cannot do anything with that information
> > when another thread executes execve, right?
> > 
> > >>         /* Validate all threads being eligible for synchronization. */
> > >>         caller = current;
> > >>         for_each_thread(caller, thread) {
> > >>
> > >>
> > >>> And this is just the bad state I _can_ see. I'm worried there are more...
> > >>>
> > >>> All this said, I do see a small similarity here to the work I did to
> > >>> stabilize stack rlimits (there was an ongoing problem with making multiple
> > >>> decisions for the bprm based on current's state -- but current's state
> > >>> was mutable during exec). For this, I saved rlim_stack to bprm and ignored
> > >>> current's copy until exec ended and then stored bprm's copy into current.
> > >>> If the only problem anyone can see here is the handling of no_new_privs,
> > >>> we might be able to solve that similarly, at least disentangling tsync/nnp
> > >>> from cred_guard_mutex.
> > >>>
> > >>
> > >> I still think that is solvable with using cred_locked_for_ptrace and
> > >> simply make the tsync fail if it would otherwise be blocked.
> > > 
> > > I wonder if we can find a better name than "cred_locked_for_ptrace"?
> > > Maybe "cred_unfinished" or "cred_locked_in_exec" or something?
> > > 
> > 
> > Yeah, I'd go with "cred_locked_in_execve".
> > 
> > > And the comment on bool cred_locked_for_ptrace should mention that
> > > access is only allowed under cred_guard_mutex lock.
> > > 
> > 
> > okay.
> > 
> > >>>> +	sig->cred_locked_for_ptrace = false;
> > > 
> > > This is redundant to the zalloc -- I think you can drop it (unless
> > > someone wants to keep it for clarify?)
> > > 
> > 
> > I'll remove that here and in init/init_task.c
> > 
> > > Also, I think cred_locked_for_ptrace needs checking deeper, in
> > > __ptrace_may_access(), not in ptrace_attach(), since LOTS of things make
> > > calls to ptrace_may_access() holding cred_guard_mutex, expecting that to
> > > be sufficient to see a stable version of the thread...
> > > 
> > 
> > No, these need to be addressed individually, but most users just want
> > to know if the current credentials are sufficient at this moment, but will
> > not change the credentials, as ptrace and TSYNC do. 
> > 
> > BTW: Not all users have cred_guard_mutex, see mm/migrate.c,
> > mm/mempolicy.c, kernel/futex.c, fs/proc/namespaces.c etc.
> > So adding an access to cred_locked_for_execve in ptrace_may_access is
> > probably not an option.
> > 
> > However, one nice added value by this change is this:
> > 
> > void *thread(void *arg)
> > {
> > 	ptrace(PTRACE_TRACEME, 0,0,0);
> > 	return NULL;
> > }
> > 
> > int main(void)
> > {
> > 	int pid = fork();
> > 
> > 	if (!pid) {
> > 		pthread_t pt;
> > 		pthread_create(&pt, NULL, thread, NULL);
> > 		pthread_join(pt, NULL);
> > 		execlp("echo", "echo", "passed", NULL);
> > 	}
> > 
> > 	sleep(1000);
> > 	ptrace(PTRACE_ATTACH, pid, 0,0);
> > 	kill(pid, SIGCONT);
> > 	return 0;
> > }
> > 
> > cat /proc/3812/stack 
> > [<0>] flush_old_exec+0xbf/0x760
> > [<0>] load_elf_binary+0x35a/0x16c0
> > [<0>] search_binary_handler+0x97/0x1d0
> > [<0>] __do_execve_file.isra.40+0x624/0x920
> > [<0>] __x64_sys_execve+0x49/0x60
> > [<0>] do_syscall_64+0x64/0x220
> > [<0>] entry_SYSCALL_64_after_hwframe+0x44/0xa9
> > 
> > 
> > > (I remain very nervous about weakening cred_guard_mutex without
> > > addressing the many many users...)
> > > 
> > 
> > They need to be looked at closely, that's pretty clear.
> > Most fall in the class, that just the current credentials need
> > to stay stable for a certain time.
> 
> I remain rather set on wanting some very basic tests with this change.
> Imho, looking through tools/testing/selftests again we don't have nearly
> enough for these codepaths; not to say none. Basically, if someone wants
> to make a change affecting the current problem we should really have at
> least a single simple test/reproducer that can be run without digging
> through lore. And hopefully over time we'll have more tests.

Which you added in v4. Which is great! (I should've mentioned this in my
first mail.)
Christian