Re: [PATCHv4] exec: Fix a deadlock in ptrace

[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

 



On Tue, Mar 03, 2020 at 08:08:26AM +0000, Bernd Edlinger wrote:
> On 3/3/20 6:29 AM, Kees Cook wrote:
> > On Tue, Mar 03, 2020 at 04:54:34AM +0000, Bernd Edlinger wrote:
> >> On 3/3/20 3:26 AM, Kees Cook wrote:
> >>> On Mon, Mar 02, 2020 at 10:18:07PM +0000, Bernd Edlinger wrote:
> >>>> [...]
> >>>
> >>> If I'm reading this patch correctly, this changes the lifetime of the
> >>> cred_guard_mutex lock to be:
> >>> 	- during prepare_bprm_creds()
> >>> 	- from flush_old_exec() through install_exec_creds()
> >>> Before, cred_guard_mutex was held from prepare_bprm_creds() through
> >>> install_exec_creds().
> > 
> > BTW, I think the effect of this change (i.e. my paragraph above) should
> > be distinctly called out in the commit log if this solution moves
> > forward.
> > 
> 
> Okay, will do.
> 
> >>> That means, for example, that check_unsafe_exec()'s documented invariant
> >>> is violated:
> >>>     /*
> >>>      * determine how safe it is to execute the proposed program
> >>>      * - the caller must hold ->cred_guard_mutex to protect against
> >>>      *   PTRACE_ATTACH or seccomp thread-sync
> >>>      */
> >>
> >> Oh, right, I haven't understood that hint...
> > 
> > I know no_new_privs is checked there, but I haven't studied the
> > PTRACE_ATTACH part of that comment. If that is handled with the new
> > check, this comment should be updated.
> > 
> 
> Okay, I change that comment to:
> 
> /*
>  * determine how safe it is to execute the proposed program
>  * - the caller must have set ->cred_locked_in_execve to protect against
>  *   PTRACE_ATTACH or seccomp thread-sync
>  */
> 
> >>> I think it also means that the potentially multiple invocations
> >>> of bprm_fill_uid() (via prepare_binprm() via binfmt_script.c and
> >>> binfmt_misc.c) would be changing bprm->cred details (uid, gid) without
> >>> a lock (another place where current's no_new_privs is evaluated).
> >>
> >> So no_new_privs can change from 0->1, but should not
> >> when execve is running.
> >>
> >> As long as the calling thread is in execve it won't do this,
> >> and the only other place, where it may set for other threads
> >> is in seccomp_sync_threads, but that can easily be avoided see below.
> > 
> > Yeah, everything was fine until I had to go complicate things with
> > TSYNC. ;) The real goal is making sure an exec cannot gain privs while
> > later gaining a seccomp filter from an unpriv process. The no_new_privs
> > flag was used to control this, but it required that the filter not get
> > applied during exec.
> > 
> >>> Related, it also means that cred_guard_mutex is unheld for every
> >>> invocation of search_binary_handler() (which can loop via the previously
> >>> mentioned binfmt_script.c and binfmt_misc.c), if any of them have hidden
> >>> dependencies on cred_guard_mutex. (Thought I only see bprm_fill_uid()
> >>> currently.)
> >>>
> >>> For seccomp, the expectations about existing thread states risks races
> >>> too. There are two locks held for TSYNC:
> >>> - current->sighand->siglock is held to keep new threads from
> >>>   appearing/disappearing, which would destroy filter refcounting and
> >>>   lead to memory corruption.
> >>
> >> I don't understand what you mean here.
> >> How can this lead to memory corruption?
> > 
> > Mainly this is a matter of how seccomp manages its filter hierarchy
> > (since the filters are shared through process ancestry), so if a thread
> > appears in the middle of TSYNC it may be racing another TSYNC and break
> > ancestry, leading to bad reference counting on process death, etc.
> > (Though, yes, with refcount_t now, things should never corrupt, just
> > waste memory.)
> > 
> 
> I assume for now, that the current->sighand->siglock held while iterating all
> threads is sufficient here.
> 
> >>> - cred_guard_mutex is held to keep no_new_privs in sync with filters to
> >>>   avoid no_new_privs and filter confusion during exec, which could
> >>>   lead to exploitable setuid conditions (see below).
> >>>
> >>> Just racing a malicious thread during TSYNC is not a very strong
> >>> example (a malicious thread could do lots of fun things to "current"
> >>> before it ever got near calling TSYNC), but I think there is the risk
> >>> of mismatched/confused states that we don't want to allow. One is a
> >>> particularly bad state that could lead to privilege escalations (in the
> >>> form of the old "sendmail doesn't check setuid" flaw; if a setuid process
> >>> has a filter attached that silently fails a priv-dropping setuid call
> >>> and continues execution with elevated privs, it can be tricked into
> >>> doing bad things on behalf of the unprivileged parent, which was the
> >>> primary goal of the original use of cred_guard_mutex with TSYNC[1]):
> >>>
> >>> thread A clones thread B
> >>> thread B starts setuid exec
> >>> thread A sets no_new_privs
> >>> thread A calls seccomp with TSYNC
> >>> thread A in seccomp_sync_threads() sets seccomp filter on self and thread B
> >>> thread B passes check_unsafe_exec() with no_new_privs unset
> >>> thread B reaches bprm_fill_uid() with no_new_privs unset and gains privs
> >>> thread A still in seccomp_sync_threads() sets no_new_privs on thread B
> >>> thread B finishes exec, now running with elevated privs, a filter chosen
> >>>          by thread A, _and_ nnp set (which doesn't matter)
> >>>
> >>> With the original locking, thread B will fail check_unsafe_exec()
> >>> because filter and nnp state are changed together, with "atomicity"
> >>> protected by the cred_guard_mutex.
> >>>
> >>
> >> Ah, good point, thanks!
> >>
> >> This can be fixed by checking current->signal->cred_locked_for_ptrace
> >> while the cred_guard_mutex is locked, like this for instance:
> >>
> >> diff --git a/kernel/seccomp.c b/kernel/seccomp.c
> >> index b6ea3dc..377abf0 100644
> >> --- a/kernel/seccomp.c
> >> +++ b/kernel/seccomp.c
> >> @@ -342,6 +342,9 @@ static inline pid_t seccomp_can_sync_threads(void)
> >>         BUG_ON(!mutex_is_locked(&current->signal->cred_guard_mutex));
> >>         assert_spin_locked(&current->sighand->siglock);
> >>  
> >> +       if (current->signal->cred_locked_for_ptrace)
> >> +               return -EAGAIN;
> >> +
> > 
> > Hmm. I guess something like that could work. TSYNC expects to be able to
> > report _which_ thread wrecked the call, though... I wonder if in_execve
> > could be used to figure out the offending thread. Hm, nope, that would
> > be outside of lock too (and all users are "current" right now, so the
> > lock wasn't needed before).
> > 
> 
> I could move that in_execve = 1 to prepare_bprm_creds, if it really matters,
> but the caller will die quickly and cannot do anything with that information
> when another thread executes execve, right?
> 
> >>         /* Validate all threads being eligible for synchronization. */
> >>         caller = current;
> >>         for_each_thread(caller, thread) {
> >>
> >>
> >>> And this is just the bad state I _can_ see. I'm worried there are more...
> >>>
> >>> All this said, I do see a small similarity here to the work I did to
> >>> stabilize stack rlimits (there was an ongoing problem with making multiple
> >>> decisions for the bprm based on current's state -- but current's state
> >>> was mutable during exec). For this, I saved rlim_stack to bprm and ignored
> >>> current's copy until exec ended and then stored bprm's copy into current.
> >>> If the only problem anyone can see here is the handling of no_new_privs,
> >>> we might be able to solve that similarly, at least disentangling tsync/nnp
> >>> from cred_guard_mutex.
> >>>
> >>
> >> I still think that is solvable with using cred_locked_for_ptrace and
> >> simply make the tsync fail if it would otherwise be blocked.
> > 
> > I wonder if we can find a better name than "cred_locked_for_ptrace"?
> > Maybe "cred_unfinished" or "cred_locked_in_exec" or something?
> > 
> 
> Yeah, I'd go with "cred_locked_in_execve".
> 
> > And the comment on bool cred_locked_for_ptrace should mention that
> > access is only allowed under cred_guard_mutex lock.
> > 
> 
> okay.
> 
> >>>> +	sig->cred_locked_for_ptrace = false;
> > 
> > This is redundant to the zalloc -- I think you can drop it (unless
> > someone wants to keep it for clarify?)
> > 
> 
> I'll remove that here and in init/init_task.c
> 
> > Also, I think cred_locked_for_ptrace needs checking deeper, in
> > __ptrace_may_access(), not in ptrace_attach(), since LOTS of things make
> > calls to ptrace_may_access() holding cred_guard_mutex, expecting that to
> > be sufficient to see a stable version of the thread...
> > 
> 
> No, these need to be addressed individually, but most users just want
> to know if the current credentials are sufficient at this moment, but will
> not change the credentials, as ptrace and TSYNC do. 
> 
> BTW: Not all users have cred_guard_mutex, see mm/migrate.c,
> mm/mempolicy.c, kernel/futex.c, fs/proc/namespaces.c etc.
> So adding an access to cred_locked_for_execve in ptrace_may_access is
> probably not an option.
> 
> However, one nice added value by this change is this:
> 
> void *thread(void *arg)
> {
> 	ptrace(PTRACE_TRACEME, 0,0,0);
> 	return NULL;
> }
> 
> int main(void)
> {
> 	int pid = fork();
> 
> 	if (!pid) {
> 		pthread_t pt;
> 		pthread_create(&pt, NULL, thread, NULL);
> 		pthread_join(pt, NULL);
> 		execlp("echo", "echo", "passed", NULL);
> 	}
> 
> 	sleep(1000);
> 	ptrace(PTRACE_ATTACH, pid, 0,0);
> 	kill(pid, SIGCONT);
> 	return 0;
> }
> 
> cat /proc/3812/stack 
> [<0>] flush_old_exec+0xbf/0x760
> [<0>] load_elf_binary+0x35a/0x16c0
> [<0>] search_binary_handler+0x97/0x1d0
> [<0>] __do_execve_file.isra.40+0x624/0x920
> [<0>] __x64_sys_execve+0x49/0x60
> [<0>] do_syscall_64+0x64/0x220
> [<0>] entry_SYSCALL_64_after_hwframe+0x44/0xa9
> 
> 
> > (I remain very nervous about weakening cred_guard_mutex without
> > addressing the many many users...)
> > 
> 
> They need to be looked at closely, that's pretty clear.
> Most fall in the class, that just the current credentials need
> to stay stable for a certain time.

I remain rather set on wanting some very basic tests with this change.
Imho, looking through tools/testing/selftests again we don't have nearly
enough for these codepaths; not to say none. Basically, if someone wants
to make a change affecting the current problem we should really have at
least a single simple test/reproducer that can be run without digging
through lore. And hopefully over time we'll have more tests.

Christian



[Index of Archives]     [Kernel Newbies]     [Security]     [Netfilter]     [Bugtraq]     [Linux FS]     [Yosemite Forum]     [MIPS Linux]     [ARM Linux]     [Linux Security]     [Linux RAID]     [Samba]     [Video 4 Linux]     [Device Mapper]     [Linux Resources]

  Powered by Linux