On Tue, Mar 03, 2020 at 08:08:26AM +0000, Bernd Edlinger wrote: > On 3/3/20 6:29 AM, Kees Cook wrote: > > On Tue, Mar 03, 2020 at 04:54:34AM +0000, Bernd Edlinger wrote: > >> On 3/3/20 3:26 AM, Kees Cook wrote: > >>> On Mon, Mar 02, 2020 at 10:18:07PM +0000, Bernd Edlinger wrote: > >>>> [...] > >>> > >>> If I'm reading this patch correctly, this changes the lifetime of the > >>> cred_guard_mutex lock to be: > >>> - during prepare_bprm_creds() > >>> - from flush_old_exec() through install_exec_creds() > >>> Before, cred_guard_mutex was held from prepare_bprm_creds() through > >>> install_exec_creds(). > > > > BTW, I think the effect of this change (i.e. my paragraph above) should > > be distinctly called out in the commit log if this solution moves > > forward. > > > > Okay, will do. > > >>> That means, for example, that check_unsafe_exec()'s documented invariant > >>> is violated: > >>> /* > >>> * determine how safe it is to execute the proposed program > >>> * - the caller must hold ->cred_guard_mutex to protect against > >>> * PTRACE_ATTACH or seccomp thread-sync > >>> */ > >> > >> Oh, right, I haven't understood that hint... > > > > I know no_new_privs is checked there, but I haven't studied the > > PTRACE_ATTACH part of that comment. If that is handled with the new > > check, this comment should be updated. > > > > Okay, I change that comment to: > > /* > * determine how safe it is to execute the proposed program > * - the caller must have set ->cred_locked_in_execve to protect against > * PTRACE_ATTACH or seccomp thread-sync > */ > > >>> I think it also means that the potentially multiple invocations > >>> of bprm_fill_uid() (via prepare_binprm() via binfmt_script.c and > >>> binfmt_misc.c) would be changing bprm->cred details (uid, gid) without > >>> a lock (another place where current's no_new_privs is evaluated). > >> > >> So no_new_privs can change from 0->1, but should not > >> when execve is running. > >> > >> As long as the calling thread is in execve it won't do this, > >> and the only other place, where it may set for other threads > >> is in seccomp_sync_threads, but that can easily be avoided see below. > > > > Yeah, everything was fine until I had to go complicate things with > > TSYNC. ;) The real goal is making sure an exec cannot gain privs while > > later gaining a seccomp filter from an unpriv process. The no_new_privs > > flag was used to control this, but it required that the filter not get > > applied during exec. > > > >>> Related, it also means that cred_guard_mutex is unheld for every > >>> invocation of search_binary_handler() (which can loop via the previously > >>> mentioned binfmt_script.c and binfmt_misc.c), if any of them have hidden > >>> dependencies on cred_guard_mutex. (Thought I only see bprm_fill_uid() > >>> currently.) > >>> > >>> For seccomp, the expectations about existing thread states risks races > >>> too. There are two locks held for TSYNC: > >>> - current->sighand->siglock is held to keep new threads from > >>> appearing/disappearing, which would destroy filter refcounting and > >>> lead to memory corruption. > >> > >> I don't understand what you mean here. > >> How can this lead to memory corruption? > > > > Mainly this is a matter of how seccomp manages its filter hierarchy > > (since the filters are shared through process ancestry), so if a thread > > appears in the middle of TSYNC it may be racing another TSYNC and break > > ancestry, leading to bad reference counting on process death, etc. > > (Though, yes, with refcount_t now, things should never corrupt, just > > waste memory.) > > > > I assume for now, that the current->sighand->siglock held while iterating all > threads is sufficient here. > > >>> - cred_guard_mutex is held to keep no_new_privs in sync with filters to > >>> avoid no_new_privs and filter confusion during exec, which could > >>> lead to exploitable setuid conditions (see below). > >>> > >>> Just racing a malicious thread during TSYNC is not a very strong > >>> example (a malicious thread could do lots of fun things to "current" > >>> before it ever got near calling TSYNC), but I think there is the risk > >>> of mismatched/confused states that we don't want to allow. One is a > >>> particularly bad state that could lead to privilege escalations (in the > >>> form of the old "sendmail doesn't check setuid" flaw; if a setuid process > >>> has a filter attached that silently fails a priv-dropping setuid call > >>> and continues execution with elevated privs, it can be tricked into > >>> doing bad things on behalf of the unprivileged parent, which was the > >>> primary goal of the original use of cred_guard_mutex with TSYNC[1]): > >>> > >>> thread A clones thread B > >>> thread B starts setuid exec > >>> thread A sets no_new_privs > >>> thread A calls seccomp with TSYNC > >>> thread A in seccomp_sync_threads() sets seccomp filter on self and thread B > >>> thread B passes check_unsafe_exec() with no_new_privs unset > >>> thread B reaches bprm_fill_uid() with no_new_privs unset and gains privs > >>> thread A still in seccomp_sync_threads() sets no_new_privs on thread B > >>> thread B finishes exec, now running with elevated privs, a filter chosen > >>> by thread A, _and_ nnp set (which doesn't matter) > >>> > >>> With the original locking, thread B will fail check_unsafe_exec() > >>> because filter and nnp state are changed together, with "atomicity" > >>> protected by the cred_guard_mutex. > >>> > >> > >> Ah, good point, thanks! > >> > >> This can be fixed by checking current->signal->cred_locked_for_ptrace > >> while the cred_guard_mutex is locked, like this for instance: > >> > >> diff --git a/kernel/seccomp.c b/kernel/seccomp.c > >> index b6ea3dc..377abf0 100644 > >> --- a/kernel/seccomp.c > >> +++ b/kernel/seccomp.c > >> @@ -342,6 +342,9 @@ static inline pid_t seccomp_can_sync_threads(void) > >> BUG_ON(!mutex_is_locked(¤t->signal->cred_guard_mutex)); > >> assert_spin_locked(¤t->sighand->siglock); > >> > >> + if (current->signal->cred_locked_for_ptrace) > >> + return -EAGAIN; > >> + > > > > Hmm. I guess something like that could work. TSYNC expects to be able to > > report _which_ thread wrecked the call, though... I wonder if in_execve > > could be used to figure out the offending thread. Hm, nope, that would > > be outside of lock too (and all users are "current" right now, so the > > lock wasn't needed before). > > > > I could move that in_execve = 1 to prepare_bprm_creds, if it really matters, > but the caller will die quickly and cannot do anything with that information > when another thread executes execve, right? > > >> /* Validate all threads being eligible for synchronization. */ > >> caller = current; > >> for_each_thread(caller, thread) { > >> > >> > >>> And this is just the bad state I _can_ see. I'm worried there are more... > >>> > >>> All this said, I do see a small similarity here to the work I did to > >>> stabilize stack rlimits (there was an ongoing problem with making multiple > >>> decisions for the bprm based on current's state -- but current's state > >>> was mutable during exec). For this, I saved rlim_stack to bprm and ignored > >>> current's copy until exec ended and then stored bprm's copy into current. > >>> If the only problem anyone can see here is the handling of no_new_privs, > >>> we might be able to solve that similarly, at least disentangling tsync/nnp > >>> from cred_guard_mutex. > >>> > >> > >> I still think that is solvable with using cred_locked_for_ptrace and > >> simply make the tsync fail if it would otherwise be blocked. > > > > I wonder if we can find a better name than "cred_locked_for_ptrace"? > > Maybe "cred_unfinished" or "cred_locked_in_exec" or something? > > > > Yeah, I'd go with "cred_locked_in_execve". > > > And the comment on bool cred_locked_for_ptrace should mention that > > access is only allowed under cred_guard_mutex lock. > > > > okay. > > >>>> + sig->cred_locked_for_ptrace = false; > > > > This is redundant to the zalloc -- I think you can drop it (unless > > someone wants to keep it for clarify?) > > > > I'll remove that here and in init/init_task.c > > > Also, I think cred_locked_for_ptrace needs checking deeper, in > > __ptrace_may_access(), not in ptrace_attach(), since LOTS of things make > > calls to ptrace_may_access() holding cred_guard_mutex, expecting that to > > be sufficient to see a stable version of the thread... > > > > No, these need to be addressed individually, but most users just want > to know if the current credentials are sufficient at this moment, but will > not change the credentials, as ptrace and TSYNC do. > > BTW: Not all users have cred_guard_mutex, see mm/migrate.c, > mm/mempolicy.c, kernel/futex.c, fs/proc/namespaces.c etc. > So adding an access to cred_locked_for_execve in ptrace_may_access is > probably not an option. > > However, one nice added value by this change is this: > > void *thread(void *arg) > { > ptrace(PTRACE_TRACEME, 0,0,0); > return NULL; > } > > int main(void) > { > int pid = fork(); > > if (!pid) { > pthread_t pt; > pthread_create(&pt, NULL, thread, NULL); > pthread_join(pt, NULL); > execlp("echo", "echo", "passed", NULL); > } > > sleep(1000); > ptrace(PTRACE_ATTACH, pid, 0,0); > kill(pid, SIGCONT); > return 0; > } > > cat /proc/3812/stack > [<0>] flush_old_exec+0xbf/0x760 > [<0>] load_elf_binary+0x35a/0x16c0 > [<0>] search_binary_handler+0x97/0x1d0 > [<0>] __do_execve_file.isra.40+0x624/0x920 > [<0>] __x64_sys_execve+0x49/0x60 > [<0>] do_syscall_64+0x64/0x220 > [<0>] entry_SYSCALL_64_after_hwframe+0x44/0xa9 > > > > (I remain very nervous about weakening cred_guard_mutex without > > addressing the many many users...) > > > > They need to be looked at closely, that's pretty clear. > Most fall in the class, that just the current credentials need > to stay stable for a certain time. I remain rather set on wanting some very basic tests with this change. Imho, looking through tools/testing/selftests again we don't have nearly enough for these codepaths; not to say none. Basically, if someone wants to make a change affecting the current problem we should really have at least a single simple test/reproducer that can be run without digging through lore. And hopefully over time we'll have more tests. Christian