On Tue, Mar 03, 2020 at 09:34:26AM +0100, Christian Brauner wrote: > On Tue, Mar 03, 2020 at 08:08:26AM +0000, Bernd Edlinger wrote: > > On 3/3/20 6:29 AM, Kees Cook wrote: > > > On Tue, Mar 03, 2020 at 04:54:34AM +0000, Bernd Edlinger wrote: > > >> On 3/3/20 3:26 AM, Kees Cook wrote: > > >>> On Mon, Mar 02, 2020 at 10:18:07PM +0000, Bernd Edlinger wrote: > > >>>> [...] > > >>> > > >>> If I'm reading this patch correctly, this changes the lifetime of the > > >>> cred_guard_mutex lock to be: > > >>> - during prepare_bprm_creds() > > >>> - from flush_old_exec() through install_exec_creds() > > >>> Before, cred_guard_mutex was held from prepare_bprm_creds() through > > >>> install_exec_creds(). > > > > > > BTW, I think the effect of this change (i.e. my paragraph above) should > > > be distinctly called out in the commit log if this solution moves > > > forward. > > > > > > > Okay, will do. > > > > >>> That means, for example, that check_unsafe_exec()'s documented invariant > > >>> is violated: > > >>> /* > > >>> * determine how safe it is to execute the proposed program > > >>> * - the caller must hold ->cred_guard_mutex to protect against > > >>> * PTRACE_ATTACH or seccomp thread-sync > > >>> */ > > >> > > >> Oh, right, I haven't understood that hint... > > > > > > I know no_new_privs is checked there, but I haven't studied the > > > PTRACE_ATTACH part of that comment. If that is handled with the new > > > check, this comment should be updated. > > > > > > > Okay, I change that comment to: > > > > /* > > * determine how safe it is to execute the proposed program > > * - the caller must have set ->cred_locked_in_execve to protect against > > * PTRACE_ATTACH or seccomp thread-sync > > */ > > > > >>> I think it also means that the potentially multiple invocations > > >>> of bprm_fill_uid() (via prepare_binprm() via binfmt_script.c and > > >>> binfmt_misc.c) would be changing bprm->cred details (uid, gid) without > > >>> a lock (another place where current's no_new_privs is evaluated). > > >> > > >> So no_new_privs can change from 0->1, but should not > > >> when execve is running. > > >> > > >> As long as the calling thread is in execve it won't do this, > > >> and the only other place, where it may set for other threads > > >> is in seccomp_sync_threads, but that can easily be avoided see below. > > > > > > Yeah, everything was fine until I had to go complicate things with > > > TSYNC. ;) The real goal is making sure an exec cannot gain privs while > > > later gaining a seccomp filter from an unpriv process. The no_new_privs > > > flag was used to control this, but it required that the filter not get > > > applied during exec. > > > > > >>> Related, it also means that cred_guard_mutex is unheld for every > > >>> invocation of search_binary_handler() (which can loop via the previously > > >>> mentioned binfmt_script.c and binfmt_misc.c), if any of them have hidden > > >>> dependencies on cred_guard_mutex. (Thought I only see bprm_fill_uid() > > >>> currently.) > > >>> > > >>> For seccomp, the expectations about existing thread states risks races > > >>> too. There are two locks held for TSYNC: > > >>> - current->sighand->siglock is held to keep new threads from > > >>> appearing/disappearing, which would destroy filter refcounting and > > >>> lead to memory corruption. > > >> > > >> I don't understand what you mean here. > > >> How can this lead to memory corruption? > > > > > > Mainly this is a matter of how seccomp manages its filter hierarchy > > > (since the filters are shared through process ancestry), so if a thread > > > appears in the middle of TSYNC it may be racing another TSYNC and break > > > ancestry, leading to bad reference counting on process death, etc. > > > (Though, yes, with refcount_t now, things should never corrupt, just > > > waste memory.) > > > > > > > I assume for now, that the current->sighand->siglock held while iterating all > > threads is sufficient here. > > > > >>> - cred_guard_mutex is held to keep no_new_privs in sync with filters to > > >>> avoid no_new_privs and filter confusion during exec, which could > > >>> lead to exploitable setuid conditions (see below). > > >>> > > >>> Just racing a malicious thread during TSYNC is not a very strong > > >>> example (a malicious thread could do lots of fun things to "current" > > >>> before it ever got near calling TSYNC), but I think there is the risk > > >>> of mismatched/confused states that we don't want to allow. One is a > > >>> particularly bad state that could lead to privilege escalations (in the > > >>> form of the old "sendmail doesn't check setuid" flaw; if a setuid process > > >>> has a filter attached that silently fails a priv-dropping setuid call > > >>> and continues execution with elevated privs, it can be tricked into > > >>> doing bad things on behalf of the unprivileged parent, which was the > > >>> primary goal of the original use of cred_guard_mutex with TSYNC[1]): > > >>> > > >>> thread A clones thread B > > >>> thread B starts setuid exec > > >>> thread A sets no_new_privs > > >>> thread A calls seccomp with TSYNC > > >>> thread A in seccomp_sync_threads() sets seccomp filter on self and thread B > > >>> thread B passes check_unsafe_exec() with no_new_privs unset > > >>> thread B reaches bprm_fill_uid() with no_new_privs unset and gains privs > > >>> thread A still in seccomp_sync_threads() sets no_new_privs on thread B > > >>> thread B finishes exec, now running with elevated privs, a filter chosen > > >>> by thread A, _and_ nnp set (which doesn't matter) > > >>> > > >>> With the original locking, thread B will fail check_unsafe_exec() > > >>> because filter and nnp state are changed together, with "atomicity" > > >>> protected by the cred_guard_mutex. > > >>> > > >> > > >> Ah, good point, thanks! > > >> > > >> This can be fixed by checking current->signal->cred_locked_for_ptrace > > >> while the cred_guard_mutex is locked, like this for instance: > > >> > > >> diff --git a/kernel/seccomp.c b/kernel/seccomp.c > > >> index b6ea3dc..377abf0 100644 > > >> --- a/kernel/seccomp.c > > >> +++ b/kernel/seccomp.c > > >> @@ -342,6 +342,9 @@ static inline pid_t seccomp_can_sync_threads(void) > > >> BUG_ON(!mutex_is_locked(¤t->signal->cred_guard_mutex)); > > >> assert_spin_locked(¤t->sighand->siglock); > > >> > > >> + if (current->signal->cred_locked_for_ptrace) > > >> + return -EAGAIN; > > >> + > > > > > > Hmm. I guess something like that could work. TSYNC expects to be able to > > > report _which_ thread wrecked the call, though... I wonder if in_execve > > > could be used to figure out the offending thread. Hm, nope, that would > > > be outside of lock too (and all users are "current" right now, so the > > > lock wasn't needed before). > > > > > > > I could move that in_execve = 1 to prepare_bprm_creds, if it really matters, > > but the caller will die quickly and cannot do anything with that information > > when another thread executes execve, right? > > > > >> /* Validate all threads being eligible for synchronization. */ > > >> caller = current; > > >> for_each_thread(caller, thread) { > > >> > > >> > > >>> And this is just the bad state I _can_ see. I'm worried there are more... > > >>> > > >>> All this said, I do see a small similarity here to the work I did to > > >>> stabilize stack rlimits (there was an ongoing problem with making multiple > > >>> decisions for the bprm based on current's state -- but current's state > > >>> was mutable during exec). For this, I saved rlim_stack to bprm and ignored > > >>> current's copy until exec ended and then stored bprm's copy into current. > > >>> If the only problem anyone can see here is the handling of no_new_privs, > > >>> we might be able to solve that similarly, at least disentangling tsync/nnp > > >>> from cred_guard_mutex. > > >>> > > >> > > >> I still think that is solvable with using cred_locked_for_ptrace and > > >> simply make the tsync fail if it would otherwise be blocked. > > > > > > I wonder if we can find a better name than "cred_locked_for_ptrace"? > > > Maybe "cred_unfinished" or "cred_locked_in_exec" or something? > > > > > > > Yeah, I'd go with "cred_locked_in_execve". > > > > > And the comment on bool cred_locked_for_ptrace should mention that > > > access is only allowed under cred_guard_mutex lock. > > > > > > > okay. > > > > >>>> + sig->cred_locked_for_ptrace = false; > > > > > > This is redundant to the zalloc -- I think you can drop it (unless > > > someone wants to keep it for clarify?) > > > > > > > I'll remove that here and in init/init_task.c > > > > > Also, I think cred_locked_for_ptrace needs checking deeper, in > > > __ptrace_may_access(), not in ptrace_attach(), since LOTS of things make > > > calls to ptrace_may_access() holding cred_guard_mutex, expecting that to > > > be sufficient to see a stable version of the thread... > > > > > > > No, these need to be addressed individually, but most users just want > > to know if the current credentials are sufficient at this moment, but will > > not change the credentials, as ptrace and TSYNC do. > > > > BTW: Not all users have cred_guard_mutex, see mm/migrate.c, > > mm/mempolicy.c, kernel/futex.c, fs/proc/namespaces.c etc. > > So adding an access to cred_locked_for_execve in ptrace_may_access is > > probably not an option. > > > > However, one nice added value by this change is this: > > > > void *thread(void *arg) > > { > > ptrace(PTRACE_TRACEME, 0,0,0); > > return NULL; > > } > > > > int main(void) > > { > > int pid = fork(); > > > > if (!pid) { > > pthread_t pt; > > pthread_create(&pt, NULL, thread, NULL); > > pthread_join(pt, NULL); > > execlp("echo", "echo", "passed", NULL); > > } > > > > sleep(1000); > > ptrace(PTRACE_ATTACH, pid, 0,0); > > kill(pid, SIGCONT); > > return 0; > > } > > > > cat /proc/3812/stack > > [<0>] flush_old_exec+0xbf/0x760 > > [<0>] load_elf_binary+0x35a/0x16c0 > > [<0>] search_binary_handler+0x97/0x1d0 > > [<0>] __do_execve_file.isra.40+0x624/0x920 > > [<0>] __x64_sys_execve+0x49/0x60 > > [<0>] do_syscall_64+0x64/0x220 > > [<0>] entry_SYSCALL_64_after_hwframe+0x44/0xa9 > > > > > > > (I remain very nervous about weakening cred_guard_mutex without > > > addressing the many many users...) > > > > > > > They need to be looked at closely, that's pretty clear. > > Most fall in the class, that just the current credentials need > > to stay stable for a certain time. > > I remain rather set on wanting some very basic tests with this change. > Imho, looking through tools/testing/selftests again we don't have nearly > enough for these codepaths; not to say none. Basically, if someone wants > to make a change affecting the current problem we should really have at > least a single simple test/reproducer that can be run without digging > through lore. And hopefully over time we'll have more tests. Which you added in v4. Which is great! (I should've mentioned this in my first mail.) Christian