Hi Jann!

On Tue, Oct 19, 2021 at 07:35:49PM +0200, jannh@xxxxxxxxxx wrote:
> [...] I also want to thank Kees Cook (https://twitter.com/kees_cook)
> for providing feedback on an earlier version of this post (again,
> without implying that he necessarily agrees with everything), [...]

Thanks for sending this! It's going to make a great reference to aim
people at to help them understand why (and how) data-only attacks can
be so tricky to deal with. :) I'll reply to the bits I'd commented on
before with your earlier drafts, now that it's published...

> Attack stage: Freeing the object's page to the page allocator
> [...]
> Attack stage: Reallocating the victim page as a pagetable
> [...]
> Note that nothing in this whole exploit requires us to leak any
> kernel-virtual or physical addresses, partly because we have an
> increment primitive instead of a plain write; and it also doesn't
> involve directly influencing the instruction pointer.

Yup, it's a really nice walk-through of how to get deterministic
control over the allocations. The idea of quarantines came up
before[1], and you quickly showed how to defeat them. I wonder if
there might still be a solution near this idea, though: gaining
type-awareness and (as you'd suggested before) pinning kernel address
regions to specific allocation types (as you discuss later) seems
promising.

> [...]
> Still, in practice, I believe that attack surface reduction
> mechanisms (especially seccomp) are currently some of the most
> important defense mechanisms on Linux.

We agree fully on this. :) I think MAC (e.g. SELinux) has a role here
as well; Android (which uses both) has shown very clearly how
reachability becomes a determining factor in limiting exploitation.

> Against bugs in source code: Compile-time locking validation
> [...]
> The one big downside is that this requires getting the development
> community for the codebase on board with the idea of backfilling and
> maintaining such annotations. And someone has to write the analysis
> tooling that can verify the annotations.

True, but this seems like a reasonable project -- all the things that
improve robustness have non-security benefits too. :)

> Against exploit primitives: Attack primitive reduction via syscall restrictions
> -------------------------------------------------------------------------------
> (Yes, I made up that name because I thought that capturing this under
> "Attack surface reduction" is too muddy.)

I don't think it needs to be limited to just "syscall restrictions".

> [...]
> Attack primitive reduction limits access to code that is suspected or
> known to provide (sometimes very specific) exploitation primitives.
> For example, one might decide to specifically forbid access to FUSE
> and userfaultfd for most code because of their utility for kernel
> exploitation, and, if one of those interfaces is truly needed, design
> a workaround that avoids exposing the attack primitive to userspace.
> This is different from attack surface reduction, where it often makes
> sense to permit access to any feature that a legitimate workload
> wants to use.

Agreed -- there are even things in the kernel that aren't exposed at
all to userspace that make attacks easier, and there's no reason to
keep them around. (e.g. the refactoring of struct timer_list.) As I
discuss later, I think CFI falls into this category -- it tries to
plug a C compiler weakness, in the sense that current machine code
cares not at all about function prototypes. And while this doesn't
matter for normally running the code, it DOES matter when someone is
trying to abuse the results.
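To make the prototype-blindness concrete, here's a tiny userspace
sketch of my own (not from Jann's post, and deliberately undefined
behavior in C -- which is exactly the point): the cast is nonsense at
the type level, yet the machine code executes it happily. A
prototype-aware scheme like Clang's CFI (-fsanitize=cfi, built with
LTO) instruments the indirect call site and traps on the mismatch
instead.

#include <stdio.h>

static long add_one(long x)
{
	return x + 1;
}

int main(void)
{
	/* Deliberately mismatched prototype: two arguments instead of
	 * one. Undefined behavior in C, but common calling conventions
	 * neither know nor care. */
	long (*fp)(long, long) = (long (*)(long, long))add_one;

	/* Without CFI this prints 42; the extra argument lands in a
	 * register that add_one() never reads. With prototype-aware
	 * CFI, this indirect call traps instead. */
	printf("%ld\n", fp(41, 1000));
	return 0;
}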
> A nice example of an attack primitive reduction is the sysctl
> `vm.unprivileged_userfaultfd`, which was first introduced
> (https://git.kernel.org/linus/cefdca0a86be) so that userfaultfd can
> be made completely inaccessible to normal users and was then later
> adjusted (https://git.kernel.org/linus/d0d4730ac2e4) so that users
> can be granted access to part of its functionality without gaining
> the dangerous attack primitive. (But if you can create unprivileged
> user namespaces, you can still use FUSE to get an equivalent effect.)

Right, and given further tightening, FUSE could go away too. Or maybe
a system isn't built with FUSE at all. Narrowing the scope will have a
meaningful impact on some subset of systems.

> Against oops-based oracles: Lockout or panic on crash
> [...]
> that. On the other hand, if some service crashes on a desktop system,
> perhaps that shouldn't cause the whole system to immediately go down
> and make you lose unsaved state - so `panic_on_oops` might be too
> drastic there.
>
> A good solution to this might require a more fine-grained approach.
> [...]

I agree. This is a place where Linus's opinions are very strong[2],
which makes feature creation a bit of a minefield. :( I am open to
ideas, and would love to see things explored. There were some
alternative approaches taken in the recent brute-force-defense
series[3] and the proposed pkill_on_warn patch[4].

> Against UAF access: Deterministic UAF mitigation
> [...]
> In my opinion, this demonstrates that while UAF mitigations do have a
> lot of value (and would have reliably prevented exploitation of this
> specific bug), **a use-after-free is just one possible consequence of
> the symptom class "object state confusion"** (which may or may not be
> the same as the bug class of the root cause). It would be even better
> to enforce rules on object states, and ensure that an object e.g.
> can't be accessed through a "refcounted" reference anymore after the
> refcount has reached zero and has logically transitioned into a state
> like "non-RCU members are exclusively owned by thread performing
> teardown" or "RCU callback pending, non-RCU members are
> uninitialized" or "exclusive access to RCU-protected members granted
> to thread performing teardown, other members are uninitialized". Of
> course, doing this as a runtime mitigation would be even costlier and
> messier than a reliable UAF mitigation; this level of protection is
> probably only realistic with at least some level of annotations and
> static validation.

I think that hardware memory tagging (e.g. ARM's MTE) will have a big
impact in this area. I remain nervous about there being enough bits to
provide sufficiently versioned access to memory, but I think clever
application of tagging can keep out the worst of the confusions.

> Against UAF access: Probabilistic UAF mitigation; pointer leaks
> [...]
> In both these cases, explicitly stripping tag bits would be an
> acceptable workaround because a pointer without tag bits still
> uniquely identifies a memory location; and given that these are very
> special interfaces that intentionally expose some degree of
> information about kernel pointers to userspace, it would be
> reasonable to adjust this code manually.

Please send a patch for this. :) (But seriously, any paths in the
kernel where tags should be cleared but aren't need to be fixed.)
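The fix pattern itself is small. As a hedged sketch (assuming an arm64
MTE-style layout with the 4-bit tag in pointer bits 59:56; in-tree
code should use the existing kasan_reset_tag() helper rather than an
open-coded mask like this):

#include <stdint.h>

/* MTE allocation tags live in pointer bits 59:56 on arm64. */
#define MTE_TAG_SHIFT	56
#define MTE_TAG_MASK	((uintptr_t)0xf << MTE_TAG_SHIFT)

/*
 * Clear the tag before exposing or comparing the pointer: the result
 * still uniquely identifies the memory location, but no longer
 * reveals which tag/version the allocator assigned.
 */
static inline uintptr_t strip_mte_tag(uintptr_t ptr)
{
	return ptr & ~MTE_TAG_MASK;
}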
> A somewhat more interesting example is the behavior of this piece of
> userspace code:
> [...]
> So the values we're seeing have been ordered based on the virtual
> address of the corresponding `struct file`; and SLUB allocates
> `struct file` from order-1
> [...]
> With that knowledge, we can transform those numbers a bit, to show
> the order in which objects were allocated inside each page (excluding
> pages for which we haven't seen all allocations):
> [...]
> And these sequences are almost the same, except that they have been
> rotated around by different amounts. This is exactly the SLUB
> freelist randomization scheme, as introduced in commit 210e7a43fa905
> (https://git.kernel.org/linus/210e7a43fa905)!
> [...]
> So in summary, we can bypass SLUB randomization for the slab from
> which `struct file` is allocated because someone used it as a lookup
> key in a specific type of data structure. This is already fairly
> undesirable if SLUB randomization is supposed to provide protection
> against some types of local attacks for all slabs.
> [...]
> If we introduce a probabilistic use-after-free mitigation that relies
> on attackers not being able to learn whether the uppermost bits of an
> object's address changed after it was reallocated, this data
> structure could also break that. This case is messier than things
> like `kcmp()` because here the address ordering leak stems from a
> standard data structure.

This is just horribly beautiful. It reminds me of [5], which shows how
/proc directory entries are stored in string length order. I have no
idea what the best approach to sanitizing this is going to be...
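For anyone who wants to see the leak pattern in isolation, here's a
hedged userspace sketch (mine, using qsort() as a stand-in for a
kernel tree keyed on object pointers): no address is ever disclosed,
yet observing iteration order alone recovers the relative layout of
the allocations.

#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>

struct obj {
	int alloc_order;	/* a stand-in for an observable identity */
};

/* Compare objects by address, the way a data structure keyed on the
 * object pointer implicitly does. */
static int cmp_ptr(const void *a, const void *b)
{
	uintptr_t pa = (uintptr_t)*(struct obj *const *)a;
	uintptr_t pb = (uintptr_t)*(struct obj *const *)b;

	return (pa > pb) - (pa < pb);
}

int main(void)
{
	struct obj *objs[16];

	/* Allocate in a known order: 0, 1, 2, ... */
	for (int i = 0; i < 16; i++) {
		objs[i] = malloc(sizeof(*objs[i]));
		if (!objs[i])
			return 1;
		objs[i]->alloc_order = i;
	}

	/* "Iterate the tree": visit the objects in address order. */
	qsort(objs, 16, sizeof(objs[0]), cmp_ptr);

	/* This prints the allocation sequence as permuted by the
	 * allocator -- the same signal Jann uses to recover the SLUB
	 * freelist rotation, without printing a single pointer. */
	for (int i = 0; i < 16; i++)
		printf("%d ", objs[i]->alloc_order);
	printf("\n");

	for (int i = 0; i < 16; i++)
		free(objs[i]);
	return 0;
}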
> Control Flow Integrity
> ----------------------
> **I want to explicitly point out that kernel Control Flow Integrity
> would have had no impact at all on this exploit strategy**. By using
> a data-only strategy, we avoid having to leak addresses, avoid having
> to find ROP gadgets for a specific kernel build, and are completely
> unaffected by any defenses that attempt to protect kernel code or
> kernel control flow. Things like getting access to arbitrary files,
> increasing the privileges of a process, and so on don't require
> kernel instruction pointer control.
>
> Like in my last blogpost on Linux kernel exploitation
> (https://googleprojectzero.blogspot.com/2020/02/mitigations-are-attack-surface-too.html)
> (which was about a buggy subsystem that an Android vendor added to
> their downstream kernel), to me, a data-only approach to exploitation
> feels very natural and seems less messy than trying to hijack control
> flow anyway.
>
> Maybe things are different for userspace code; but for attacks by
> userspace against the kernel, I don't currently see a lot of utility
> in CFI because it typically only affects one of many possible methods
> for exploiting a bug. (Although of course there could be specific
> cases where a bug can only be exploited by hijacking control flow,
> e.g. if a type confusion only permits overwriting a function pointer
> and none of the permitted callees make assumptions about input types
> or privileges that could be broken by changing the function pointer.)

I agree that CFI tends to be quite "late" in many attack scenarios,
but I think we don't agree on the value proposition. :) To use your
earlier terms, I view CFI as an "attack primitive reduction" method.
While it would be great to have a distinct way to just block the root
cause of flaws, it's not always possible to cover everything, so there
is a benefit in adding "attack primitive reduction" features. And,
FWIW, I think the kernel continues to take meaningful steps to squash
these "early" flaw sources, e.g. VLA removal, introduction of
refcount_t, FORTIFY_SOURCE, implicit-fallthrough removal,
UBSAN_BOUNDS, etc. Working on attack primitive reduction doesn't
preclude working on making other things more robust against failure.

Attack primitive reduction forces attacks into specific categories,
narrowing their scope and behavior in the process. (e.g. implementing
non-executable memory didn't stop all kernel exploits, but it forced
many attacks into the remaining writable+executable memory, making
these kinds of things tractable to audit (e.g. CONFIG_DEBUG_WX).) CFI
in particular strengthens the "intended" call graph as described in
the C source, compared to the prototype-agnostic "just call into an
address" that results after compilation. No, it is not perfect, but it
does narrow the avenue of attack, and allows for the creation of
defenses that cover the resulting gap.

> Making important data readonly
> [...]
> The problem I see with this approach is that a large portion of the
> things a kernel does are, in some way, critical to the correct
> functioning of the system and system security. MMU state management,
> task scheduling, memory allocation, filesystems
> (https://googleprojectzero.blogspot.com/2020/02/mitigations-are-attack-surface-too.html),
> page cache, IPC, ... - if any one of these parts of the kernel is
> corrupted sufficiently badly, an attacker will probably be able to
> gain access to all user data on the system, or use that corruption to
> feed bogus inputs into one of the subsystems whose own data
> structures are read-only.

Yes, given unlimited resources, even the narrowest of flaws can
ultimately lead to total system compromise. I don't think, however,
that this is a useful way to examine the benefit of defenses. Just
like the rest of software engineering, security defenses are
evolutionary: there isn't going to be a single fix that makes
everything safe, but rather a series of changes that break the problem
down into smaller pieces that can be dealt with progressively. I think
there is value in removing targets (especially PTEs) from the
writable-at-rest memory set: it's another attack surface reduction.

It seems like what you're saying is "the attack surface is so huge
there's no hope of actually removing enough surface to make a
difference." I don't agree, though, since each attack surface has a
different shape and behavior; not all attacks provide the same level
of control over the exposed pathological behavior available for abuse.
And by forcing some design onto the memory accesses, we can challenge
some of the beliefs about how data structures should be classified --
we can start to carve up the giant bucket of kernel heap memory into
separate pieces with documented security boundaries, etc.

And as I've suggested before, attack surface removal appears to have a
meaningful impact on exploit development costs. For example, even
something as coarse-grained as CFG frustrated Tavis a while back. (I'm
not saying he couldn't have found a solution -- I know better -- but
rather that it was going to take more time and he didn't want to spend
it then.)
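As a minimal userspace analogy of taking data out of the
writable-at-rest set (the kernel's real mechanism for the boot-time
case is __ro_after_init; the mmap()/mprotect() dance below is just my
illustration of the principle):

#include <stdio.h>
#include <sys/mman.h>
#include <unistd.h>

int main(void)
{
	long pagesz = sysconf(_SC_PAGESIZE);

	/* A page of "important data", analogous to kernel structures
	 * (policy tables, PTEs) that only need to be written briefly. */
	unsigned long *table = mmap(NULL, pagesz, PROT_READ | PROT_WRITE,
				    MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
	if (table == MAP_FAILED)
		return 1;

	table[0] = 1;	/* initialize while still writable */

	/* Seal it: a stray write primitive now faults loudly instead
	 * of silently retargeting the data. */
	if (mprotect(table, pagesz, PROT_READ))
		return 1;

	printf("sealed value: %lu\n", table[0]);
	/* table[0] = 2; -- would SIGSEGV here */
	return 0;
}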
> [...]
> I think that the current situation of software security could be
> dramatically improved - in a world where a little bug in some random
> kernel subsystem can lead to a full system compromise, the kernel
> can't provide reliable security isolation. Security engineers should
> be able to focus on things like buggy permission checks and core
> memory management correctness, and not have to spend their time
> dealing with issues in code that ought to not have any relevance to
> system security.

Agreed -- this is why shedding as many of C's dangers as possible is
where I've been trying to focus recent efforts, and I've been
delighted to see the Rust efforts coming in to remove C entirely. :)

Thanks again for this excellent write-up!

-Kees

[1] https://lore.kernel.org/lkml/ace0028d-99c6-cc70-accf-002e70f8523b@xxxxxxxxx/
[2] https://lore.kernel.org/lkml/CA+55aFy6jNLsywVYdGp83AMrXBo_P-pkjkphPGrO=82SPKCpLQ@xxxxxxxxxxxxxx/
[3] https://lore.kernel.org/lkml/20210307113031.11671-6-john.wood@xxxxxxx/
[4] https://lore.kernel.org/lkml/20210929185823.499268-1-alex.popov@xxxxxxxxx/
[5] https://twitter.com/_monoid/status/1449321535869788162

--
Kees Cook