On Fri, Sep 4, 2020 at 5:36 PM Lokesh Gidra <lokeshgidra@xxxxxxxxxx> wrote:
>
> On Thu, Sep 3, 2020 at 8:34 PM Andrea Arcangeli <aarcange@xxxxxxxxxx> wrote:
> >
> > 1) why don't you enforce the block of kernel initiated faults with
> > seccomp-bpf instead of adding a sysctl value 2? Is the sysctl just
> > an optimization to remove a few instructions per syscall in the bpf
> > execution of Android unprivileged apps? You should block a lot of
> > other syscalls by default to all unprivileged processes, including
> > vmsplice.
> >
> > In other words if it's just for Android, why can't Android solve it
> > with only patch 1/2 by tweaking the seccomp filter?
>
> I would let Nick (nnk@) and Jeff (jeffv@) respond to this.
>
> The previous responses from both of them on this email thread
> (https://lore.kernel.org/lkml/CABXk95A-E4NYqA5qVrPgDF18YW-z4_udzLwa0cdo2OfqVsy=SQ@xxxxxxxxxxxxxx/
> and https://lore.kernel.org/lkml/CAFJ0LnGfrzvVgtyZQ+UqRM6F3M7iXOhTkUBTc+9sV+=RrFntyQ@xxxxxxxxxxxxxx/)
> suggest that the performance overhead of seccomp-bpf is too much. Kees
> also objected to it
> (https://lore.kernel.org/lkml/202005200921.2BD5A0ADD@keescook/)
>
> I'm not familiar with how seccomp-bpf works. All that I can add here
> is that userfaultfd syscall is usually not invoked in a performance
> critical code path. So, if the performance overhead of seccomp-bpf (if
> enabled) is observed on all syscalls originating from a process, then
> I'd say patch 2/2 is essential. Otherwise, it should be ok to let
> seccomp perform the same functionality instead.
>

There are two primary reasons why seccomp isn't viable here.

1) Seccomp was never designed for whole-of-system protections, and is
   impractical to deploy for anything other than "leaf" processes.
2) Attempts to enable seccomp on Android have run into performance
   problems, even for trivial seccomp filters.

Let's go into each one.

Issue #1: Seccomp was never designed for whole-of-system protections,
and is impractical to deploy for anything other than "leaf" processes.

Andrea suggests deploying a seccomp filter purely focused on Android
unprivileged[1] (third party installed) apps. However, the intention
is for this security control to be used system-wide[2]. Only processes
which have a need for kernel initiated faults should be allowed to use
them; all other processes should be denied by default. And when I say
"all" processes, I mean "all" processes, even those which run with
UID=0. Andrea's proposal is akin to a denylist, whereas many modern
distributions (such as Android) use allowlists.

The seemingly obvious solution is to apply a global seccomp filter in
init (PID=1), but it falls down in practice. Seccomp is an incredibly
useful tool, but it wasn't designed to be applied system-wide. Seccomp
is fundamentally hierarchical in nature. A seccomp filter, once
applied, cannot be subsequently relaxed or removed in child processes.
While this restriction is great for leaf processes, it causes problems
for OS designers - a parent process must maintain an unused capability
if any process in the parent's process tree uses that capability. This
makes applying a userfaultfd seccomp filter in init impossible, since
we expect a few of init's children (but not init itself or most of
init's children) to use userfaultfd kernel faults. We end up back at a
whack-a-mole (denylist) problem of trying to modify each individual
process to block userfaultfd kernel faults, defeating the goals of
system-wide protection, and introducing significant complexity into
the system design.

Seccomp should be used in the context where it provides the most value
-- process leaf nodes. But trying to apply seccomp as a system-wide
control just isn't viable.

Lokesh's sysctl proposal doesn't have these problems.
When the sysctl is set to 2 by the OS distributor, all processes which
don't have CAP_SYS_PTRACE are denied kernel generated faults, making
the system safe-by-default. Only processes which are on the OS
distributor's CAP_SYS_PTRACE allowlist (see Android's allowlist at [3])
can generate these faults, and capabilities can be managed without
regard to process hierarchy. This keeps the system minimally
privileged and safe.

Seccomp isn't a viable solution here.

Issue #2: Attempts to enable seccomp on Android globally have run into
performance problems, even for trivial seccomp filters.

Android has tried a few times to enable seccomp globally, but even
excluding the above-mentioned hierarchical process problems, we've
seen performance regressions across the board. Imposing a seccomp
filter merely for userfaultfd imposes a tax on every syscall, even if
the process never makes use of userfaultfd. Lokesh's sysctl proposal
avoids this tax and places the check where it's most effective, with
the rest of the userfaultfd functionality.

See also the threads that Lokesh mentioned above:

* https://lore.kernel.org/lkml/CABXk95A-E4NYqA5qVrPgDF18YW-z4_udzLwa0cdo2OfqVsy=SQ@xxxxxxxxxxxxxx/
* https://lore.kernel.org/lkml/CAFJ0LnGfrzvVgtyZQ+UqRM6F3M7iXOhTkUBTc+9sV+=RrFntyQ@xxxxxxxxxxxxxx/
* https://lore.kernel.org/lkml/202005200921.2BD5A0ADD@keescook/

Thanks,
-- Nick

[1] The use of the term "unprivileged" is unfortunate. In Android,
there's no coarse-grained privileged vs unprivileged process. Each
process, including root processes, has only the privileges it needs,
and not a bit more. As a concrete example, Android's init process
(PID=1) is not allowed to open TCP/UDP sockets, but is allowed to
spawn children which can do so. Having each process be differently
privileged, and ensuring that functionality is only given out on a
need-to-have basis, is an important part of modern OS design.

[2] The trend in modern exploits isn't to perform attacks directly
from untrusted code to the kernel. A lot of the attack surface needed
by an attacker isn't reachable directly from untrusted code, but only
indirectly through other processes. The attacker moves laterally
through the system, exploiting a process which has the necessary
capabilities, then escalating to the kernel. Enforcing security
controls system-wide is an important part of denying an attacker the
tools for an effective exploit and preventing this kind of lateral
movement from being useful. Denying an attacker access to kernel
initiated faults in userfaultfd system-wide (except for authorized
processes) is doubly important, as these kinds of faults are extremely
valuable to an exploit writer (see explanation at
https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=cefdca0a86be517bc390fc4541e3674b8e7803b0
or https://duasynt.com/blog/cve-2016-6187-heap-off-by-one-exploit)

[3] https://android.googlesource.com/platform/system/sepolicy/+/7be9e9e372c70a5518f729a0cdcb0d39a28be377/private/domain.te#107
line 107

-- 
Nick Kralevich | nnk@xxxxxxxxxx