On Mon, May 10, 2021 at 12:47 PM Andy Lutomirski <luto@xxxxxxxxxx> wrote:
> On Mon, May 10, 2021 at 10:22 AM YiFei Zhu <zhuyifei1999@xxxxxxxxx> wrote:
> >
> > From: YiFei Zhu <yifeifz2@xxxxxxxxxxxx>
> >
> > Based on: https://lists.linux-foundation.org/pipermail/containers/2018-February/038571.html
> >
> > This patchset enables seccomp filters to be written in eBPF.
> > Supporting eBPF filters has been proposed a few times in the past.
> > The main concerns were (1) use cases and (2) security. We have
> > identified many use cases that can benefit from advanced eBPF
> > filters, such as:
>
> I haven't reviewed this carefully, but I think we need to distinguish
> a few things:
>
> 1. Using the eBPF *language*.
>
> 2. Allowing the use of stateful / non-pure eBPF features.
>
> 3. Allowing the eBPF programs to read the target process' memory.
>
> I'm generally in favor of (1). I'm not at all sure about (2), and I'm
> even less convinced by (3).
>
> >
> > * exec-only-once filter / apply filter after exec
>
> This is (2). I'm not sure it's a good idea.

The basic idea is that a container runtime may want to execute a
program in a container without that program being able to execve
another program, stopping any attack that involves loading another
binary. Using only cBPF, the container runtime can block any syscall
except execve in the exec-ed process, since a filter installed before
the exec must still let that first execve through. The use case was
suggested by Andrea Arcangeli and Giuseppe Scrivano. @Andrea and
@Giuseppe, could you clarify more in case I missed something?

> > * syscall logging (eg. via maps)
>
> This is (2). Probably useful, but doesn't obviously belong in
> seccomp, or at least not as part of the same seccomp feature as
> regular filtering.
>
> > * expressiveness & better tooling (no need for DSLs like easyseccomp)
>
> (1). Sounds good.
>
> > * contained syscall fault injection
>
> (2)? We can already do this with notifiers.

To clarify, “we can already do this with notifiers” isn’t the point here.
We can do almost everything if we have notifiers and ptrace, but they
may impose significant overhead (see the microbenchmark results). The
overhead matters for reproducing / testing certain race conditions. A
syscall that fails quickly in a userspace application could, because
of a race condition, produce a completely different trace than a
syscall that fails after a few context switches. eBPF makes quick
fault injection possible.

> > For security, for an unprivileged caller, our implementation is as
> > restrictive as user notifier + ptrace, in regards to capabilities.
> > eBPF helpers follow the privilege model of original eBPF helpers.
>
> eBPF doesn't really have a privilege model yet. There was a long and
> disappointing thread about this awhile back.

The idea is that “seccomp-eBPF does not make life easier for an
adversary”. Any attack an adversary could mount using seccomp-eBPF
they could also mount with other eBPF features, i.e. it would be an
issue with eBPF in general rather than specifically with seccomp’s use
of eBPF.

Here I am referring to the fact that helper lookup falls back to
bpf_base_func_proto if the caller is unprivileged (!bpf_capable ||
!perfmon_capable). In this case, if the adversary wanted to use eBPF
helpers to perform an attack, they could do it via another
unprivileged prog type.

That said, there are a few additional helpers this patchset is adding:

* get_current_uid_gid
* get_current_pid_tgid

These two provide public information (are namespaces a concern?). I
have no idea what kind of exploit they could add unless the adversary
somehow side-channels the task_struct. But in that case, how is the
reading of task_struct different from how the rest of the kernel reads
task_struct? Though, if knowing the global uid / pid is a concern,
then the eBPF progs will need to keep track of namespaces, and that
might not be trivial.

* probe_read_user
* probe_read_user_str

Reduction to ptrace.
The privilege model of reading another process’s data (via
process_vm_readv or ptrace(PTRACE_PEEK{TEXT,DATA})) is guarded by
PTRACE_MODE_ATTACH_REALCREDS. However, unprivileged seccomp is
safeguarded by no_new_privs, so, unless the caller has a non-uniform
resuid & fsuid, in which case it’s the caller’s failure to relinquish
privileges, the ruid of the seccomp-eBPF executor (the task whose
syscalls are being filtered) would be the same as the ruid of the
applier (the task that set the seccomp mode, at the time of setting
it).

The main concern here is LSMs. LSMs can further restrict the scope of
ptrace, hence I also allow LSMs to deny all “the use of stateful /
non-pure eBPF features”.

As for side channels... copy_from_user_nofault may allow an adversary
to observe what’s in resident memory and what’s swapped out, but the
adversary can already do this by observing the timing of memory
accesses. The non-nofault variant, copy_from_user, is used everywhere
in the kernel, so if an adversary wanted to side-channel the kernel
through copy_from_user against an address, they could already do it by
using a syscall with a pointer that would be passed to copy_from_user.

* task_storage_get
* task_storage_delete

This is what I’m least sure about. The implementation of task_storage
is more complex than the other helpers, and it also assumes a
privileged eBPF loader. It would slightly extend the attack surface.
If this is a big issue then eBPF can emulate such a map by using some
hashmap and having PID as key...

> > Moreover, a mechanism for reading user memory is added. The same
> > prototypes of bpf_probe_read_user{,str} from tracing are used. However,
> > when the loader of bpf program does not have CAP_PTRACE, the helper
> > will return -EPERM if the task under seccomp filter is non-dumpable.
> > The reason for this is that if we perform reduction from seccomp-eBPF
> > to user notifier + ptrace, ptrace requires CAP_PTRACE to read from
> > a non-dumpable process.
> > However, eBPF does not solve the TOCTOU problem
> > of user notifier, so users should not use this to enforce a policy
> > based on memory contents.
>
> What is this for?

Memory reading opens up lots of use cases. For example, logging which
files are being opened without imposing the performance penalty of
strace. Or as an accelerator for user notify emulation, where syscalls
can be rejected on a fast path if we know the memory contents do not
satisfy certain conditions that user notify will check.

YiFei Zhu
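
The accelerator case in the last paragraph can be sketched as a small
userspace model in C. This is only an illustration under stated
assumptions, not code from the patchset: fast_path_check, struct
decision, the verdict names and the path-prefix condition are all
hypothetical, and path_copy stands in for a buffer that a filter might
have filled with bpf_probe_read_user_str. Because of the TOCTOU problem
mentioned in the quoted text, the model only ever rejects on the fast
path; anything that might be allowed is deferred to the user notifier.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical verdicts for this model. A real filter would return
 * SECCOMP_RET_ERRNO or SECCOMP_RET_USER_NOTIF action values instead. */
enum verdict {
	VERDICT_REJECT_FAST,	/* fail the syscall at once, no context switch */
	VERDICT_DEFER_NOTIFY,	/* let the user notifier make the decision */
};

struct decision {
	enum verdict verdict;
	int err;	/* errno for VERDICT_REJECT_FAST, 0 otherwise */
};

/*
 * path_copy models a NUL-terminated snapshot of a path argument read from
 * the target's memory (NULL if the read failed). allowed_prefix is an
 * assumed policy: only paths under this prefix can ever be allowed.
 */
struct decision fast_path_check(const char *path_copy,
				const char *allowed_prefix)
{
	struct decision d = { VERDICT_DEFER_NOTIFY, 0 };

	assert(allowed_prefix != NULL);

	/* Unreadable memory: make no fast-path claim, let the notifier decide. */
	if (path_copy == NULL)
		return d;

	/* The snapshot already fails the condition, so reject without waiting. */
	if (strncmp(path_copy, allowed_prefix, strlen(allowed_prefix)) != 0) {
		d.verdict = VERDICT_REJECT_FAST;
		d.err = EACCES;
	}

	/* Otherwise the snapshot looks fine, but it may still change (TOCTOU),
	 * so the notifier has to make the actual allow decision. */
	return d;
}
```

In this model, fast_path_check("/etc/shadow", "/tmp/") gives an
immediate EACCES rejection, while fast_path_check("/tmp/x", "/tmp/") is
deferred to the notifier. A stale snapshot can at worst cause a spurious
rejection, never an allow.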