On Tue, Jun 16, 2020 at 9:49 AM Kees Cook <keescook@xxxxxxxxxxxx> wrote: > One of the most common pain points with seccomp filters has been dealing > with the overhead of processing the filters, especially for "always allow" > or "always reject" cases. While BPF is extremely fast[1], it will always > have overhead associated with it. Additionally, due to seccomp's design, > filters are layered, which means processing time goes up as the number > of filters attached goes up. > > In the past, efforts have been focused on making filter execution complete > in a shorter amount of time. For example, filters were rewritten from > using linear if/then/else syscall search to using balanced binary trees, > or moving tests for syscalls common to the process's workload to the > front of the filter. However, there are limits to this, especially when > some processes are dealing with tens of filters[2], or when some > architectures have a less efficient BPF engine[3]. > > The most common use of seccomp, constructing syscall block/allow-lists, > where syscalls that are always allowed or always rejected (without regard > to any arguments), also tends to produce the most pathological runtime > problems, in that a large number of syscall checks in the filter need > to be performed to come to a determination. > > In order to optimize these cases from O(n) to O(1), seccomp can > use bitmaps to immediately determine the desired action. A critical > observation in the prior paragraph bears repeating: the common case for > syscall tests do not check arguments. For any given filter, there is a > constant mapping from the combination of architecture and syscall to the > seccomp action result. (For kernels/architectures without CONFIG_COMPAT, > there is a single architecture.). As such, it is possible to construct > a mapping of arch/syscall to action, which can be updated as new filters > are attached to a process. > > In order to build this mapping at filter attach time, each filter is > executed for every syscall (under each possible architecture), and > checked for any accesses of struct seccomp_data that are not the "arch" > nor "nr" (syscall) members. If only "arch" and "nr" are examined, then > there is a constant mapping for that syscall, and bitmaps can be updated > accordingly. If any accesses happen outside of those struct members, > seccomp must not bypass filter execution for that syscall, since program > state will be used to determine filter action result. > > During syscall action probing, in order to determine whether other members > of struct seccomp_data are being accessed during a filter execution, > the struct is placed across a page boundary with the "arch" and "nr" > members in the first page, and everything else in the second page. The > "page accessed" flag is cleared in the second page's PTE, and the filter > is run. If the "page accessed" flag appears as set after running the > filter, we can determine that the filter looked beyond the "arch" and > "nr" members, and exclude that syscall from the constant action bitmaps. > > For architectures to support this optimization, they must declare > their architectures for seccomp to see (via SECCOMP_ARCH and > SECCOMP_ARCH_COMPAT macros), and provide a way to perform efficient > CPU-local kernel TLB flushes (via local_flush_tlb_kernel_range()), > and then set HAVE_ARCH_SECCOMP_BITMAP in their Kconfig. Wouldn't it be simpler to use a function that can run a subset of seccomp cBPF and bails out on anything that indicates that a syscall's handling is complex or on instructions it doesn't understand? For syscalls that have a fixed policy, a typical seccomp filter doesn't even use any of the BPF_ALU ops, the scratch space, or the X register; it just uses something like the following set of operations, which is easy to emulate without much code: BPF_LD | BPF_W | BPF_ABS BPF_JMP | BPF_JEQ | BPF_K BPF_JMP | BPF_JGE | BPF_K BPF_JMP | BPF_JGT | BPF_K BPF_JMP | BPF_JA BPF_RET | BPF_K Something like (completely untested): /* * Try to statically determine whether @filter will always return a fixed result * when run for syscall @nr under architecture @arch. * Returns true if the result could be determined; if so, the result will be * stored in @action. */ static bool seccomp_check_syscall(struct sock_filter *filter, unsigned int arch, unsigned int nr, unsigned int *action) { int pc; unsigned int reg_value = 0; for (pc = 0; 1; pc++) { struct sock_filter *insn = &filter[pc]; u16 code = insn->code; u32 k = insn->k; switch (code) { case BPF_LD | BPF_W | BPF_ABS: if (k == offsetof(struct seccomp_data, nr)) { reg_value = nr; } else if (k == offsetof(struct seccomp_data, arch)) { reg_value = arch; } else { return false; /* can't optimize (non-constant value load) */ } break; case BPF_RET | BPF_K: *action = insn->k; return true; /* success: reached return with constant values only */ case BPF_JMP | BPF_JA: pc += insn->k; break; case BPF_JMP | BPF_JEQ | BPF_K: case BPF_JMP | BPF_JGE | BPF_K: case BPF_JMP | BPF_JGT | BPF_K: default: if (BPF_CLASS(code) == BPF_JMP && BPF_SRC(code) == BPF_K) { u16 op = BPF_OP(code); bool op_res; switch (op) { case BPF_JEQ: op_res = reg_value == k; break; case BPF_JGE: op_res = reg_value >= k; break; case BPF_JGT: op_res = reg_value > k; break; default: return false; /* can't optimize (unknown insn) */ } pc += op_res ? insn->jt : insn->jf; break; } return false; /* can't optimize (unknown insn) */ } } } That way, you won't need any of this complicated architecture-specific stuff.