On Sun, Sep 20, 2020 at 10:35 PM YiFei Zhu <zhuyifei1999@xxxxxxxxx> wrote: > > From: YiFei Zhu <yifeifz2@xxxxxxxxxxxx> > > This series adds a bitmap to cache seccomp filter results if the > result permits a syscall and is indepenent of syscall arguments. > This visibly decreases seccomp overhead for most common seccomp > filters with very little memory footprint. > > The overhead of running Seccomp filters has been part of some past > discussions [1][2][3]. Oftentimes, the filters have a large number > of instructions that check syscall numbers one by one and jump based > on that. Some users chain BPF filters which further enlarge the > overhead. A recent work [6] comprehensively measures the Seccomp > overhead and shows that the overhead is non-negligible and has a > non-trivial impact on application performance. > > We propose SECCOMP_CACHE, a cache-based solution to minimize the > Seccomp overhead. The basic idea is to cache the result of each > syscall check to save the subsequent overhead of executing the > filters. This is feasible, because the check in Seccomp is stateless. > The checking results of the same syscall ID and argument remains > the same. > > We observed some common filters, such as docker's [4] or > systemd's [5], will make most decisions based only on the syscall > numbers, and as past discussions considered, a bitmap where each bit > represents a syscall makes most sense for these filters. > > In the past Kees proposed [2] to have an "add this syscall to the > reject bitmask". It is indeed much easier to securely make a reject > accelerator to pre-filter syscalls before passing to the BPF > filters, considering it could only strengthen the security provided > by the filter. However, ultimately, filter rejections are an > exceptional / rare case. Here, instead of accelerating what is > rejected, we accelerate what is allowed. In order not to compromise > the security rules the BPF filters defined, any accept-side > accelerator must complement the BPF filters rather than replacing them. > > Statically analyzing BPF bytecode to see if each syscall is going to > always land in allow or reject is more of a rabbit hole, especially > there is no current in-kernel infrastructure to enumerate all the > possible architecture numbers for a given machine. So rather than > doing that, we propose to cache the results after the BPF filters are > run. And since there are filters like docker's who will check > arguments of some syscalls, but not all or none of the syscalls, when > a filter is loaded we analyze it to find whether each syscall is > cacheable (does not access syscall argument or instruction pointer) by > following its control flow graph, and store the result for each filter > in a bitmap. Changes to architecture number or the filter are expected > to be rare and simply cause the cache to be cleared. This solution > shall be fully transparent to userspace. Long-term, do you believe static analysis will be viable? I think that it is the "ideal" solution here, but I agree in that it is more complex. Is there a way to "prime" filters, by giving them a syscall #, and if it has a terminal condition without inspecting args, it turns into a bitmask entry viable? > > Ongoing work is to further support arguments with fast hash table > lookups. We are investigating the performance of doing so [6], and how > to best integrate with the existing seccomp infrastructure. > > We have done some benchmarks with patch applied against bpf-next > commit 2e80be60c465 ("libbpf: Fix compilation warnings for 64-bit printf args"). > > Me, in qemu-kvm x86_64 VM, on Intel(R) Core(TM) i5-8250U CPU @ 1.60GHz, > average results: > > Without cache, seccomp_benchmark: > Current BPF sysctl settings: > net.core.bpf_jit_enable = 1 > net.core.bpf_jit_harden = 0 > Calibrating sample size for 15 seconds worth of syscalls ... > Benchmarking 23486415 syscalls... > 16.079642020 - 1.013345439 = 15066296581 (15.1s) > getpid native: 641 ns > 32.080237410 - 16.080763500 = 15999473910 (16.0s) > getpid RET_ALLOW 1 filter: 681 ns > 48.609461618 - 32.081296173 = 16528165445 (16.5s) > getpid RET_ALLOW 2 filters: 703 ns > Estimated total seccomp overhead for 1 filter: 40 ns > Estimated total seccomp overhead for 2 filters: 62 ns > Estimated seccomp per-filter overhead: 22 ns > Estimated seccomp entry overhead: 18 ns > > With cache: > Current BPF sysctl settings: > net.core.bpf_jit_enable = 1 > net.core.bpf_jit_harden = 0 > Calibrating sample size for 15 seconds worth of syscalls ... > Benchmarking 23486415 syscalls... > 16.059512499 - 1.014108434 = 15045404065 (15.0s) > getpid native: 640 ns > 31.651075934 - 16.060637323 = 15590438611 (15.6s) > getpid RET_ALLOW 1 filter: 663 ns > 47.367316169 - 31.652302661 = 15715013508 (15.7s) > getpid RET_ALLOW 2 filters: 669 ns > Estimated total seccomp overhead for 1 filter: 23 ns > Estimated total seccomp overhead for 2 filters: 29 ns > Estimated seccomp per-filter overhead: 6 ns > Estimated seccomp entry overhead: 17 ns > > Depending on the run estimated seccomp overhead for 2 filters can be > less than seccomp overhead for 1 filter, resulting in underflow to > estimated seccomp per-filter overhead: > Estimated total seccomp overhead for 1 filter: 27 ns > Estimated total seccomp overhead for 2 filters: 21 ns > Estimated seccomp per-filter overhead: 18446744073709551610 ns > Estimated seccomp entry overhead: 33 ns > > Jack Chen has also run some benchmarks on a bare metal > Intel(R) Xeon(R) CPU E3-1240 v3 @ 3.40GHz, with side channel > mitigations off (spec_store_bypass_disable=off spectre_v2=off mds=off > pti=off l1tf=off), with BPF JIT on and docker default profile, > and reported: > > unixbench syscall mix (https://github.com/kdlucas/byte-unixbench) > unconfined: 33295685 > docker default: 20661056 60% > docker default + cache: 25719937 30% > > Patch 1 introduces the static analyzer to check for a given filter, > whether the CFG loads the syscall arguments for each syscall number. > > Patch 2 implements the bitmap cache. > > [1] https://lore.kernel.org/linux-security-module/c22a6c3cefc2412cad00ae14c1371711@xxxxxxxxxx/T/ > [2] https://lore.kernel.org/lkml/202005181120.971232B7B@keescook/T/ > [3] https://github.com/seccomp/libseccomp/issues/116 > [4] https://github.com/moby/moby/blob/ae0ef82b90356ac613f329a8ef5ee42ca923417d/profiles/seccomp/default.json > [5] https://github.com/systemd/systemd/blob/6743a1caf4037f03dc51a1277855018e4ab61957/src/shared/seccomp-util.c#L270 > [6] Draco: Architectural and Operating System Support for System Call Security > https://tianyin.github.io/pub/draco.pdf, MICRO-53, Oct. 2020 > > YiFei Zhu (2): > seccomp/cache: Add "emulator" to check if filter is arg-dependent > seccomp/cache: Cache filter results that allow syscalls > > arch/x86/Kconfig | 27 +++ > include/linux/seccomp.h | 22 +++ > kernel/seccomp.c | 400 +++++++++++++++++++++++++++++++++++++++- > 3 files changed, 446 insertions(+), 3 deletions(-) > > -- > 2.28.0 > _______________________________________________ > Containers mailing list > Containers@xxxxxxxxxxxxxxxxxxxxxxxxxx > https://lists.linuxfoundation.org/mailman/listinfo/containers _______________________________________________ Containers mailing list Containers@xxxxxxxxxxxxxxxxxxxxxxxxxx https://lists.linuxfoundation.org/mailman/listinfo/containers