On Wed, Feb 14, 2018 at 8:30 PM, Alexei Starovoitov <alexei.starovoitov@xxxxxxxxx> wrote:
> On Wed, Feb 14, 2018 at 10:32:22AM -0700, Tycho Andersen wrote:
>> > >
>> > > What's the reason for adding eBPF support? seccomp shouldn't need it,
>> > > and it only makes the code more complex. I'd rather stick with cBPF
>> > > until we have an overwhelmingly good reason to use eBPF as a "native"
>> > > seccomp filter language.
>> > >
>> >
>> > I can think of two fairly strong use cases for eBPF's ability to call
>> > functions: logging and Tycho's user notifier thing.
>>
>> Worth noting that there is one additional thing that I didn't
>> implement, but which would be nice and is probably not possible with
>> eBPF (at least, not without a bunch of additional infrastructure):
>> passing fds back to the tracee from the manager if you intercept
>> socket(), or accept() or something.
>>
>> This could again be accomplished via other means, though it would be a
>> lot nicer to have a primitive for it.
>
> there is bpf_perf_event_output() interface that allows to stream
> arbitrary data from kernel into user space via perf ring buffer.
> User space can epoll on it. We use this in both tracing and networking
> for notifications and streaming data transfers.
> I suspect this can be used for 'logging' too, since it's cheap and fast.
>
> Specifically for android we added bpf_lsm hooks, cookie/uid helpers,
> and read-only maps.
> Lorenzo,
> there was a claim in this thread that bpf is disabled on android.
> Can you please clarify ?
> If it's actually disabled and there is no intent to enable it,
> I'd rather not add any more android specific features to bpf.
>
> What I think is important to understand is that BPF goes through
> very active development. The verifier is constantly getting smarter.
> There is work to add bounded loops, lock/unlock, get/put tracking,
> global/percpu variables, dynamic linking and so on.
> Most of the features are available to root only and unpriv
> has very limited set. Like getting bpf_perf_event_output() to work
> for unpriv will likely require additional verifier work.
>
> So all cool bits will not be usable by seccomp+eBPF and unpriv
> on day one. It's not a lot of work either, but once it's done
> I'd hate to see arguments against adding more verifier features
> just because eBPF is used by seccomp/landlock/other_security_thing.
>
> Also I think the argument that seccomp+eBPF will be faster than
> seccomp+cBPF is a weak one. I bet kpti on/off makes no difference
> under seccomp, since _all_ syscalls are already slow for sandboxed app.
> Instead of making seccomp 5% faster with eBPF, I think it's
> worth looking into extending LSM hooks to cover all syscalls and
> have programmable (bpf or whatever) filtering applied per syscall.
> Like we can have a white list syscall table covered by lsm hooks
> and any other syscall will get into old seccomp-style
> filtering category automatically.
> lsm+bpf would need to follow process hierarchy. It shouldn't be
> a runtime check at syscall entry either, but compile time
> extra branch in SYSCALL_DEFINE for non-whitelisted syscalls.
> There are bunch of other things to figure out, but I think
> the perf win will be bigger than replacing cBPF with eBPF in
> existing seccomp.
>

Given this test program:

    for (i = 10; i < 99999999; i++)
            syscall(__NR_getpid);

If I implement an eBPF filter with PROG_ARRAYs and tail calls, the
numbers are:

    ebpf JIT          12.3% slower than native
    ebpf no JIT       13.6% slower than native
    seccomp JIT       17.6% slower than native
    seccomp no JIT    37%   slower than native

This is using libseccomp for the standard seccomp BPF program. There's
no reasonable way for our workload to know which syscalls come "earlier",
so we can't reorder the rules to put the hot syscalls first.
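For anyone who wants to reproduce the comparison, the test loop above
could be wrapped in a small timing harness along the following lines.
This is only a sketch, not the program that produced the numbers in this
mail: time_getpid_ns() and percent_slower() are hypothetical helper
names, and the use of clock_gettime(CLOCK_MONOTONIC) for timing is an
assumption.

```c
#ifndef _GNU_SOURCE
#define _GNU_SOURCE             /* for syscall() on strict toolchains */
#endif
#include <assert.h>
#include <sys/syscall.h>
#include <time.h>
#include <unistd.h>

/*
 * Hypothetical helper (not from the original mail): issue `iters` raw
 * getpid syscalls and return the mean cost per call in nanoseconds.
 */
static double time_getpid_ns(long iters)
{
	struct timespec start, end;
	double elapsed_ns;
	long i;

	assert(iters > 0);

	clock_gettime(CLOCK_MONOTONIC, &start);
	for (i = 0; i < iters; i++)
		syscall(__NR_getpid);   /* raw syscall, no libc pid caching */
	clock_gettime(CLOCK_MONOTONIC, &end);

	elapsed_ns = (double)(end.tv_sec - start.tv_sec) * 1e9 +
		     (double)(end.tv_nsec - start.tv_nsec);
	return elapsed_ns / (double)iters;
}

/*
 * Hypothetical helper: how much slower `filtered_ns` is than
 * `native_ns`, expressed as a percentage of the native cost.
 */
static double percent_slower(double native_ns, double filtered_ns)
{
	return (filtered_ns - native_ns) / native_ns * 100.0;
}
```

Calling time_getpid_ns() once before a filter is installed and once
after gives the two means to pass to percent_slower(). The absolute
values depend heavily on the machine, the kernel and the mitigation
settings.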
Potentially, libseccomp can be smarter about ordering cases (using
ranges) and use an O(log(n)) search algorithm, but both of these are
micro-optimizations that still scale with the number of syscalls and
per-syscall rules. The nice thing about a PROG_ARRAY is that adding
more filters (syscalls) comes at no cost, whereas there's a tradeoff
any time you add another rule to a traditional seccomp filter.

This was tested on an Amazon M4.16XL running with PCID and KPTI.
_______________________________________________
Containers mailing list
Containers@xxxxxxxxxxxxxxxxxxxxxxxxxx
https://lists.linuxfoundation.org/mailman/listinfo/containers