Agreed. I like the idea, but we'll have to maintain backwards compat at the
docker/runc level... but that doesn't mean it shouldn't be added. It may just
take a long time to add support.

On Tue, Feb 13, 2018 at 12:02 PM, Jessie Frazelle <me@xxxxxxxxxxxx> wrote:

> On Tue, Feb 13, 2018 at 11:29 AM, Sargun Dhillon <sargun@xxxxxxxxx> wrote:
> > On Tue, Feb 13, 2018 at 7:47 AM, Kees Cook <keescook@xxxxxxxxxxxx> wrote:
> >> On Tue, Feb 13, 2018 at 7:42 AM, Sargun Dhillon <sargun@xxxxxxxxx> wrote:
> >>> This patchset enables seccomp filters to be written in eBPF. Although
> >>> this patchset doesn't introduce much of the functionality enabled by
> >>> eBPF, it lays the groundwork for it.
> >>>
> >>> It also introduces the capability to dump eBPF filters via the PTRACE
> >>> API in order to make it so that CHECKPOINT_RESTORE will be satisfied.
> >>> In the attached samples, there's an example of this. One can then use
> >>> BPF_OBJ_GET_INFO_BY_FD in order to get the actual code of the program,
> >>> and use that at reload time.
> >>>
> >>> The primary reason for not adding maps support in this patchset is
> >>> to avoid introducing new complexities around PR_SET_NO_NEW_PRIVS.
> >>> If we have a map that the BPF program can read, it can potentially
> >>> "change" privileges after running. It seems like doing writes only
> >>> is safe, because it can be pure, and side effect free, and therefore
> >>> not negatively affect PR_SET_NO_NEW_PRIVS. Nonetheless, if we come
> >>> to an agreement, this can be in a follow-up patchset.
> >>
> >> What's the reason for adding eBPF support? seccomp shouldn't need it,
> >> and it only makes the code more complex. I'd rather stick with cBPF
> >> until we have an overwhelmingly good reason to use eBPF as a "native"
> >> seccomp filter language.
> >>
> >> -Kees
> >>
> > Three reasons:
> > 1) The userspace tooling for eBPF is much better than the user space
> > tooling for cBPF. Our use case is specifically to optimize Docker
> > policies.
> > This is roughly what their seccomp policy looks like:
> > https://github.com/moby/moby/blob/master/profiles/seccomp/default.json.
> > It would be much nicer to be able to leverage eBPF to write this in C,
> > or any of the other languages targeting eBPF. In addition, if we
> > have write-only maps, we can exfiltrate information from seccomp, like
> > arguments, and errors in a relatively cheap way compared to cBPF, and
> > then extract this via the bcc stack. Writing cBPF via C macros is a
> > pain, and the off the shelf cBPF libraries are getting no love. The
> > eBPF community is *exploding* with contributions.
>
> Is stage two of this getting runc to support eBPF and docker to change
> the default to be written as eBPF, because I foresee that being a
> problem mainly with the kernel versions people use. The point of that
> patch was to help the most people and as your point in (2) is made
> about performance, that is a trade-off I would be willing to make in
> order to have this functionality on more kernel versions.
>
> The other alternative would be to have docker translate to use eBPF if
> the kernel supported it, but that amount of complexity seems a bit
> unnecessary for a feature that was trying to also be "simple".
>
> Or do you plan on wrapping filters onto processes tangentially from
> the runtime, in which case, that should be totally fine :)
>
> Anyways this is kinda a tangent from the main point of getting it in
> the kernel, just I would hate to see someone having to maintain this
> without there being a path to getting it upstream elsewhere.
>
> >
> > 2) In my testing, which thus far has been very rudimentary, with
> > rewriting the policy that libseccomp generates from the Docker policy
> > to use eBPF, and eBPF maps performs much better than cBPF. The
> > specific case tested was to use a bpf array to look up rules for a
> > particular syscall. In a super trivial test, this was about 5% lower
> > latency than using traditional branches.
> > If you need more evidence of this, I can work a little bit more on
> > the maps related patches, and see if I can get some more
> > benchmarking. From my understanding, we would need to add "sealing"
> > support for maps, in which they can be marked as read-only, and only
> > at that point should an eBPF seccomp program be able to read from
> > them.
> >
> > 3) Eventually, I'd like to use some more advanced capabilities of
> > eBPF, like being able to rewrite arguments safely (not things referred
> > to by pointers, but just plain old arguments).
> >
> >>>
> >>>
> >>> Sargun Dhillon (3):
> >>>   bpf, seccomp: Add eBPF filter capabilities
> >>>   seccomp, ptrace: Add a mechanism to retrieve attached eBPF seccomp
> >>>     filters
> >>>   bpf: Add eBPF seccomp sample programs
> >>>
> >>>  arch/Kconfig                 |   7 ++
> >>>  include/linux/bpf_types.h    |   3 +
> >>>  include/linux/seccomp.h      |  12 +++
> >>>  include/uapi/linux/bpf.h     |   2 +
> >>>  include/uapi/linux/ptrace.h  |   5 +-
> >>>  include/uapi/linux/seccomp.h |  15 ++--
> >>>  kernel/bpf/syscall.c         |   1 +
> >>>  kernel/ptrace.c              |   3 +
> >>>  kernel/seccomp.c             | 185 ++++++++++++++++++++++++++++++++++++++-----
> >>>  samples/bpf/Makefile         |   9 +++
> >>>  samples/bpf/bpf_load.c       |   9 ++-
> >>>  samples/bpf/seccomp1_kern.c  |  17 ++++
> >>>  samples/bpf/seccomp1_user.c  |  34 ++++++++
> >>>  samples/bpf/seccomp2_kern.c  |  24 ++++++
> >>>  samples/bpf/seccomp2_user.c  |  66 +++++++++++++++
> >>>  15 files changed, 362 insertions(+), 30 deletions(-)
> >>>  create mode 100644 samples/bpf/seccomp1_kern.c
> >>>  create mode 100644 samples/bpf/seccomp1_user.c
> >>>  create mode 100644 samples/bpf/seccomp2_kern.c
> >>>  create mode 100644 samples/bpf/seccomp2_user.c
> >>>
> >>> --
> >>> 2.14.1
> >>>
> >>
> >>
> >>
> >> --
> >> Kees Cook
> >> Pixel Security
> >
>
> --
>
>
> Jessie Frazelle
> 4096R / D4C4 DD60 0D66 F65A 8EFC 511E 18F3 685C 0022 BFF3
> pgp.mit.edu
> _______________________________________________
> Containers mailing list
> Containers@xxxxxxxxxxxxxxxxxxxxxxxxxx
> https://lists.linuxfoundation.org/mailman/listinfo/containers
>

--
- Brian Goff