On 24/09/2020 15.58, YiFei Zhu wrote:
> On Thu, Sep 24, 2020 at 8:46 AM Rasmus Villemoes
> <linux@xxxxxxxxxxxxxxxxxx> wrote:
>> But one thing I'm wondering about and I haven't seen addressed anywhere:
>> Why build the bitmap on the kernel side (with all the complexity of
>> having to emulate the filter for all syscalls)? Why can't userspace just
>> hand the kernel "here's a new filter: the syscalls in this bitmap are
>> always allowed noquestionsasked, for the rest, run this bpf". Sure, that
>> might require a new syscall or extending seccomp(2) somewhat, but isn't
>> that a _lot_ simpler? It would probably also mean that the bpf we do get
>> handed is a lot smaller. Userspace might need to pass a couple of
>> bitmaps, one for each relevant arch, but you get the overall idea.
>
> Perhaps. The thing is, the current API expects any filter attaches to
> be "additive". If a new filter gets attached that says "disallow read"
> then no matter whatever has been attached already, "read" shall not be
> allowed at the next syscall, bypassing all previous allowlist bitmaps
> (so you need to emulate the bpf anyways here?). We should also not
> have a API that could let anyone escape the secomp jail. Say "prctl"
> is permitted but "read" is not permitted, one must not be allowed to
> attach a bitmap so that "read" now appears in the allowlist. The only
> way this could potentially work is to attach a BPF filter and a bitmap
> at the same time in the same syscall, which might mean API redesign?

Yes, the man page would read something like

       SECCOMP_SET_MODE_FILTER_BITMAP
              The system calls allowed are defined by a pointer to a
              Berkeley Packet Filter (BPF) passed via args. This
              argument is a pointer to a struct sock_fprog_bitmap;

with that struct containing whatever information/extra pointers needed
for passing the bitmap(s) in addition to the bpf prog.

And SECCOMP_SET_MODE_FILTER would internally just be updated to work
as-if all-zero allow-bitmaps were passed along.
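
To make the idea concrete, here is a rough userspace-only sketch of the
bookkeeping, not a proposed UAPI. Everything prefixed sketch_/SKETCH_ is
hypothetical, including the struct layout and the fixed 512-entry bitmap;
the only things taken from the proposal are that each filter carries an
allow-bitmap next to its bpf prog, that a plain SECCOMP_SET_MODE_FILTER
behaves as if it carried an all-zero bitmap, and that the effective
bitmap is the intersection over the filter stack.

```c
#include <assert.h>
#include <limits.h>
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical size; a real design would use the per-arch syscall count. */
#define SKETCH_NR_SYSCALLS 512
#define SKETCH_WORD_BITS   (sizeof(unsigned long) * CHAR_BIT)
#define SKETCH_WORDS \
	((SKETCH_NR_SYSCALLS + SKETCH_WORD_BITS - 1) / SKETCH_WORD_BITS)

/* Hypothetical stand-in for struct sock_fprog (length + instructions). */
struct sketch_fprog {
	unsigned short len;
	const void *filter;
};

/* Hypothetical layout for the struct sock_fprog_bitmap idea. */
struct sketch_fprog_bitmap {
	struct sketch_fprog prog;           /* bpf to run for everything else */
	unsigned long allow[SKETCH_WORDS];  /* bit set: always allowed */
};

/* Effective allow-bitmap of the whole filter stack. */
struct sketch_stack {
	unsigned long allow[SKETCH_WORDS];
};

/* With no filter attached, no bpf needs to run for any syscall. */
static void sketch_stack_init(struct sketch_stack *s)
{
	memset(s->allow, 0xff, sizeof(s->allow));
}

/* Mark syscall nr as always allowed in one filter's bitmap. */
static void sketch_allow(struct sketch_fprog_bitmap *f, unsigned int nr)
{
	assert(nr < SKETCH_NR_SYSCALLS);
	f->allow[nr / SKETCH_WORD_BITS] |= 1UL << (nr % SKETCH_WORD_BITS);
}

/*
 * Attaching a filter can only clear bits, never set them, so a later
 * filter cannot re-allow something an earlier one wanted inspected.
 */
static void sketch_attach(struct sketch_stack *s,
			  const struct sketch_fprog_bitmap *f)
{
	for (size_t i = 0; i < SKETCH_WORDS; i++)
		s->allow[i] &= f->allow[i];
}

/* True if every filter in the stack unconditionally allows nr. */
static bool sketch_can_skip_bpf(const struct sketch_stack *s, unsigned int nr)
{
	if (nr >= SKETCH_NR_SYSCALLS)
		return false;	/* out of range: always run the bpf */
	return (s->allow[nr / SKETCH_WORD_BITS] >> (nr % SKETCH_WORD_BITS)) & 1UL;
}
```

With this, attaching a legacy filter (all-zero bitmap) clears every bit,
so all syscalls go back through the bpf programs, which keeps the
"additive" semantics without emulating anything at attach time.
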
The internal kernel bitmap would just be the AND of the bitmaps in the
filter stack.

Sure, it's UAPI, so would certainly need more careful thought on details
of just what the arg struct looks like etc. etc., but I was wondering
why it hadn't been discussed at all.

>> I'm also a bit worried about the performance of doing that emulation;
>> that's constant extra overhead for, say, launching a docker container.
>
> IMO, launching a docker container is so expensive this should be negligible.

Regardless, I'd like to see some numbers, certainly for "how much faster
does a getpid() or read() or any of the other syscalls that nobody
disallows get", but also "what's the cost of doing that emulation at
seccomp(2) time".

Rasmus