On Thu, Sep 24, 2020 at 8:46 AM Rasmus Villemoes <linux@xxxxxxxxxxxxxxxxxx> wrote:
> But one thing I'm wondering about and I haven't seen addressed anywhere:
> Why build the bitmap on the kernel side (with all the complexity of
> having to emulate the filter for all syscalls)? Why can't userspace just
> hand the kernel "here's a new filter: the syscalls in this bitmap are
> always allowed noquestionsasked, for the rest, run this bpf". Sure, that
> might require a new syscall or extending seccomp(2) somewhat, but isn't
> that a _lot_ simpler? It would probably also mean that the bpf we do get
> handed is a lot smaller. Userspace might need to pass a couple of
> bitmaps, one for each relevant arch, but you get the overall idea.

Perhaps. The thing is, the current API expects every filter attach to be
"additive". If a new filter gets attached that says "disallow read", then
no matter what has been attached already, "read" shall not be allowed at
the next syscall, bypassing all previous allowlist bitmaps (so you need
to emulate the bpf anyway here?).

We should also not have an API that could let anyone escape the seccomp
jail. Say "prctl" is permitted but "read" is not permitted; one must not
be allowed to attach a bitmap so that "read" now appears in the
allowlist.

The only way this could potentially work is to attach a BPF filter and a
bitmap at the same time in the same syscall, which might mean an API
redesign?

> I'm also a bit worried about the performance of doing that emulation;
> that's constant extra overhead for, say, launching a docker container.

IMO, launching a docker container is so expensive that this should be
negligible.

YiFei Zhu
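
To illustrate the "additive" constraint discussed above, here is a toy Python model. It is not kernel code: the four-entry syscall table, the `FilterStack` class, and its method names are all hypothetical. It assumes each attached layer contributes one allowlist bitmap, and that a syscall may skip the BPF filters only if every layer's bitmap allows it.

```python
# Toy model of additive allowlist bitmaps. Not kernel code: the syscall
# table and the FilterStack class are illustrative assumptions.
SYSCALLS = ["read", "write", "prctl", "exit"]
NR = {name: i for i, name in enumerate(SYSCALLS)}


def bitmap(*allowed):
    """Build an allowlist bitmap with one bit per syscall name."""
    bits = 0
    for name in allowed:
        bits |= 1 << NR[name]
    return bits


class FilterStack:
    """Hypothetical stand-in for a task's chain of attached filters."""

    def __init__(self):
        # With no filters attached, every syscall is allowed.
        self.effective = (1 << len(SYSCALLS)) - 1

    def attach(self, allow_bits):
        # Additive semantics: a new layer can only clear bits. It can
        # never set a bit that an earlier layer already cleared.
        self.effective &= allow_bits

    def always_allowed(self, name):
        """True if the syscall passes every layer's bitmap."""
        return bool((self.effective >> NR[name]) & 1)


stack = FilterStack()
stack.attach(bitmap("write", "prctl", "exit"))  # first layer: no "read"
stack.attach(bitmap("read", "write", "prctl"))  # later layer lists "read"
```

In this model the second attach lists "read", yet `stack.always_allowed("read")` stays false because the first layer already cleared that bit. A later bitmap can narrow the allowlist but cannot widen it.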