On 24/09/2020 01.29, Kees Cook wrote:
> rfc: https://lore.kernel.org/lkml/20200616074934.1600036-1-keescook@xxxxxxxxxxxx/
> alternative: https://lore.kernel.org/containers/cover.1600661418.git.yifeifz2@xxxxxxxxxxxx/
> v1:
> - rebase to for-next/seccomp
> - finish X86_X32 support for both pinning and bitmaps
> - replace TLB magic with Jann's emulator
> - add JSET insn
>
> TODO:
> - add ALU|AND insn
> - significantly more testing
>
> Hi,
>
> This is a refresh of my earlier constant action bitmap series. It looks
> like the RFC was missed on the container list, so I've CCed it now. :)
> I'd like to work from this series, as it handles the multi-architecture
> stuff.

So, I agree with Jann's point that the only thing that matters is that
always-allowed syscalls are indeed allowed fast.

But one thing I'm wondering about and I haven't seen addressed anywhere:
Why build the bitmap on the kernel side (with all the complexity of
having to emulate the filter for all syscalls)? Why can't userspace just
hand the kernel "here's a new filter: the syscalls in this bitmap are
always allowed noquestionsasked, for the rest, run this bpf". Sure, that
might require a new syscall or extending seccomp(2) somewhat, but isn't
that a _lot_ simpler? It would probably also mean that the bpf we do get
handed is a lot smaller. Userspace might need to pass a couple of
bitmaps, one for each relevant arch, but you get the overall idea.

I'm also a bit worried about the performance of doing that emulation;
that's constant extra overhead for, say, launching a docker container.

Regardless of how the kernel's bitmap gets created, something like

+	if (nr < NR_syscalls) {
+		if (test_bit(nr, bitmaps->allow)) {
+			*filter_ret = SECCOMP_RET_ALLOW;
+			return true;
+		}

probably wants some nospec protection somewhere to avoid the irony of
seccomp() being used actively by bad guys.

Rasmus
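
To make the nospec remark a bit more concrete, here is a rough userspace
sketch of the kind of index clamping I mean. It is not the patch's code:
every demo_* name and the table size are made up for illustration. In the
kernel one would presumably reach for array_index_nospec() from
<linux/nospec.h> rather than open-coding anything. The sketch only shows
the mask arithmetic (all ones for an in-range index, zero otherwise); a
userspace compiler is free to optimize the masking away, which is why the
kernel helper also hides the index from the optimizer.

```c
#include <limits.h>
#include <stdbool.h>

/* Hypothetical table size for the demo, not the kernel's NR_syscalls. */
#define DEMO_NR_SYSCALLS 512UL
#define BITS_PER_ULONG (sizeof(unsigned long) * CHAR_BIT)
#define DEMO_BITMAP_WORDS \
	((DEMO_NR_SYSCALLS + BITS_PER_ULONG - 1) / BITS_PER_ULONG)

/* Stand-in for the per-filter "always allowed" bitmap. */
struct demo_bitmaps {
	unsigned long allow[DEMO_BITMAP_WORDS];
};

/*
 * Branch-free mask: ~0UL when index < size, 0UL otherwise.
 * Assumes size is non-zero and no larger than LONG_MAX.
 */
static unsigned long demo_index_mask(unsigned long index, unsigned long size)
{
	/* The top bit of (index | (size - 1 - index)) is clear only in range. */
	unsigned long in_range =
		~(index | (size - 1UL - index)) >> (BITS_PER_ULONG - 1);

	return 0UL - in_range;	/* 1 -> all ones, 0 -> zero */
}

/* Mark syscall nr as always allowed; out-of-range numbers are ignored. */
static void demo_allow(struct demo_bitmaps *b, unsigned long nr)
{
	if (nr < DEMO_NR_SYSCALLS)
		b->allow[nr / BITS_PER_ULONG] |= 1UL << (nr % BITS_PER_ULONG);
}

/* Bounds check, then clamp the index before it is used for the load. */
static bool demo_always_allowed(const struct demo_bitmaps *b, unsigned long nr)
{
	if (nr >= DEMO_NR_SYSCALLS)
		return false;

	/*
	 * Architecturally a no-op here; the point is that a mispredicted
	 * bounds check would still see an index forced to 0.
	 */
	nr &= demo_index_mask(nr, DEMO_NR_SYSCALLS);

	return (b->allow[nr / BITS_PER_ULONG] >> (nr % BITS_PER_ULONG)) & 1UL;
}
```

The clamp sits between the bounds check and the bitmap load, which is
where the quoted hunk would need it as well.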