On Mon, Jun 1, 2020 at 6:17 AM Lennart Poettering
<lennart@xxxxxxxxxxxxxx> wrote:
> On Fr, 29.05.20 12:27, Kees Cook (keescook@xxxxxxxxxxxx) wrote:
> > # grep ^Seccomp_filters /proc/$(pidof systemd-resolved)/status
> > Seccomp_filters: 32
> >
> > # grep SystemCall /lib/systemd/system/systemd-resolved.service
> > SystemCallArchitectures=native
> > SystemCallErrorNumber=EPERM
> > SystemCallFilter=@system-service
> >
> > I'd like to better understand what they're doing, but haven't had time
> > to dig in. (The systemd devel mailing list requires subscription, so
> > I've directly CCed some systemd folks that have touched seccomp there
> > recently. Hi! The start of this thread is here[4].)
>
> Hmm, so on x86-64 we try to install our seccomp filters three times:
> for the x86-64 syscall ABI, for the i386 syscall ABI and for the x32
> syscall ABI. Not all of the filters we apply work on all ABIs though,
> because syscalls are available on some but not others, or cannot
> sensibly be matched on some (because of socketcall, ipc and such
> multiplexed syscalls).
>
> When we first added support for seccomp filters to systemd we compiled
> everything into a single filter, and let libseccomp apply it to
> different archs. But that didn't work out, since libseccomp doesn't
> tell us when it manages to apply a filter and when not, i.e. to which
> arch it worked and to which arch it didn't. And since we have some
> whitelist and some blacklist filters the internal fallback logic of
> libseccomp doesn't work for us either, since you never know what you
> end up with. So we ended up breaking the different settings up into
> individual filters, and apply them individually and separately for
> each arch, so that we know exactly what we managed to install and what
> not, and what we can then know will properly filter and can check in
> our test suite.
>
> Keeping the filters separate made things a lot easier and simpler to
> debug, and our log output and testing became much less of a black
> box. We know exactly what worked and what didn't, and our tests
> validate each filter.

In situations where the calling application creates multiple per-ABI
filters, the seccomp_merge(3) function can be used to merge the filters
into one. There are some limitations (same byte ordering, filter
attributes, etc.) but in general it should work without problem when
merging x86_64, x32, and x86.

For what it is worth, libseccomp does handle things like the multiplexed
socket syscalls[*] across multiple ABIs, just not quite in the way
Lennart and systemd wanted. It is also possible, although I would be a
bit surprised, that some of systemd's concerns have been resolved in
modern libseccomp. For better or worse, systemd was one of the first
adopters of libseccomp and they had to deal with more than a few bumps
as the library was developed.

[*] Handling the multiplexed syscalls is tricky, especially when one
combines multiple ABIs and the presence of both the multiplexed and
direct-wired syscalls on some kernel versions. Recent libseccomp
versions do handle all these cases, creating multiplexed filters,
direct-wired filters, or both depending on the particular ABI. The
problem comes when you try to wrap all of that up in a single library
API that works regardless of the ABI and kernel version across different
build and runtime environments. This is why we don't support the "exact"
variants of the libseccomp API on filters which contain multiple ABIs:
we simply can't guarantee that we will always be able to filter on the
third argument of socket() in a filter that consists of the x86_64 and
x86 ABIs.
The non-exact API variants create the rules as best they can in this
case, creating three rules in the filter: an x86_64 rule which filters
on the third argument of socket(), an x86 rule which filters on the
third argument of the direct-wired socket(), and an x86 rule which
filters on the multiplexed socketcall(socket) syscall (impossible to
filter on the syscall argument here).

> For systemd-resolved we apply a bunch more filters than just those
> that are the result of SystemCallFilter= and SystemCallArchitectures=
> (SystemCallFilter= itself synthesizes one filter per syscall ABI).

...

> So yeah, if one turns on many of these options in services (and we
> generally turn on everything we can for the services we ship) and then
> multiply that by the archs you end up with quite a bunch.

I'm not sure how systemd is architected with respect to seccomp
filtering, but once again it would seem like seccomp_merge() could be
useful here.

> If we wanted to optimize that in userspace, then libseccomp would have
> to be improved quite substantially to let us know exactly what works
> and what doesn't, and to have sane fallback both when building
> whitelists and blacklists.

It has been quite a while since we last talked about systemd's use of
libseccomp, but the upcoming v2.5.0 release (no date set yet, but think
weeks not months) finally takes a first step towards defining proper
return values on error for the API, no more "negative values on error".
I'm sure there are other things, but I recall this as being one of the
bigger systemd wants.

As an aside, it is always going to be difficult to allow fine grained
control when you have a single libseccomp filter that includes multiple
ABIs; the different ABI oddities are just too great (see comments
above). If you need exacting control of the filter, or ABI specific
handling, then the recommended way is to create those filters
independently and merge them together before loading them into the
kernel or applying any common rules.

> An easy improvement is probably if libseccomp would now start refusing
> to install x32 seccomp filters altogether now that x32 is entirely
> dead? Or are the entrypoints for x32 syscalls still available in the
> kernel? How could userspace figure out if they are available? If
> libseccomp doesn't want to add code for that, we probably could have
> that in systemd itself too...

You can eliminate x32 syscalls today using libseccomp through either the
"BADARCH" filter attribute or through an x32-specific filter that
defaults to KILL/ERRNO/etc. and has no rules (of course you could merge
this x32 filter with your x86_64 filter).

While I don't see us removing the ability to create x32 filters from
libseccomp any time soon (need to support older kernels), I can say that
I would be very happy to see x32 removed from systems. Regardless of
what one may think of the wisdom in creating this ABI, I think we can
agree the implementation was a bit of a hack.

-- 
paul moore
www.paul-moore.com