On Mon, May 18, 2020 at 02:04:57PM -0700, Kees Cook wrote: > Hi! > > This is my attempt at a brain-dump on my plans for nearish-term seccomp > features. Welcome to my TED talk... ;) > > These are the things I've been thinking about: > > - fd passing > - deep argument inspection > - changing structure sizes > - syscall bitmasks > > So, diving right in: > > > ## fd passing > > Background: seccomp users want to be able to install an fd in a > monitored process during a user_notif to emulate "open" calls (or > similar), possibly across security boundaries, etc. > > On the fd passing front, it seems that gaining pidfd_addfd() is the way > to go as it allows for generic use not tied to seccomp in particular. > I expect this feature will be developed orthogonally to seccomp (where > does this stand, BTW?). However, as Sargun has shown[1], seccomp could > be friendlier to help with using it. Things that need to be resolved: > > - report pidnr, or pidfd? It seems the consensus is to pass pidnr, but > if we're going to step back and make some design choices here, is > there a place for pidfds in seccomp user_notif, in order to avoid > needing the user_notif cookie? I think probably not: it's a rather lot > of overhead for notifications. It seems it's safe to perform an fd > installation with these steps: > - get pidnr from user_notif_recv > - open pidfd from pidnr > - re-verify user_notif cookie is still valid > - send new fd via pidfd > - reply with user_notif_send > - close pidfd Yep, this looks safe. > - how to deal with changing sizes of the user_notif structures to > include a pidnr. (Which will be its own topic below.) > > > ## deep argument inspection > > Background: seccomp users would like to write filters that traverse > the user pointers passed into many syscalls, but seccomp can't do this > dereference for a variety of reasons (mostly involving race conditions and > rearchitecting the entire kernel syscall and copy_from_user() code flows). > > During the last plumbers and in conversations since, the grudging > consensus was reached that having seccomp do this for ALL syscalls was > likely going to be extremely disruptive for very little gain (i.e. > many things, like pathnames, have differing lifetimes, aliases, unstable > kernel object references, etc[6]), but that there were a small subset of > syscalls for which this WOULD be beneficial, and those are the newly > created "Extensible Argument" syscalls (is there a better name for this > design? I'm calling it "EA" for the rest of the email), like clone3(), > openat2(), etc, which pass a pointer and a size: > > long clone3(struct clone_args *cl_args, size_t size); > > I think it should be possible to extend seccomp to examine this structure > by appending it to seccomp_data, and allowing filters to examine the > contents. This means that no BPF language extensions are required for > seccomp, as I'd still prefer to avoid making the eBPF jump (I don't think > seccomp's design principles work well with maps, kernel helpers, etc, > and I think the earlier the examination of using eBPF for user_notif > bares this out). > > In order for this to work, there are a number of prerequisites: > > - argument caching, in two halves: syscall side and seccomp side: > - the EA syscalls needs to include awareness of potential seccomp > hooking. i.e. seccomp may have done the copy_from_user() already and > kept a cached copy. > - seccomp needs to potentially DO the copy_from_user() itself when it > hits these syscalls for a given filter, and put it somewhere for > later use by the syscall. > - the sizes of these EA structs are, by design, growable over time. > seccomp and its users need to be handle this in a forward and backward > compatible way, similar to the design of the EA syscall interface > itself. > > The argument caching bit is, I think, rather mechanical in nature since > it's all "just" internal to the kernel: seccomp can likely adjust how it > allocates seccomp_data (maybe going so far as to have it split across two > pages with the syscall argument struct always starting on the 2nd page > boundary), and copying the EA struct into that page, which will be both > used by the filter and by the syscall. I imagine state tracking ("is > there a cached EA?", "what is the address of seccomp_data?", "what is > the address of the EA?") can be associated with the thread struct. > > The growing size of the EA struct will need some API design. For filters > to operate on the contiguous seccomp_data+EA struct, the filter will > need to know how large seccomp_data is (more on this later), and how > large the EA struct is. When the filter is written in userspace, it can > do the math, point into the expected offsets, and get what it needs. For > this to work correctly in the kernel, though, the seccomp BPF verifier > needs to know the size of the EA struct as well, so it can correctly > perform the offset checking (as it currently does for just the > seccomp_data struct size). > > Since there is not really any caller-based "seccomp state" associated > across seccomp(2) calls, I don't think we can add a new command to tell > the kernel "I'm expecting the EA struct size to be $foo bytes", since > the kernel doesn't track who "I" is besides just being "current", which > doesn't take into account the thread lifetime -- if a process launcher > knows about one size and the child knows about another, things will get > confused. The sizes really are just associated with individual filters, > based on the syscalls they're examining. So, I have thoughts on possible > solutions: > > - create a new seccomp command SECCOMP_SET_MODE_FILTER2 which uses the > EA style so we can pass in more than a filter and include also an > array of syscall to size mappings. (I don't like this...) > - create a new filter flag, SECCOMP_FILTER_FLAG_EXTENSIBLE, which changes > the meaning of the uarg from "filter" to a EA-style structure with > sizes and pointers to the filter and an array of syscall to size > mappings. (I like this slightly better, but I still don't like it.) > - leverage the EA design and just accept anything <= PAGE_SIZE, record > the "max offset" value seen during filter verification, and zero-fill > the EA struct with zeros to that size when constructing the > seccomp_data + EA struct that the filter will examine. Then the seccomp > filter doesn't care what any of the sizes are, and userspace doesn't > care what any of the sizes are. (I like this as it makes the problems > to solve contained entirely by the seccomp infrastructure and does not > touch user API, but I worry I'm missing some gotcha I haven't > considered.) This sounds pretty nice to me :) > And then, my age-old concern, that maybe doesn't need a solution... I > remain plagued by the lack of pathname inspection. But I think the > ToCToU nature of it means we just cannot do it from seccomp. It does > make filtering openat2()'s EA struct a bit funny... a filter has no idea > what path it applies to... but that doesn't matter because the object > the path points to might change[6] during the syscall. Argh. > > > ## changing structure sizes > > Background: there have been regular needs to add things to various > seccomp structures. Each come with their own unique pains, and solving > this as completely as possible in a future-proof way would be very nice. > > As noted in "fd passing" above, there is a desire to add some useful > things to the user_notif struct (e.g. thread group pid). Similarly, > there have been requests in the past (though I can't find any good > references right now, just my indirect comments[3]) to grow seccomp_data. > Combined with the EA struct work above, there's a clear need for seccomp > to reexamine how it deals with its API structures (and how this > interacts with filters). > > First, let's consider seccomp_data. If we grow it, the EA struct offset > will move, based on the deep arg inspection design above. Alternatively, > we could instead put seccomp_data offset 0, and EA struct at offset > PAGE_SIZE, and treat seccomp_data itself as an EA struct where we let > the filter access whatever it thinks is there, with it being zero-filled > by the kernel. For any values where 0 is valid, there will just need to > be a "is that field valid?" bit before it: > > unsigned long feature_bits; > unsigned long interesting_thing_1; > unsigned long interesting_thing_2; > unsigned long interesting_thing_3; > ... > > and the filter would check feature_bits... > > (However, this needs to be carefully considered given that seccomp_data > is embedded in user_notif... should the EA struct from userspace also be > copied into user_notif? More thoughts on this below...) > > For user_notif, I think we need something in and around these options: > > - make a new API that explicitly follows EA struct design > (and while read()/write() might be easier[4], I tend to agree with > Jann and we need to stick to ioctl(): as Tycho noted, "read/write is > for data". Though I wonder if read() could be used for the notifications, > which ARE data, and use ioctl() for the responses?) > - make a new API that is perf_event_open()-style where fields are > explicitly requested, as Sargun suggested[5]. (This looks like it > might be complex to construct, but would get us by far the most > extensible API.) This already exists, and was my original intent with the current API. You can add new ioctl()s: struct seccomp_interesting_stuff { __u16 size; __u64 id; __u64 interesting_thing_1; __u64 interesting_thing_2; }; ioctl(listener, SECCOMP_IOCTL_GET_INTERESTING_STUFF, &buf); The original intent was that you could either grow one of these structures, or add a new one. It's not clear to me really *why* we need a whole new API where everything fits into one structure. Or, we could build a requested fields approach that you discuss on top. But I don't see a reason to add a whole new RECV2. > - jam whatever we pick into the existing API (we'll be forced to do > SOMETHING to make the old API still work, so, I dunno what that will > look like until we finish the rest of the design). > > If we did a requested-fields approach, what would the user_notif event > block of bytes look like? Would it be entirely dynamic based on the > initial ioctl()? Another design consideration here is that we don't want > the kernel doing tons of work (especially copying) and tossing tons > of stuff into a huge structure that the user doesn't care about. In > addition to explicit fields, maybe the EA struct could be included, > perhaps with specified offset/size, so only the portion the user_notif > user wanted to inspect was copied? > > The complexity of the per-field API is higher, but I think it might be > the most robust and have the greatest chance at being performant. > For example, "send me user_notif but I only care about the pid" would > mean no syscall arguments are copied, etc. > > > ## syscall bitmasks > > Background: the number one bit of feedback on seccomp has been performance > concerns, especially for fast syscalls like futex(). When looking at > where time is spent, it is very clearly spent running the filters (which > I found surprising, given that adding TIF_SECCOMP tended to trip the > "slow path" syscall path (though most architectures these days just > don't have a fast path any more thanks to Meltdown). It would be nice > to make filtering faster without running BPF at all. :) > > Nearly every thread on adding eBPF, for example, has been about trying to > speed up the if/then nature of BPF for finding a syscall that the filter > wants to always accept (or reject). The bulk of most seccomp filters > are pretty simple, usually they're either "reject everything except > $set-of-syscalls", or "accept everything except $set-of-syscalls". The > stuff in between tends to be a mix, like "accept $some, process $these > with argument checks, and reject $remaining". > > In all three cases, the entire seccomp() path could be sped up by having > a syscall bitmask that could be applied before the filters are ever run, > with 3 (actually 2) syscall bitmasks: accept, reject, or process. If > a syscall was in the "accept" bitmask, we immediately exit seccomp and > continue. Otherwise, if it's in the "reject" bitmask, we mark it rejected > and immediately exit seccomp. And finally, we run the filters. In all > ways, doing bitmask math is going to be faster than running the BPF. ;) > > So how would the API for this work? I have two thoughts, and I don't > think they're exclusive: > > - new API for "add this syscall to the reject bitmask". We can't really > do an "accept" bitmask addition without processing the attached > filters... > - process attached filters! Each time a filter is added, have the > BPF verifier do an analysis to determine if there are any static > results, and set bits in the various bitmasks to represent it. > i.e. when seccomp is first enabled on a thread, the "accept" > bitmask is populated with all syscalls, and for each filter, do > [math,simulation,magic] and knock each syscall out of "accept" if This sounds like a recipe for bugs :) > it isn't always accepted, and further see if there are any syscalls > that are always rejected, and mark those in the "reject" bitmask. What if instead you have a reject bitmask which is always fine to add to, and an accept bitmask which is only valid if there are no actual filters attached. Is that enough for people? Tycho _______________________________________________ Containers mailing list Containers@xxxxxxxxxxxxxxxxxxxxxxxxxx https://lists.linuxfoundation.org/mailman/listinfo/containers