On Tue, May 19, 2020 at 05:09:29PM +1000, Aleksa Sarai wrote: > On 2020-05-18, Kees Cook <keescook@xxxxxxxxxxxx> wrote: > > ## fd passing > > > > Background: seccomp users want to be able to install an fd in a > > monitored process during a user_notif to emulate "open" calls (or > > similar), possibly across security boundaries, etc. > > > > On the fd passing front, it seems that gaining pidfd_addfd() is the way > > to go as it allows for generic use not tied to seccomp in particular. > > I expect this feature will be developed orthogonally to seccomp (where > > does this stand, BTW?). However, as Sargun has shown[1], seccomp could > > be friendlier to help with using it. Things that need to be resolved: > > > > - report pidnr, or pidfd? It seems the consensus is to pass pidnr, but > > if we're going to step back and make some design choices here, is > > there a place for pidfds in seccomp user_notif, in order to avoid > > needing the user_notif cookie? I think probably not: it's a rather lot > > of overhead for notifications. It seems it's safe to perform an fd > > installation with these steps: > > - get pidnr from user_notif_recv > > - open pidfd from pidnr > > - re-verify user_notif cookie is still valid > > - send new fd via pidfd > > - reply with user_notif_send > > - close pidfd > > I agree that while it would be conceptually nicer to not need the > user_notif cookie, creating a new pidfd for every notification isn't a > good idea (not to mention that the program may not care about the pid or > pidfd when determining its response -- making the pidfd more of a pain > than anything else). > > You'd also need to have special-casing based on the userspace length of > user_notif because old programs wouldn't know they'd have to close the > pidfd returned (which then brings us back to the fact that user_notif > doesn't specify what the usize of the struct is a-la > copy_struct_from_user). > > > ## deep argument inspection > > > > Background: seccomp users would like to write filters that traverse > > the user pointers passed into many syscalls, but seccomp can't do this > > dereference for a variety of reasons (mostly involving race conditions and > > rearchitecting the entire kernel syscall and copy_from_user() code flows). > > > > During the last plumbers and in conversations since, the grudging > > consensus was reached that having seccomp do this for ALL syscalls was > > likely going to be extremely disruptive for very little gain (i.e. > > many things, like pathnames, have differing lifetimes, aliases, unstable > > kernel object references, etc[6]), but that there were a small subset of > > syscalls for which this WOULD be beneficial, and those are the newly > > created "Extensible Argument" syscalls (is there a better name for this > > design? I'm calling it "EA" for the rest of the email), like clone3(), > > openat2(), etc, which pass a pointer and a size: > > I called them "extensible struct" syscalls, but "extensible arguments" > is probably a better name. Though if you want to call them EA, make sure > to include the ™. ;) > > > long clone3(struct clone_args *cl_args, size_t size); > > > > I think it should be possible to extend seccomp to examine this structure > > by appending it to seccomp_data, and allowing filters to examine the > > contents. This means that no BPF language extensions are required for > > seccomp, as I'd still prefer to avoid making the eBPF jump (I don't think > > seccomp's design principles work well with maps, kernel helpers, etc, > > and I think the earlier the examination of using eBPF for user_notif > > bares this out). > > While I agree that we don't need to jump to eBPF for this, I do think > we'll need an opcode for is_zeroed() (or bpf_check_uarg_tail_zero()) -- > see below. > > > In order for this to work, there are a number of prerequisites: > > > > - argument caching, in two halves: syscall side and seccomp side: > > - the EA syscalls needs to include awareness of potential seccomp > > hooking. i.e. seccomp may have done the copy_from_user() already and > > kept a cached copy. > > - seccomp needs to potentially DO the copy_from_user() itself when it > > hits these syscalls for a given filter, and put it somewhere for > > later use by the syscall. > > > > The argument caching bit is, I think, rather mechanical in nature since > > it's all "just" internal to the kernel: seccomp can likely adjust how it > > allocates seccomp_data (maybe going so far as to have it split across two > > pages with the syscall argument struct always starting on the 2nd page > > boundary), and copying the EA struct into that page, which will be both > > used by the filter and by the syscall. I imagine state tracking ("is > > there a cached EA?", "what is the address of seccomp_data?", "what is > > the address of the EA?") can be associated with the thread struct. > > I agree that while this is an important thing to handle properly, it's > internal and probably doesn't deserve the lions share of design > discussion. > > > - the sizes of these EA structs are, by design, growable over time. > > seccomp and its users need to be handle this in a forward and backward > > compatible way, similar to the design of the EA syscall interface > > itself. > > > > The growing size of the EA struct will need some API design. For filters > > to operate on the contiguous seccomp_data+EA struct, the filter will > > need to know how large seccomp_data is (more on this later), and how > > large the EA struct is. When the filter is written in userspace, it can > > do the math, point into the expected offsets, and get what it needs. For > > this to work correctly in the kernel, though, the seccomp BPF verifier > > needs to know the size of the EA struct as well, so it can correctly > > perform the offset checking (as it currently does for just the > > seccomp_data struct size). > > > > Since there is not really any caller-based "seccomp state" associated > > across seccomp(2) calls, I don't think we can add a new command to tell > > the kernel "I'm expecting the EA struct size to be $foo bytes", since > > the kernel doesn't track who "I" is besides just being "current", which > > doesn't take into account the thread lifetime -- if a process launcher > > knows about one size and the child knows about another, things will get > > confused. The sizes really are just associated with individual filters, > > based on the syscalls they're examining. So, I have thoughts on possible > > solutions: > > > > - create a new seccomp command SECCOMP_SET_MODE_FILTER2 which uses the > > EA style so we can pass in more than a filter and include also an > > array of syscall to size mappings. (I don't like this...) > > - create a new filter flag, SECCOMP_FILTER_FLAG_EXTENSIBLE, which changes > > the meaning of the uarg from "filter" to a EA-style structure with > > sizes and pointers to the filter and an array of syscall to size > > mappings. (I like this slightly better, but I still don't like it.) > > - leverage the EA design and just accept anything <= PAGE_SIZE, record > > the "max offset" value seen during filter verification, and zero-fill > > the EA struct with zeros to that size when constructing the > > seccomp_data + EA struct that the filter will examine. Then the seccomp > > filter doesn't care what any of the sizes are, and userspace doesn't > > care what any of the sizes are. (I like this as it makes the problems > > to solve contained entirely by the seccomp infrastructure and does not > > touch user API, but I worry I'm missing some gotcha I haven't > > considered.) > > Okay, so here is my view on this. I think that the third option is > closest to what I'd like to see. Based on Jann's email, I think we're on > the same page but I'd just like to elaborate it a bit further: > > First of all -- ideally, the backward and forward compatibility that EA > syscalls give us should be reflected with seccomp filters being > similarly compatible. Otherwise we're going to run into issues where all > of the hard work with ensuring EA syscalls behave when extended will be > less valuable if seccomp cannot handle it sufficiently. This means that > I would hope that every combination of {old,new} filter/kernel/program > would work on a best-effort (but fail-safe) basis. > > In my view, the simplest way (from the kernel side) would be to simply > do what you outlined in (3) -- make all accesses past usize (and even > ksize) be zeroed. > > However in order to make an old filter fail-safe on a new kernel with a > new program, we'd need a new opcode which basically does > bpf_check_uarg_tail_zero() after a given offset into the EA struct. This > would punt the fail-safe problem to userspace (libseccomp would need to > generate a check that any unknown-to-the-filter fields are zero). I > don't think this is a decision we can make in-kernel because it might be > that the filter doesn't care about the last field in a struct (and thus > doesn't access it) but we don't know the difference between a field the > filter doesn't care about and a field it doesn't know about. > > Another issue which you haven't mentioned here is how do we deal with > pointers inside an EA struct. clone3() already has this with set_tid, It's also passed with a set_tid_size argument so we'd know how large set_tid is. > and the thread on user_notif seems to be converging towards having the > user_notif2 struct just contain two pointers to EA structs. We could > punt on this for now, so long as we leave room for us to deal with this > problem in the future. However, I think it would be misguided to enable > deep argument inspection on syscalls which have such entries (such as > clone3) until we've figured out how we want to handle nested pointers -- We can't really punt on this and should figure this out now. I've pointed to this problem in my other mail as well. Though I currently fail to see why this should be a huge problem to get that working as long as each pointer comes with a known size. > at the very least because the bpf_check_uarg_tail_zero() opcode > semantics need to consider this problem. > > A silly proposal I had was that because the EA would never be bigger > than PAGE_SIZE, we could take the offset into the EA and multiply it by > PAGE_SIZE each time we wanted to do a "dereference" of a pointer in EA. If you have a size to it it should be ok? I'm probably missing something. Christian