On Mon, May 18, 2020 at 7:53 PM Aleksa Sarai <cyphar@xxxxxxxxxx> wrote: > > On 2020-05-19, Jann Horn <jannh@xxxxxxxxxx> wrote: > > On Mon, May 18, 2020 at 11:05 PM Kees Cook <keescook@xxxxxxxxxxxx> wrote: > > > ## deep argument inspection > > > > > > Background: seccomp users would like to write filters that traverse > > > the user pointers passed into many syscalls, but seccomp can't do this > > > dereference for a variety of reasons (mostly involving race conditions and > > > rearchitecting the entire kernel syscall and copy_from_user() code flows). > > > > Also, other than for syscall entry, it might be worth thinking about > > whether we want to have a special hook into seccomp for io_uring. > > io_uring is growing support for more and more syscalls, including > > things like openat2, connect, sendmsg, splice and so on, and that list > > is probably just going to grow in the future. If people start wanting > > to use io_uring in software with seccomp filters, it might be > > necessary to come up with some mechanism to prevent io_uring from > > permitting access to almost everything else... > > > > Probably not a big priority for now, but something to keep in mind for > > the future. > > Indeed. Quite a few people have raised concerns about io_uring and its > debug-ability, but I agree that another less-commonly-mentioned concern > should be how you restrict io_uring(2) from doing operations you've > disallowed through seccomp. Though obviously user_notif shouldn't be > allowed. :D > > > > The argument caching bit is, I think, rather mechanical in nature since > > > it's all "just" internal to the kernel: seccomp can likely adjust how it > > > allocates seccomp_data (maybe going so far as to have it split across two > > > pages with the syscall argument struct always starting on the 2nd page > > > boundary), and copying the EA struct into that page, which will be both > > > used by the filter and by the syscall. > > > > We could also do the same kind of thing the eBPF verifier does in > > convert_ctx_accesses(), and rewrite the context accesses to actually > > go through two different pointers depending on the (constant) offset > > into seccomp_data. > > My main worry with this is that we'll need to figure out what kind of > offset mathematics are necessary to deal with pointers inside the > extensible struct. As a very ugly proposal, you could make it so that > you multiply the offset by PAGE_SIZE each time you want to dereference > the pointer at that offset (unless we want to add new opcodes to cBPF to > allow us to represent this). Please don't. cbpf is frozen. > > This might even be needed for seccomp user_notif -- given one of the > recent proposals was basically to just add two (extensible) struct > pointers inside the main user_notif struct. > > > > I imagine state tracking ("is > > > there a cached EA?", "what is the address of seccomp_data?", "what is > > > the address of the EA?") can be associated with the thread struct. > > > > You probably mean the task struct? > > > > > The growing size of the EA struct will need some API design. For filters > > > to operate on the contiguous seccomp_data+EA struct, the filter will > > > need to know how large seccomp_data is (more on this later), and how > > > large the EA struct is. When the filter is written in userspace, it can > > > do the math, point into the expected offsets, and get what it needs. For > > > this to work correctly in the kernel, though, the seccomp BPF verifier > > > needs to know the size of the EA struct as well, so it can correctly > > > perform the offset checking (as it currently does for just the > > > seccomp_data struct size). > > > > > > Since there is not really any caller-based "seccomp state" associated > > > across seccomp(2) calls, I don't think we can add a new command to tell > > > the kernel "I'm expecting the EA struct size to be $foo bytes", since > > > the kernel doesn't track who "I" is besides just being "current", which > > > doesn't take into account the thread lifetime -- if a process launcher > > > knows about one size and the child knows about another, things will get > > > confused. The sizes really are just associated with individual filters, > > > based on the syscalls they're examining. So, I have thoughts on possible > > > solutions: > > > > > > - create a new seccomp command SECCOMP_SET_MODE_FILTER2 which uses the > > > EA style so we can pass in more than a filter and include also an > > > array of syscall to size mappings. (I don't like this...) > > > - create a new filter flag, SECCOMP_FILTER_FLAG_EXTENSIBLE, which changes > > > the meaning of the uarg from "filter" to a EA-style structure with > > > sizes and pointers to the filter and an array of syscall to size > > > mappings. (I like this slightly better, but I still don't like it.) > > > - leverage the EA design and just accept anything <= PAGE_SIZE, record > > > the "max offset" value seen during filter verification, and zero-fill > > > the EA struct with zeros to that size when constructing the > > > seccomp_data + EA struct that the filter will examine. Then the seccomp > > > filter doesn't care what any of the sizes are, and userspace doesn't > > > care what any of the sizes are. (I like this as it makes the problems > > > to solve contained entirely by the seccomp infrastructure and does not > > > touch user API, but I worry I'm missing some gotcha I haven't > > > considered.) > > > > We don't need to actually zero-fill memory for this beyond what the > > kernel supports - AFAIK the existing APIs already say that passing a > > short length is equivalent to passing zeroes, so we can just replace > > all out-of-bounds loads with zeroing registers in the filter. > > The tricky case is what should happen if the userspace program passes > > in fields that the filter doesn't know about. The filter can see the > > length field passed in by userspace, and then just reject anything > > where the length field is bigger than the structure size the filter > > knows about. But maybe it'd be slightly nicer if there was an > > operation for "tell me whether everything starting at offset X is > > zeroes", so that if someone compiles with newer kernel headers where > > the struct is bigger, and leaves the new fields zeroed, the syscalls > > still go through an outdated filter properly. > > I think the best way of handling this (without breaking programs > senselessly) is to have filters essentially emulate > copy_struct_from_user() semantics -- which is along the lines of what > you've suggested. and cpbf load instruction will become copy_from_user() underneath? I don't see how that can work. Have you considered implications to jits, register usage, etc ? ebpf will become sleepable soon. It will be able to do copy_from_user() and examine any level of user pointer dereference. toctou is still going to be a concern though, but such ebpf+copy_from_user analysis and syscall sandboxing will not need to change kernel code base around syscalls at all. No need to invent E-syscalls and all the rest I've seen in this thread. _______________________________________________ Containers mailing list Containers@xxxxxxxxxxxxxxxxxxxxxxxxxx https://lists.linuxfoundation.org/mailman/listinfo/containers