[trying to get back to this thread -- I've been distracted] On Tue, May 19, 2020 at 12:39:39AM +0200, Jann Horn wrote: > On Mon, May 18, 2020 at 11:05 PM Kees Cook <keescook@xxxxxxxxxxxx> wrote: > > ## deep argument inspection > > > > Background: seccomp users would like to write filters that traverse > > the user pointers passed into many syscalls, but seccomp can't do this > > dereference for a variety of reasons (mostly involving race conditions and > > rearchitecting the entire kernel syscall and copy_from_user() code flows). > > Also, other than for syscall entry, it might be worth thinking about > whether we want to have a special hook into seccomp for io_uring. > io_uring is growing support for more and more syscalls, including > things like openat2, connect, sendmsg, splice and so on, and that list > is probably just going to grow in the future. If people start wanting > to use io_uring in software with seccomp filters, it might be > necessary to come up with some mechanism to prevent io_uring from > permitting access to almost everything else... /me perks up. Oh my, I hadn't been paying attention -- I thought this was strictly for I/O ... like it's named. I will go read up. > [...] > > The argument caching bit is, I think, rather mechanical in nature since > > it's all "just" internal to the kernel: seccomp can likely adjust how it > > allocates seccomp_data (maybe going so far as to have it split across two > > pages with the syscall argument struct always starting on the 2nd page > > boundary), and copying the EA struct into that page, which will be both > > used by the filter and by the syscall. > > We could also do the same kind of thing the eBPF verifier does in > convert_ctx_accesses(), and rewrite the context accesses to actually > go through two different pointers depending on the (constant) offset > into seccomp_data. Ah, as in "for seccomp_data accesses, add offset $foo and for EA struct add offset $bar"? Yeah, though my preference is to avoid rewriting the filters as much as possible. But yes, that's a good point about not requiring them be strictly contiguous. > > I imagine state tracking ("is > > there a cached EA?", "what is the address of seccomp_data?", "what is > > the address of the EA?") can be associated with the thread struct. > > You probably mean the task struct? Yup; think-o. > > ## syscall bitmasks > > YES PLEASE I've got a working PoC for this now. It's sneaky. > Other options: > - add a "load from read-only memory" opcode and permit specifying the > data that should be in that memory when loading the filter I think you've mentioned something like this before to me, but can you remind me the details? If you mean RO userspace memory, don't we still run the risk of racing mprotect, etc? > - make the seccomp API take an array of (syscall-number, > instruction-offset) tuples, and start evaluation of the filter at an > offset if it's one of those syscalls To avoid making cBPF changes, yeah, perhaps have a way to add per-syscall filters. > One more thing that would be really nice: Is there a way we can have > 64-bit registers in our seccomp filters? At the moment, every > comparison has to turn into three ALU ops, which is pretty messy; > libseccomp got that wrong (<https://crbug.com/project-zero/1769>), and > it contributes to the horrific code Chrome's BPF generator creates. > Here's some pseudocode from my hacky BPF disassembler, which shows > pretty much just the filter code for filtering getpriority() and > setpriority() in a Chrome renderer, with tons of useless dead code: > > 0139 if args[0].high == 0x00000000: [true +3, false +0] > 013e if args[0].low != 0x00000000: [true +157, false +0] -> ret TRAP > 0140 if args[1].high == 0x00000000: [true +3, false +0] > 0145 if args[1].low == 0x00000000: [true +149, false +0] > -> ret ALLOW (syscalls: getpriority, setpriority) > 0147 if args[1].high == 0x00000000: [true +3, false +0] > 014c if args[1].low == 0x00000001: [true +142, false > +141] -> ret ALLOW (syscalls: getpriority, setpriority) > 01da ret ERRNO > 0148 if args[1].high != 0xffffffff: [true +142, false +0] > -> ret TRAP > 014a if args[1].low NO-COMMON-BITS 0x80000000: [true +140, > false +0] -> ret TRAP > 014c if args[1].low == 0x00000001: [true +142, false +141] > -> ret ALLOW (syscalls: getpriority, setpriority) > 01da ret ERRNO > 0141 if args[1].high != 0xffffffff: [true +149, false +0] -> ret TRAP > 0143 if args[1].low NO-COMMON-BITS 0x80000000: [true +147, > false +0] -> ret TRAP > 0145 if args[1].low == 0x00000000: [true +149, false +0] -> > ret ALLOW (syscalls: getpriority, setpriority) > 0147 if args[1].high == 0x00000000: [true +3, false +0] > 014c if args[1].low == 0x00000001: [true +142, false +141] > -> ret ALLOW (syscalls: getpriority, setpriority) > 01da ret ERRNO > 0148 if args[1].high != 0xffffffff: [true +142, false +0] -> ret TRAP > 014a if args[1].low NO-COMMON-BITS 0x80000000: [true +140, > false +0] -> ret TRAP > 014c if args[1].low == 0x00000001: [true +142, false +141] > -> ret ALLOW (syscalls: getpriority, setpriority) > 01da ret ERRNO > 013a if args[0].high != 0xffffffff: [true +156, false +0] -> ret TRAP > 013c if args[0].low NO-COMMON-BITS 0x80000000: [true +154, > false +0] -> ret TRAP > 013e if args[0].low != 0x00000000: [true +157, false +0] -> ret TRAP > 0140 if args[1].high == 0x00000000: [true +3, false +0] > 0145 if args[1].low == 0x00000000: [true +149, false +0] -> > ret ALLOW (syscalls: getpriority, setpriority) > 0147 if args[1].high == 0x00000000: [true +3, false +0] > 014c if args[1].low == 0x00000001: [true +142, false +141] > -> ret ALLOW (syscalls: getpriority, setpriority) > 01da ret ERRNO > 0148 if args[1].high != 0xffffffff: [true +142, false +0] -> ret TRAP > 014a if args[1].low NO-COMMON-BITS 0x80000000: [true +140, > false +0] -> ret TRAP > 014c if args[1].low == 0x00000001: [true +142, false +141] > -> ret ALLOW (syscalls: getpriority, setpriority) > 01da ret ERRNO > 0141 if args[1].high != 0xffffffff: [true +149, false +0] -> ret TRAP > 0143 if args[1].low NO-COMMON-BITS 0x80000000: [true +147, > false +0] -> ret TRAP > 0145 if args[1].low == 0x00000000: [true +149, false +0] -> > ret ALLOW (syscalls: getpriority, setpriority) > 0147 if args[1].high == 0x00000000: [true +3, false +0] > 014c if args[1].low == 0x00000001: [true +142, false +141] > -> ret ALLOW (syscalls: getpriority, setpriority) > 01da ret ERRNO > 0148 if args[1].high != 0xffffffff: [true +142, false +0] -> ret TRAP > 014a if args[1].low NO-COMMON-BITS 0x80000000: [true +140, > false +0] -> ret TRAP > 014c if args[1].low == 0x00000001: [true +142, false +141] -> > ret ALLOW (syscalls: getpriority, setpriority) > 01da ret ERRNO > > which is generated by this little snippet of C++ code: > > ResultExpr RestrictGetSetpriority(pid_t target_pid) { > const Arg<int> which(0); > const Arg<int> who(1); > return If(which == PRIO_PROCESS, > Switch(who).CASES((0, target_pid), Allow()).Default(Error(EPERM))) > .Else(CrashSIGSYS()); > } What would this look like in eBPF? > On anything other than 32-bit MIPS, 32-bit powerpc and 32-bit sparc, > we're actually already using the eBPF backend infrastructure... so > maybe it would be an option to keep enforcing basically the same rules > that we currently have for cBPF, but use the eBPF instruction format? Yeah, I think this might be a good idea just for the reduction in complexity for these things. The "unpriv BPF" problem needs to be solved for this still, though, yes? -- Kees Cook _______________________________________________ Containers mailing list Containers@xxxxxxxxxxxxxxxxxxxxxxxxxx https://lists.linuxfoundation.org/mailman/listinfo/containers