On Thu, Jul 16, 2020 at 03:14:04PM +0200, Stefano Garzarella wrote:
> On Wed, Jul 15, 2020 at 04:07:00PM -0700, Kees Cook wrote:
> [...]
>
> > Speaking to Stefano's proposal[1]:
> >
> > - There appear to be three classes of desired restrictions:
> >   - opcodes for io_uring_register() (which can be enforced entirely with
> >     seccomp right now).
> >   - opcodes from SQEs (this _could_ be intercepted by seccomp, but is
> >     not currently written)
> >   - opcodes of the types of restrictions to restrict... for making sure
> >     things can't be changed after being set? seccomp already enforces
> >     that kind of "can only be made stricter"
>
> In addition we want to limit the SQEs to use only the registered fd and buffers.

Hmm, good point. Yeah, since it's an "extra" mapping (ioring file number
vs fd number) this doesn't really map well to seccomp. (And frankly,
there's some difficulty here mapping many of the io_uring syscalls to
seccomp because it's happening "deeper" than the syscall layer, i.e.
some of the arguments have already been resolved into kernel object
pointers, etc.)

> Do you think it's better to have everything in seccomp instead of adding
> the restrictions in io_uring (the patch isn't very big)?

I'm still trying to understand how io_uring will be used, and it seems
odd to me that it's effectively a seccomp bypass. (Though from what I
can tell it is not an LSM bypass, which is good -- though I'm worried
there might be some embedded assumptions in LSMs about creds vs current,
and LSMs may try to reason (or report) on actions with the kthread in
mind, but afaict everything important is checked against creds.)

> With seccomp, would it be possible to have different restrictions for two
> instances of io_uring in the same process?

For me, this is the most compelling reason to have the restrictions NOT
implemented via seccomp. Trying to make a "which instance" choice in
seccomp would be extremely clumsy.

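As a rough illustration of the first class above (io_uring_register()
opcodes), a seccomp filter along these lines could do the job today.
This is only a sketch under assumptions that are not from the proposal:
x86-64 only, an arbitrary allowlist of two opcodes
(IORING_REGISTER_BUFFERS and IORING_REGISTER_FILES, values copied
locally from the io_uring UAPI header), EPERM for everything else, and
made-up names uring_register_filter and install_uring_register_filter().

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <linux/audit.h>
#include <linux/filter.h>
#include <linux/seccomp.h>
#include <sys/prctl.h>
#include <sys/syscall.h>

#ifndef __NR_io_uring_register
#define __NR_io_uring_register 427 /* x86-64 value, for older headers */
#endif

/* Opcode values as in <linux/io_uring.h>; redefined here so the sketch
 * does not depend on that header being installed. */
#define OP_REGISTER_BUFFERS 0 /* IORING_REGISTER_BUFFERS */
#define OP_REGISTER_FILES   2 /* IORING_REGISTER_FILES */

/* Assumed policy: io_uring_register() may only use the two opcodes
 * above; every other syscall is left alone. */
struct sock_filter uring_register_filter[] = {
	/* [0] load the architecture token */
	BPF_STMT(BPF_LD | BPF_W | BPF_ABS, offsetof(struct seccomp_data, arch)),
	/* [1] x86-64: continue at [3]; anything else: fall into [2] */
	BPF_JUMP(BPF_JMP | BPF_JEQ | BPF_K, AUDIT_ARCH_X86_64, 1, 0),
	/* [2] unexpected architecture */
	BPF_STMT(BPF_RET | BPF_K, SECCOMP_RET_KILL_PROCESS),
	/* [3] load the syscall number */
	BPF_STMT(BPF_LD | BPF_W | BPF_ABS, offsetof(struct seccomp_data, nr)),
	/* [4] not io_uring_register(): jump to the allow at [9] */
	BPF_JUMP(BPF_JMP | BPF_JEQ | BPF_K, __NR_io_uring_register, 0, 4),
	/* [5] load the low 32 bits of args[1], the register opcode
	 *     (little-endian layout, true for x86-64) */
	BPF_STMT(BPF_LD | BPF_W | BPF_ABS, offsetof(struct seccomp_data, args[1])),
	/* [6] allowlisted opcode: jump to [9] */
	BPF_JUMP(BPF_JMP | BPF_JEQ | BPF_K, OP_REGISTER_BUFFERS, 2, 0),
	/* [7] allowlisted opcode: jump to [9]; otherwise fall into [8] */
	BPF_JUMP(BPF_JMP | BPF_JEQ | BPF_K, OP_REGISTER_FILES, 1, 0),
	/* [8] any other register opcode fails with EPERM */
	BPF_STMT(BPF_RET | BPF_K, SECCOMP_RET_ERRNO | (EPERM & SECCOMP_RET_DATA)),
	/* [9] allow */
	BPF_STMT(BPF_RET | BPF_K, SECCOMP_RET_ALLOW),
};

/* Attach the filter to the calling task. Returns 0 on success, -1 on
 * failure with errno set by prctl(). */
int install_uring_register_filter(void)
{
	struct sock_fprog prog = {
		.len = sizeof(uring_register_filter) /
		       sizeof(uring_register_filter[0]),
		.filter = uring_register_filter,
	};

	assert(prog.len <= BPF_MAXINSNS);

	/* Required for unprivileged tasks before installing a filter. */
	if (prctl(PR_SET_NO_NEW_PRIVS, 1, 0, 0, 0))
		return -1;
	return prctl(PR_SET_SECCOMP, SECCOMP_MODE_FILTER, &prog);
}
```

Note that the filter is attached to the task, and the only per-ring
information it sees is the fd number in args[0], which is why a "which
instance" policy does not fit well here.
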
So at this point, I think it makes sense for the restriction series to
carry on -- it is io_uring-specific and solves some problems that
seccomp is not in a good position to reason about.

All this said, I'd still like a way to apply seccomp to io_uring because
it's a rather giant syscall filter bypass mechanism, and gaining access
(IIUC) is possible without actually calling any of the io_uring
syscalls. Is that correct? A process would receive an fd (via
SCM_RIGHTS, pidfd_getfd, or soon seccomp addfd), and then call mmap() on
it to gain access to the SQ and CQ, and off it goes? (The only glitch I
see is waking up the worker thread?)

What appears to be the worst bit about adding seccomp to io_uring is the
almost complete disassociation of process hierarchy from syscall action.
Only a cred is used for io_uring, and seccomp filters are associated
with task structs. I'm not sure if there is a way to solve this
disconnect without a major internal refactoring of seccomp to attach to
creds and then make every filter attachment create a new cred...
*head explody*

-- 
Kees Cook