Earlier Andy Lutomirski wrote:
> Let’s add some seccomp folks. We probably also want to be able to run
> seccomp-like filters on io_uring requests. So maybe io_uring should call into
> seccomp-and-tracing code for each action.

Okay, I'm finally able to spend time looking at this. And thank you to
the many people that CCed me on this and earlier discussions (at least
Jann, Christian, and Andy).

It *seems* like there is a really clean mapping of SQE OPs to syscalls.
To that end, yes, it should be trivial to add ptrace and seccomp support
(sort of). The trouble comes with doing _interception_, which is how both
ptrace and seccomp are designed.

In the basic case of seccomp, various syscalls are just being checked
for accept/reject. It seems like that would be easy to wire up. For the
more ptrace-y things (SECCOMP_RET_TRAP, SECCOMP_RET_USER_NOTIF, etc),
I think any such results would need to be "upgraded" to "reject". Things
are a bit complex in that seccomp's form of "reject" can be "return
errno" (easy) or it can be "kill thread (or thread_group)" which ...
becomes less clear. (More on this later.)

In the basic case of "I want to run strace", this is really just a
creative use of ptrace in that interception is being used only for
reporting. Does ptrace need to grow a way to create/attach an io_uring
eventfd? Or should there be an entirely different tool for
administrative analysis of io_uring events (kind of how disk IO can be
monitored)?

For io_uring generally, I have a few comments/questions:

- Why did a new syscall get added that couldn't be extended? All new
  syscalls should be using Extended Arguments. :(

- Why aren't the io_uring syscalls in the man-pages git? (It seems like
  they're in liburing, but that should document the _library_, not the
  syscalls, yes?)

Speaking to Stefano's proposal[1]:

- There appear to be three classes of desired restrictions:
  - opcodes for io_uring_register() (which can be enforced entirely with
    seccomp right now).
  - opcodes from SQEs (this _could_ be intercepted by seccomp, but is
    not currently written)
  - opcodes of the types of restrictions to restrict... for making sure
    things can't be changed after being set? seccomp already enforces
    that kind of "can only be made stricter"

- Credentials vs no_new_privs needs examination (more on this later)

So, I think, at least for restrictions, seccomp should absolutely be
the place to get this work done. It already covers 2 of the 3 points in
the proposal.

Solving the mapping of seccomp interception types into CQEs (or anything
more severe) will likely inform what it would mean to map ptrace events
to CQEs. So, I think they're related, and we should get seccomp hooked
up right away, and that might help us see how (if) ptrace should be
attached.

Specifically for seccomp, I see at least the following design questions:

- How does no_new_privs play a role in the existing io_uring credential
  management? Using _any_ kind of syscall-effective filtering, whether
  it's seccomp or Stefano's existing proposal, needs to address the
  potential inheritable restrictions across privilege boundaries (which
  is what no_new_privs tries to eliminate). In regular syscall land, this
  is an issue when a filter follows a process through setuid via execve()
  and it gains privileges that the filter-creator can now trick into
  doing weird stuff -- io_uring has a concept of alternative credentials
  so I have to ask about it. (I don't *think* there would be a path to
  install a filter before gaining privilege, but I likely just need to do
  my homework on the io_uring internals.) Regardless, use of seccomp by
  io_uring would need to have this issue "solved" in the sense that it
  must be "safe" to filter io_uring OPs, from a
  privilege-boundary-crossing perspective.

- From which task perspective should filters be applied? It seems like it
  needs to follow the io_uring personalities, as that contains the
  credentials.
  (This email is a brain-dump so far -- I haven't gone to look to see if
  that means io_uring is literally getting a reference to struct cred; I
  assume so.) Seccomp filters are attached to task_struct. However, for
  v5.9, seccomp will gain a more generalized get/put system for having
  filters attached to the SECCOMP_RET_USER_NOTIF fd. Adding more
  get/put-ers for some part of the io_uring context shouldn't be hard.

- How should seccomp return values be applied? Three seem okay:
    SECCOMP_RET_ALLOW: do SQE action normally
    SECCOMP_RET_LOG: do SQE action, log via seccomp
    SECCOMP_RET_ERRNO: skip actions in SQE and pass errno to CQE
  The rest not so much:
    SECCOMP_RET_TRAP: can't send SIGSYS anywhere sane?
    SECCOMP_RET_TRACE: no tracer, can't send SIGSYS?
    SECCOMP_RET_USER_NOTIF: can't do user_notif rewrites?
    SECCOMP_RET_KILL_THREAD: kill which thread?
    SECCOMP_RET_KILL_PROCESS: kill which thread group?
  If TRAP, TRACE, and USER_NOTIF need to be "upgraded" to KILL_THREAD,
  what does KILL_THREAD mean? Does it really mean "shut down the entire
  SQ?" Does it mean kill the worker thread? Does KILL_PROCESS mean kill
  all the tasks with an open mapping for the SQ?

Anyway, I'd love to hear what folks think, but given the very direct
mapping from SQE OPs to syscalls, I really think seccomp needs to be
inserted in here somewhere to maintain any kind of sensible reasoning
about syscall filtering.

-Kees

[1] https://lore.kernel.org/lkml/20200710141945.129329-3-sgarzare@xxxxxxxxxx/

-- 
Kees Cook
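
A minimal sketch of the return-value mapping from the design questions
above, written as userspace C rather than kernel code. The names
sqe_disposition, sqe_verdict, and sqe_apply_seccomp_ret are hypothetical
and do not exist in io_uring or seccomp; only the SECCOMP_RET_* constants
come from <linux/seccomp.h>. It assumes a rejected SQE would report its
error as a negative errno in the CQE result, and it deliberately leaves
TRAP, TRACE, USER_NOTIF, and both KILL actions undecided instead of
guessing at what "kill" should mean for a ring.

```c
#include <assert.h>
#include <errno.h>
#include <stdint.h>
#include <linux/seccomp.h>

/* Hypothetical outcomes for one SQE after a seccomp filter has run. */
enum sqe_disposition {
	SQE_RUN,        /* do the SQE action normally */
	SQE_RUN_LOGGED, /* do the SQE action and log via seccomp */
	SQE_FAIL,       /* skip the SQE action, report errno in the CQE */
	SQE_UNDECIDED,  /* no clear io_uring meaning for this action yet */
};

/* Hypothetical pairing of the outcome with the CQE result value. */
struct sqe_verdict {
	enum sqe_disposition disp;
	int32_t cqe_res; /* assumption: negative errno, 0 otherwise */
};

/* Map a raw seccomp filter return value onto an SQE outcome. */
static inline struct sqe_verdict sqe_apply_seccomp_ret(uint32_t ret)
{
	struct sqe_verdict v = { SQE_UNDECIDED, 0 };

	switch (ret & SECCOMP_RET_ACTION_FULL) {
	case SECCOMP_RET_ALLOW:
		v.disp = SQE_RUN;
		break;
	case SECCOMP_RET_LOG:
		v.disp = SQE_RUN_LOGGED;
		break;
	case SECCOMP_RET_ERRNO:
		v.disp = SQE_FAIL;
		/* The low 16 bits of the return value carry the errno. */
		v.cqe_res = -(int32_t)(ret & SECCOMP_RET_DATA);
		break;
	default:
		/* TRAP, TRACE, USER_NOTIF, KILL_THREAD, KILL_PROCESS:
		 * the open questions; left undecided in this sketch. */
		break;
	}
	return v;
}

/* Usage example: the three "okay" cases plus one undecided case. */
static inline void sqe_verdict_selfcheck(void)
{
	assert(sqe_apply_seccomp_ret(SECCOMP_RET_ALLOW).disp == SQE_RUN);
	assert(sqe_apply_seccomp_ret(SECCOMP_RET_LOG).disp == SQE_RUN_LOGGED);
	assert(sqe_apply_seccomp_ret(SECCOMP_RET_ERRNO | EPERM).cqe_res == -EPERM);
	assert(sqe_apply_seccomp_ret(SECCOMP_RET_TRAP).disp == SQE_UNDECIDED);
}
```

Keeping an explicit "undecided" outcome, rather than folding the
remaining actions into a kill, keeps the sketch from prejudging the
answer to the KILL_THREAD/KILL_PROCESS questions.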