[widening CC to the same scope as the code patches] Hello Tycho, Sorry for the late response on this. Could you amend/rebase your patches in light of the below please. And anyone's comment on my "big picture" text at the foot of this mail would be very welcome. On 12/13/18 1:11 AM, Tycho Andersen wrote: > Here's some documentation on how to use the userspace notification. I tried > to mention the biggest TOCTOU gotcha, but the pointer to the sample is > really the best thing, Cough! No, the "best thing" really is a good sample program *and* really good documentation :-). > as it has actual code and many comments about what's > going on. > > Signed-off-by: Tycho Andersen <tycho@xxxxxxxx> > CC: Kees Cook <keescook@xxxxxxxxxxxx> > --- > man2/seccomp.2 | 85 +++++++++++++++++++++++++++++++++++++++++++++++++- > 1 file changed, 84 insertions(+), 1 deletion(-) > > diff --git a/man2/seccomp.2 b/man2/seccomp.2 > index d69187783..fef8914bf 100644 > --- a/man2/seccomp.2 > +++ b/man2/seccomp.2 > @@ -228,6 +228,13 @@ An administrator may override this filter flag by preventing specific > actions from being logged via the > .IR /proc/sys/kernel/seccomp/actions_logged > file. > +.TP > +.BR SECCOMP_FILTER_FLAG_NEW_LISTENER " (since Linux 4.21)" > +With this flag, a new userspace notification fd is returned on success. When > +this filter returns > +.BR SECCOMP_RET_USER_NOTIF > +a notification will be sent to this fd. See "Userspace Notification" below for s/fd/file descriptor/ throughout please. > +more details. I think the description here could be better worded as something like: SECCOMP_FILTER_FLAG_NEW_LISTENER Register a new filter, as usual, but on success return a new file descriptor that provides user-space notifications. When the filter returns SECCOMP_RET_USER_NOTIF, a notification will be provided via this file descriptor. The close-on-exec flag is automatically set on the new file descriptor. ... > .RE > .TP > .BR SECCOMP_GET_ACTION_AVAIL " (since Linux 4.14)" > @@ -606,6 +613,17 @@ file. > .TP > .BR SECCOMP_RET_ALLOW > This value results in the system call being executed. > +.TP > +.BR SECCOMP_RET_USER_NOTIF " (since Linux 4.21)" Please see the start of this hanging list in the manual page. Can you confirm that SECCOMP_RET_USER_NOTIF really is the lowest in the precedence order of all of the filter return values? > +Forwards the syscall to an attached listener in userspace to allow userspace to s/syscall/system call throughout please. > +decide what to do with the syscall. If there is no attached listener (either > +because the filter was not installed with the > +.BR SECCOMP_FILTER_FLAG_NEW_LISTENER > +or because the fd was closed), the filter returns > +.BR ENOSYS > +similar to what happens when a filter returns > +.BR SECCOMP_RET_TRACE > +and there is no tracer. See "Userspace Notification" below for more details. > .PP > If an action value other than one of the above is specified, > then the filter action is treated as either > @@ -693,10 +711,75 @@ Otherwise, if kernel auditing is enabled and the process is being audited > the action is logged. > .IP * > Otherwise, the action is not logged. > +.SS Userspace Notification > +Interactin userspace notification functionality in seccomp is primarily done > +via file descriptor. That sentence is somewhat garbled. Even if I correct the typo, I still don't really understand it. Could you try again? (Or, take a look at my "big picture" text at the foot of this mail.) > This file descriptor can be obtained by passing > +.BR SECCOMP_FILTER_FLAG_NEW_LISTENER > +as a filter flag when installing a new filter. > +.PP > +Once an fd is obtained, userspace can wait for events using > +.BR poll () and presumably select() and epoll? What kind of notification event do poll(2)/epoll(7)/select(2) provide? It looks to be POLLIN/EPOLLIN/readable. Is that correct? These details should be noted here. More generally, my assumption is that you can use poll(2)/epoll(7)/select(2) to find out about the availability of an event, and then use SECCOMP_IOCTL_NOTIF_RECV to read that event. Correct? The text needs to be more explicit on this. > +or > +.BR ioctl (). > +The supported > +.BR ioctl () > +operations on a notification fd are: > +.TP > +.BR SECCOMP_IOCTL_NOTIF_RECV > +The argument to this command should be a pointer to a struct seccomp_notif: > +.IP > +.in +4n > +.EX > +struct seccomp_notif { > + __u64 id; > + __u32 pid; > + __u32 flags; > + struct seccomp_data data; > +}; > +.EE > +.in > +.IP > +The id field is a filter-unique id for this syscall, and should be supplied in > +the response. It can additionally be used in > +.BR SECCOMP_IOCTL_ID_VALID > +to test whether or not the request is still alive. The pid here is the pid of > +the task as visible from the listener's pid namespace. If the pid is not > +visible, it is 0. Flags is unused right now. So, is 'flags' explicitly zeroed by the kernel? the manual page should note this. > struct seccomp_data is the same > +data that would be passed to a filter running in the kernel. What are the semantics if multiple monitoring processes are employing SECCOMP_IOCTL_NOTIF_RECV? Does only one of them get awoken? (Which one?) Or do they all get woken up and get a 'struct seccomp_notif'? The semantics should be detailed here. > +.TP > +.BR SECCOMP_IOCTL_NOTIF_SEND > +The argument to this command should be a pointer to a struct seccomp_notif_resp: > +.IP > +.in +4n > +.EX > +struct seccomp_notif_resp { > + __u64 id; > + __s64 val; > + __s32 error; > + __u32 flags; > +}; > +.EE > +.in > +.IP > +The id should be the id from struct seccomp_notif; if error is non-zero, it is > +used as the return value, otherwise val is. Flags must be 0 right now. There really isn't enough detail here to help the reader understand. What is the purpose of 'error' vs 'val'? In particular, I'm assuming that the monitoring process has (at least) two choices with respect to the system call being made by the target process: (1) Cause the system call to fail with an error. (2) Allow the system call to proceed. How are 'error' and 'val' used in these two cases? Are there more than these two cases? How else is 'val' used? Also, what happens if the monitoring process does a SECCOMP_IOCTL_NOTIF_RECV, but does not subsequently do a SECCOMP_IOCTL_NOTIF_SEND? This should be detailed in the manual page. What are the semantics if multiple monitoring processes are employing SECCOMP_IOCTL_NOTIF_SEND? An error for all but the first? The semantics should be detailed here. > +.TP > +.BR SECCOMP_IOCTL_NOTIF_ID_VALID > +The argument to this command is just the id to be tested. There is a pid reuse > +issue where a task may make a syscall, die, its pid get re-used, and the > +listener may operate on the wrong pid. So the process for handling modifying a > +pid's state in some way would be to open the pid's resources (/proc/pid/mem or > +similar), do this call to verify that the id is still valid, and then use the > +resource. See the sample below for more detail. > +.PP > +A complete example is available in the kernel tree at > +.IR samples/seccomp/user-trap.c . > .SH RETURN VALUE > On success, > .BR seccomp () > -returns 0. > +returns 0, unless > +.BR SECCOMP_FILTER_FLAG_NEW_LISTENER > +was specified, in which case it returns the fd number for the new listener fd. > On error, if > .BR SECCOMP_FILTER_FLAG_TSYNC > was used, More generally though, I'm struggling with this patch because "the big picture" is lacking. Here's my estimate of what I think that picture is: [[ Instead of having the seccomp() filter decision being made within the filter itself, it is possible to pass off the decision making to a (different) user-space process that makes the decision. This is done using the following steps. 1. The "target process" (i.e., process that will establish a seccomp filter) and the process that will make decisions about system calls ("the monitoring process") establish a connection using a UNIX domain socket. This socket will subsequently be used to exchange a file descriptor. 2. The "target process" establishes a seccomp BPF filter in the usual manner, but with two notable differences: * The seccomp() 'flags' argument includes the flag SECCOMP_FILTER_FLAG_NEW_LISTENER. Consequently, the return value of a successful seccomp() call is a new "listening" file descriptor that can be used for monitoring. * In cases where it is needed, the BPF filter returns the special action SECCOMP_RET_USER_NOTIF. This return value will trigger a notification event. 3. The "target process" passes the "listening file descriptor" to the "monitoring process" via the UNIX domain socket. 4. The target process then performs its workload, which includes system calls that will be controlled by the BPF filter. Whenever one of these system calls causes the BPF filter to return SECCOMP_RET_USER_NOTIF, a notification event is generated on the listening file descriptor. 5. The monitoring process will receive notification events on the listening file descriptor. These events are returned as structures of type 'seccomp_notif'. Because this structure and its size may evolve over kernel version, the monitoring process must first determine the size of this structure using the seccomp() SECCOMP_GET_NOTIF_SIZES operation, which returns a structure of type 'seccomp_notif_sizes'. The monitoring process allocates a buffer of size 'seccomp_notif_sizes.seccomp_notif' bytes to receive monitoring events. In addition,the monitoring process allocates another buffer of size 'seccomp_notif_sizes.seccomp_notif_resp' bytes for the response (a 'struct seccomp_notif_resp') that it will provide to the kernel in order to advise how the system call being made by the target process shall be treated. 6. The monitoring process can now repeatedly monitor the listening file descriptor. When a SECCOMP_RET_USER_NOTIF-triggered event occurs, the file descriptor will test as readable for poll(2)/epoll(7)/select(2). The call ioctl(listenfd, SECCOMP_IOCTL_NOTIF_RECV, reqptr) can be used to read info on the event; this operation blocks until an event is available. It populates the 'struct seccomp_notif' pointed to by the third argument with information about the system call that is being attempted by the target process. 7. The monitoring process can use the information in the 'struct seccomp_notif' to make a determination about the system call being made by the target process. This structure includes a 'data' field that is the same 'struct seccomp_data' that is passed to a BPF filter. In addition, the monitoring process may make use of other information that is available from user space. For example, it may inspect the memory of the target process (whose PID is provided in the 'struct seccomp_notif') using /proc/PID/mem, which includes inspecting the values pointed to by system call arguments (whose location is available 'seccomp_notif.data.args). However, when using the target process PID in this way, one must guard against PID re-use race conditions using the seccomp() SECCOMP_IOCTL_NOTIF_ID_VALID operation. 8. Having arrived at a decision about the target process's system call, the monitoring process can inform the kernel of its decision using the operation ioctl(listenfd, SECCOMP_IOCTL_NOTIF_SEND, respptr) where the third argument is a pointer to a 'struct seccomp_notif_resp'. [Some more details needed here, but I still don't yet understand fully the semantics of the 'error' and 'val' fields.] I'd like to include text along the above lines at the start of the "Userspace Notification" section. Could you take a look at the above and let me know any errors, improved terminology, or other improvements generally. I'll then work that text into the page. Thanks, Michael -- Michael Kerrisk Linux man-pages maintainer; http://www.kernel.org/doc/man-pages/ Linux/UNIX System Programming Training: http://man7.org/training/ _______________________________________________ Containers mailing list Containers@xxxxxxxxxxxxxxxxxxxxxxxxxx https://lists.linuxfoundation.org/mailman/listinfo/containers