Hello Christian Thanks for taking a look at the page. On 10/29/20 4:26 PM, Christian Brauner wrote: > On Mon, Oct 26, 2020 at 10:55:04AM +0100, Michael Kerrisk (man-pages) wrote: >> Hi all (and especially Tycho and Sargun), >> >> Following review comments on the first draft (thanks to Jann, Kees, >> Christian and Tycho), I've made a lot of changes to this page. >> I've also added a few FIXMEs relating to outstanding API issues. >> I'd like a second pass review of the page before I release it. >> But also, this mail serves as a way of noting the outstanding API >> issues. >> >> Tycho: I still have an outstanding question for you at [2]. >> >> Sargun: can you please prepare something on SECCOMP_ADDFD_FLAG_SETFD >> and SECCOMP_IOCTL_NOTIF_ADDFD to be added to this page? >> >> I've shown the rendered version of the page below. The page source >> currently sits in a branch at >> https://git.kernel.org/pub/scm/docs/man-pages/man-pages.git/log/?h=seccomp_user_notif >> >> At this point, I'm mainly interested in feedback about the FIXMEs, >> some of which relate to the text of the page itself, while the >> others relate to the various outstanding API issues. The first >> FIXME provides a small opportunity for some bikeshedding :-); > > I like this manpage. I think this is the most comprehensive explanation > of any seccomp feature Thanks (at least, I think so...) > and somewhat understandable. ^^^^^^^^ (... but I'm not sure ;-).) > Just tiny comments below, hopefully no bike-shedding though. :) Most relevant point for bikeshedding is the page name. I plan to change it to seccomp_unotify(2) (shorter, reads better out loud). >> Thanks, >> >> Michael >> >> [1] https://lore.kernel.org/linux-man/45f07f17-18b6-d187-0914-6f341fe90857@xxxxxxxxx/ >> [2] https://lore.kernel.org/linux-man/8f20d586-9609-ef83-c85a-272e37e684d8@xxxxxxxxx/ >> >> ===== >> >> SECCOMP_USER_NOTIF(2) Linux Programmer's Manual SECCOMP_USER_NOTIF(2) [...] >> An overview of the steps performed by the target and the >> supervisor is as follows: >> >> 1. The target establishes a seccomp filter in the usual manner, >> but with two differences: >> >> · The seccomp(2) flags argument includes the flag >> SECCOMP_FILTER_FLAG_NEW_LISTENER. Consequently, the return >> value of the (successful) seccomp(2) call is a new >> "listening" file descriptor that can be used to receive >> notifications. Only one "listening" seccomp filter can be >> installed for a thread. >> >> ┌─────────────────────────────────────────────────────┐ >> │FIXME │ >> ├─────────────────────────────────────────────────────┤ >> │Is the last sentence above correct? │ >> │ │ >> │Kees Cook (25 Oct 2020) notes: │ >> │ │ >> │I like this limitation, but I expect that it'll need │ >> │to change in the future. Even with LSMs, we see the │ >> │need for arbitrary stacking, and the idea of there │ >> │being only 1 supervisor will eventually break down. │ >> │Right now there is only 1 because only container │ >> │managers are using this feature. But if some daemon │ >> │starts using it to isolate some thread, suddenly it │ >> │might break if a container manager is trying to │ >> │listen to it too, etc. I expect it won't be needed │ >> │soon, but I do think it'll change. │ >> │ │ >> └─────────────────────────────────────────────────────┘ >> >> · In cases where it is appropriate, the seccomp filter returns >> the action value SECCOMP_RET_USER_NOTIF. This return value >> will trigger a notification event. >> >> 2. In order that the supervisor can obtain notifications using >> the listening file descriptor, (a duplicate of) that file >> descriptor must be passed from the target to the supervisor. >> One way in which this could be done is by passing the file >> descriptor over a UNIX domain socket connection between the >> target and the supervisor (using the SCM_RIGHTS ancillary >> message type described in unix(7)). > > Fwiw, on newer kernels you could also use pidfd_getfd() for that. Thanks. I added that to the text. >> 3. The supervisor will receive notification events on the >> listening file descriptor. These events are returned as >> structures of type seccomp_notif. Because this structure and >> its size may evolve over kernel versions, the supervisor must >> first determine the size of this structure using the >> seccomp(2) SECCOMP_GET_NOTIF_SIZES operation, which returns a >> structure of type seccomp_notif_sizes. The supervisor >> allocates a buffer of size seccomp_notif_sizes.seccomp_notif >> bytes to receive notification events. In addition,the >> supervisor allocates another buffer of size >> seccomp_notif_sizes.seccomp_notif_resp bytes for the response >> (a struct seccomp_notif_resp structure) that it will provide >> to the kernel (and thus the target). >> >> 4. The target then performs its workload, which includes system >> calls that will be controlled by the seccomp filter. Whenever >> one of these system calls causes the filter to return the >> SECCOMP_RET_USER_NOTIF action value, the kernel does not (yet) >> execute the system call; instead, execution of the target is >> temporarily blocked inside the kernel (in a sleep state that >> is interruptible by signals) and a notification event is >> generated on the listening file descriptor. >> >> 5. The supervisor can now repeatedly monitor the listening file >> descriptor for SECCOMP_RET_USER_NOTIF-triggered events. To do >> this, the supervisor uses the SECCOMP_IOCTL_NOTIF_RECV >> ioctl(2) operation to read information about a notification >> event; this operation blocks until an event is available. The > > Maybe mention that users can choose to either use the blocking ioctl() > directly or use poll semantics and point to the section below. Thanks. I added mention of the poll/select/epoll here. > (Do we support O_NONBLOCK with SECCOMP_IOCTL_NOTIF_RECV and if not should > we?) A quick test suggests that O_NONBLOCK has no effect on the blocking behavior of SECCOMP_IOCTL_NOTIF_RECV. (I've added your question and this info as a FIXME in the page.) >> operation returns a seccomp_notif structure containing >> information about the system call that is being attempted by >> the target. >> >> 6. The seccomp_notif structure returned by the >> SECCOMP_IOCTL_NOTIF_RECV operation includes the same >> information (a seccomp_data structure) that was passed to the >> seccomp filter. This information allows the supervisor to >> discover the system call number and the arguments for the >> target's system call. In addition, the notification event >> contains the ID of the thread that triggered the notification >> and a unique cookie value that is used in subsequent >> SECCOMP_IOCTL_NOTIF_ID_VALID and SECCOMP_IOCTL_NOTIF_SEND >> operations. >> >> The information in the notification can be used to discover >> the values of pointer arguments for the target's system call. >> (This is something that can't be done from within a seccomp >> filter.) One way in which the supervisor can do this is to >> open the corresponding /proc/[tid]/mem file (see proc(5)) and >> read bytes from the location that corresponds to one of the >> pointer arguments whose value is supplied in the notification >> event. (The supervisor must be careful to avoid a race >> condition that can occur when doing this; see the description >> of the SECCOMP_IOCTL_NOTIF_ID_VALID ioctl(2) operation below.) >> In addition, the supervisor can access other system >> information that is visible in user space but which is not >> accessible from a seccomp filter. >> >> 7. Having obtained information as per the previous step, the >> supervisor may then choose to perform an action in response to >> the target's system call (which, as noted above, is not >> executed when the seccomp filter returns the >> SECCOMP_RET_USER_NOTIF action value). >> >> One example use case here relates to containers. The target >> may be located inside a container where it does not have >> sufficient capabilities to mount a filesystem in the >> container's mount namespace. However, the supervisor may be a >> more privileged process that does have sufficient capabilities >> to perform the mount operation. >> >> 8. The supervisor then sends a response to the notification. The >> information in this response is used by the kernel to >> construct a return value for the target's system call and >> provide a value that will be assigned to the errno variable of >> the target. >> >> The response is sent using the SECCOMP_IOCTL_NOTIF_SEND >> ioctl(2) operation, which is used to transmit a >> seccomp_notif_resp structure to the kernel. This structure >> includes a cookie value that the supervisor obtained in the >> seccomp_notif structure returned by the >> SECCOMP_IOCTL_NOTIF_RECV operation. This cookie value allows >> the kernel to associate the response with the target. This >> structure must include the cookie value that the supervisor >> obtained in the seccomp_notif structure returned by the >> SECCOMP_IOCTL_NOTIF_RECV operation; the cookie allows the >> kernel to associate the response with the target. >> >> 9. Once the notification has been sent, the system call in the >> target thread unblocks, returning the information that was >> provided by the supervisor in the notification response. >> >> As a variation on the last two steps, the supervisor can send a >> response that tells the kernel that it should execute the target >> thread's system call; see the discussion of >> SECCOMP_USER_NOTIF_FLAG_CONTINUE, below. >> >> ioctl(2) operations >> The following ioctl(2) operations are provided to support seccomp >> user-space notification. For each of these operations, the first > > Hm, since the ioctls() are associatd with the seccomp notify file > descriptor maybe we should rephrase this a bit to make this more > obvious: > "[...] ioctl(2) operations are supported by the seccomp user-space file descriptor" > That might line-uper better with the following sentence. Just a thought, > feel free to ignore. Yep, your idea is better. I changed the text. >> (file descriptor) argument of ioctl(2) is the listening file >> descriptor returned by a call to seccomp(2) with the >> SECCOMP_FILTER_FLAG_NEW_LISTENER flag. >> >> SECCOMP_IOCTL_NOTIF_RECV >> This operation is used to obtain a user-space notification >> event. If no such event is currently pending, the >> operation blocks until an event occurs. The third >> ioctl(2) argument is a pointer to a structure of the >> following form which contains information about the event. >> This structure must be zeroed out before the call. >> >> struct seccomp_notif { >> __u64 id; /* Cookie */ >> __u32 pid; /* TID of target thread */ >> __u32 flags; /* Currently unused (0) */ >> struct seccomp_data data; /* See seccomp(2) */ >> }; >> >> The fields in this structure are as follows: >> >> id This is a cookie for the notification. Each such >> cookie is guaranteed to be unique for the >> corresponding seccomp filter. >> >> · It can be used with the >> SECCOMP_IOCTL_NOTIF_ID_VALID ioctl(2) operation >> to verify that the target is still alive. >> >> · When returning a notification response to the >> kernel, the supervisor must include the cookie >> value in the seccomp_notif_resp structure that is >> specified as the argument of the >> SECCOMP_IOCTL_NOTIF_SEND operation. >> >> pid This is the thread ID of the target thread that >> triggered the notification event. >> >> flags This is a bit mask of flags providing further >> information on the event. In the current >> implementation, this field is always zero. > > I think we haven't settled whether this is input or output only. I guess > we could technically use it for both. So, change something here in the page? > >> >> data This is a seccomp_data structure containing >> information about the system call that triggered >> the notification. This is the same structure that >> is passed to the seccomp filter. See seccomp(2) >> for details of this structure. >> >> On success, this operation returns 0; on failure, -1 is >> returned, and errno is set to indicate the cause of the >> error. This operation can fail with the following errors: >> >> EINVAL (since Linux 5.5) >> The seccomp_notif structure that was passed to the >> call contained nonzero fields. >> >> ENOENT The target thread was killed by a signal as the >> notification information was being generated, or >> the target's (blocked) system call was interrupted >> by a signal handler. > > (Technically also EFAULT because the user provided a garbage address.) Yeah. But that error is kind of presumed anywhere a pointer is provided. >> ┌─────────────────────────────────────────────────────┐ >> │FIXME │ >> ├─────────────────────────────────────────────────────┤ >> │From my experiments, it appears that if a │ >> │SECCOMP_IOCTL_NOTIF_RECV is done after the target │ >> │thread terminates, then the ioctl() simply blocks │ >> │(rather than returning an error to indicate that the │ >> │target no longer exists). │ >> │ │ >> │I found that surprising, and it required some │ >> │contortions in the example program. It was not │ >> │possible to code my SIGCHLD handler (which reaps the │ >> │zombie when the worker/target terminates) to simply │ >> │set a flag checked in the main handleNotifications() │ >> │loop, since this created an unavoidable race where │ >> │the child might terminate just after I had checked │ >> │the flag, but before I blocked (forever!) in the │ >> │SECCOMP_IOCTL_NOTIF_RECV operation. Instead, I had │ >> │to code the signal handler to simply call _exit(2) │ >> │in order to terminate the parent process (the │ >> │supervisor). │ >> │ │ >> │Is this expected behavior? It seems to me rather │ >> │desirable that SECCOMP_IOCTL_NOTIF_RECV should give │ >> │an error if the target has terminated. │ >> │ │ >> │Jann posted a patch to rectify this, but there was │ >> │no response (Lore link: https://bit.ly/3jvUBxk) to │ >> │his question about fixing this issue. (I've tried │ >> │building with the patch, but encountered an issue │ >> │with the target process entering D state after a │ >> │signal.) │ >> │ │ >> │For now, this behavior is documented in BUGS. │ >> │ │ >> │Kees Cook commented: Let's change [this] ASAP! │ >> └─────────────────────────────────────────────────────┘ >> >> SECCOMP_IOCTL_NOTIF_ID_VALID >> This operation can be used to check that a notification ID >> returned by an earlier SECCOMP_IOCTL_NOTIF_RECV operation >> is still valid (i.e., that the target still exists and its >> system call is still blocked waiting for a response). >> >> The third ioctl(2) argument is a pointer to the cookie >> (id) returned by the SECCOMP_IOCTL_NOTIF_RECV operation. >> >> This operation is necessary to avoid race conditions that >> can occur when the pid returned by the >> SECCOMP_IOCTL_NOTIF_RECV operation terminates, and that >> process ID is reused by another process. An example of >> this kind of race is the following >> >> 1. A notification is generated on the listening file >> descriptor. The returned seccomp_notif contains the >> TID of the target thread (in the pid field of the >> structure). >> >> 2. The target terminates. >> >> 3. Another thread or process is created on the system that >> by chance reuses the TID that was freed when the target >> terminated. >> >> 4. The supervisor open(2)s the /proc/[tid]/mem file for >> the TID obtained in step 1, with the intention of (say) >> inspecting the memory location(s) that containing the >> argument(s) of the system call that triggered the >> notification in step 1. >> >> In the above scenario, the risk is that the supervisor may >> try to access the memory of a process other than the >> target. This race can be avoided by following the call to >> open(2) with a SECCOMP_IOCTL_NOTIF_ID_VALID operation to >> verify that the process that generated the notification is >> still alive. (Note that if the target terminates after >> the latter step, a subsequent read(2) from the file >> descriptor may return 0, indicating end of file.) >> >> On success (i.e., the notification ID is still valid), >> this operation returns 0. On failure (i.e., the >> notification ID is no longer valid), -1 is returned, and >> errno is set to ENOENT. >> >> SECCOMP_IOCTL_NOTIF_SEND >> This operation is used to send a notification response >> back to the kernel. The third ioctl(2) argument of this >> structure is a pointer to a structure of the following >> form: >> >> struct seccomp_notif_resp { >> __u64 id; /* Cookie value */ >> __s64 val; /* Success return value */ >> __s32 error; /* 0 (success) or negative >> error number */ >> __u32 flags; /* See below */ >> }; >> >> The fields of this structure are as follows: >> >> id This is the cookie value that was obtained using >> the SECCOMP_IOCTL_NOTIF_RECV operation. This >> cookie value allows the kernel to correctly >> associate this response with the system call that >> triggered the user-space notification. >> >> val This is the value that will be used for a spoofed >> success return for the target's system call; see >> below. >> >> error This is the value that will be used as the error >> number (errno) for a spoofed error return for the >> target's system call; see below. >> >> flags This is a bit mask that includes zero or more of >> the following flags: >> >> SECCOMP_USER_NOTIF_FLAG_CONTINUE (since Linux 5.5) >> Tell the kernel to execute the target's >> system call. >> >> Two kinds of response are possible: >> >> · A response to the kernel telling it to execute the >> target's system call. In this case, the flags field >> includes SECCOMP_USER_NOTIF_FLAG_CONTINUE and the error >> and val fields must be zero. >> >> This kind of response can be useful in cases where the >> supervisor needs to do deeper analysis of the target's >> system call than is possible from a seccomp filter >> (e.g., examining the values of pointer arguments), and, >> having decided that the system call does not require >> emulation by the supervisor, the supervisor wants the >> system call to be executed normally in the target. >> >> The SECCOMP_USER_NOTIF_FLAG_CONTINUE flag should be used >> with caution; see NOTES. >> >> · A spoofed return value for the target's system call. In >> this case, the kernel does not execute the target's >> system call, instead causing the system call to return a >> spoofed value as specified by fields of the >> seccomp_notif_resp structure. The supervisor should set >> the fields of this structure as follows: >> >> + flags does not contain >> SECCOMP_USER_NOTIF_FLAG_CONTINUE. >> >> + error is set either to 0 for a spoofed "success" >> return or to a negative error number for a spoofed >> "failure" return. In the former case, the kernel >> causes the target's system call to return the value >> specified in the val field. In the later case, the > > Not a native English speaker but shouldn't this be "latter"? Yup! >> kernel causes the target's system call to return -1, >> and errno is assigned the negated error value. >> >> + val is set to a value that will be used as the return >> value for a spoofed "success" return for the target's >> system call. The value in this field is ignored if >> the error field contains a nonzero value. >> >> ┌─────────────────────────────────────────────────────┐ >> │FIXME │ >> ├─────────────────────────────────────────────────────┤ >> │Kees Cook suggested: │ >> │ │ >> │Strictly speaking, this is architecture specific, │ >> │but all architectures do it this way. Should seccomp │ >> │enforce val == 0 when err != 0 ? │ > > Feels like it should, at least for the SEND ioctl where we already > verify that val and err are both 0 when CONTINUE is specified (as you > pointed out correctly above). Your comments ended here (with no sign-off). Was that intentional? Thanks, Michael -- Michael Kerrisk Linux man-pages maintainer; http://www.kernel.org/doc/man-pages/ Linux/UNIX System Programming Training: http://man7.org/training/ _______________________________________________ Containers mailing list Containers@xxxxxxxxxxxxxxxxxxxxxxxxxx https://lists.linuxfoundation.org/mailman/listinfo/containers