On Thu, Oct 15, 2020 at 01:24:03PM +0200, Michael Kerrisk (man-pages) wrote: > On 10/1/20 1:39 AM, Kees Cook wrote: > > I'll comment more later, but I've run out of time today and I didn't see > > anyone mention this detail yet in the existing threads... :) > > Later never came :-). But, I hope you may have comments for the > next draft, which I will send out soon. Later is now, and Soon approaches! I finally caught up and read through this whole thread. Thank you all for the bug fix[1], and I'm looking forward to more[2]. :) For my reply I figured I'd base it on the current draft, so here's a simulated quote based on the seccomp_user_notif branch of https://git.kernel.org/pub/scm/docs/man-pages/man-pages.git through commit 71101158fe330af5a26552447a0bb433b69e15b7 $ COLUMNS=75 man --nh --nj man2/seccomp_user_notif.2 | sed 's/^/> /' On Sun, Oct 25, 2020 at 01:54:05PM +0100, Michael Kerrisk (man-pages) wrote: > SECCOMP_USER_NOTIF(2) Linux Programmer's Manual SECCOMP_USER_NOTIF(2) > > NAME > seccomp_user_notif - Seccomp user-space notification mechanism > > SYNOPSIS > #include <linux/seccomp.h> > #include <linux/filter.h> > #include <linux/audit.h> > > int seccomp(unsigned int operation, unsigned int flags, void *args); > > #include <sys/ioctl.h> > > int ioctl(int fd, SECCOMP_IOCTL_NOTIF_RECV, > struct seccomp_notif *req); > int ioctl(int fd, SECCOMP_IOCTL_NOTIF_SEND, > struct seccomp_notif_resp *resp); > int ioctl(int fd, SECCOMP_IOCTL_NOTIF_ID_VALID, __u64 *id); > > DESCRIPTION > This page describes the user-space notification mechanism provided > by the Secure Computing (seccomp) facility. As well as the use of > the SECCOMP_FILTER_FLAG_NEW_LISTENER flag, the > SECCOMP_RET_USER_NOTIF action value, and the > SECCOMP_GET_NOTIF_SIZES operation described in seccomp(2), this > mechanism involves the use of a number of related ioctl(2) > operations (described below). > > Overview > In conventional usage of a seccomp filter, the decision about how > to treat a system call is made by the filter itself. By contrast, > the user-space notification mechanism allows the seccomp filter to > delegate the handling of the system call to another user-space > process. Note that this mechanism is explicitly not intended as a > method implementing security policy; see NOTES. > > In the discussion that follows, the thread(s) on which the seccomp > filter is installed is (are) referred to as the target, and the > process that is notified by the user-space notification mechanism > is referred to as the supervisor. > > A suitably privileged supervisor can use the user-space > notification mechanism to perform actions on behalf of the target. > The advantage of the user-space notification mechanism is that the > supervisor will usually be able to retrieve information about the > target and the performed system call that the seccomp filter > itself cannot. (A seccomp filter is limited in the information it > can obtain and the actions that it can perform because it is > running on a virtual machine inside the kernel.) > > An overview of the steps performed by the target and the > supervisor is as follows: > > 1. The target establishes a seccomp filter in the usual manner, > but with two differences: > > • The seccomp(2) flags argument includes the flag > SECCOMP_FILTER_FLAG_NEW_LISTENER. Consequently, the return > value of the (successful) seccomp(2) call is a new nit: extra space > "listening" file descriptor that can be used to receive > notifications. Only one "listening" seccomp filter can be > installed for a thread. I like this limitation, but I expect that it'll need to change in the future. Even with LSMs, we see the need for arbitrary stacking, and the idea of there being only 1 supervisor will eventually break down. Right now there is only 1 because only container managers are using this feature. But if some daemon starts using it to isolate some thread, suddenly it might break if a container manager is trying to listen to it too, etc. I expect it won't be needed soon, but I do think it'll change. > > • In cases where it is appropriate, the seccomp filter returns > the action value SECCOMP_RET_USER_NOTIF. This return value > will trigger a notification event. > > 2. In order that the supervisor can obtain notifications using the > listening file descriptor, (a duplicate of) that file > descriptor must be passed from the target to the supervisor. Yet another reason to have an "activate on exec" mode for seccomp. With no_new_privs _not_ being delayed in such a way, I think it'd be safe to add. The supervisor would get the fd immediately, and then once it fork/execed suddenly the whole thing would activate, and no fd passing needed. The "on exec" boundary is really only needed for oblivious targets. For a coordinated target, I've thought it might be nice to have an arbitrary "go" point, where the target could call seccomp() with something like a SECCOMP_ACTIVATE_DELAYED_FILTERS operation. This lets any process initialization happen that might need to do things that would be blocked by filters, etc. Before: fork install some filters that don't block initialization exec do some initialization install more filters, maybe block exec, seccomp run After: fork install delayed filters exec do some initialization activate delayed filters run In practice, the two-stage filter application has been fine, if sometimes a bit complex (e.g. for user_notif, "do some initialization" includes figuring out how to pass the fd back to the supervisor, etc). > One way in which this could be done is by passing the file > descriptor over a UNIX domain socket connection between the > target and the supervisor (using the SCM_RIGHTS ancillary > message type described in unix(7)). > > 3. The supervisor will receive notification events on the > listening file descriptor. These events are returned as > structures of type seccomp_notif. Because this structure and > its size may evolve over kernel versions, the supervisor must > first determine the size of this structure using the seccomp(2) > SECCOMP_GET_NOTIF_SIZES operation, which returns a structure of > type seccomp_notif_sizes. The supervisor allocates a buffer of > size seccomp_notif_sizes.seccomp_notif bytes to receive > notification events. In addition,the supervisor allocates > another buffer of size seccomp_notif_sizes.seccomp_notif_resp > bytes for the response (a struct seccomp_notif_resp structure) > that it will provide to the kernel (and thus the target). > > 4. The target then performs its workload, which includes system > calls that will be controlled by the seccomp filter. Whenever > one of these system calls causes the filter to return the > SECCOMP_RET_USER_NOTIF action value, the kernel does not (yet) > execute the system call; instead, execution of the target is > temporarily blocked inside the kernel (in a sleep state that is > interruptible by signals) and a notification event is generated > on the listening file descriptor. > > 5. The supervisor can now repeatedly monitor the listening file > descriptor for SECCOMP_RET_USER_NOTIF-triggered events. To do > this, the supervisor uses the SECCOMP_IOCTL_NOTIF_RECV ioctl(2) > operation to read information about a notification event; this > operation blocks until an event is available. The operation > returns a seccomp_notif structure containing information about > the system call that is being attempted by the target. > > 6. The seccomp_notif structure returned by the > SECCOMP_IOCTL_NOTIF_RECV operation includes the same > information (a seccomp_data structure) that was passed to the > seccomp filter. This information allows the supervisor to > discover the system call number and the arguments for the > target's system call. In addition, the notification event > contains the ID of the thread that triggered the notification. Should "cookie" be at least named here, just to provide a bit more context for when it is mentioned in 8 below? E.g.: ... In addition, the notification event contains the triggering thread's ID and a unique cookie to be used in subsequent SECCOMP_IOCTL_NOTIF_ID_VALID and SECCOMP_IOCTL_NOTIF_SEND operations. > > The information in the notification can be used to discover the > values of pointer arguments for the target's system call. > (This is something that can't be done from within a seccomp > filter.) One way in which the supervisor can do this is to > open the corresponding /proc/[tid]/mem file (see proc(5)) and > read bytes from the location that corresponds to one of the > pointer arguments whose value is supplied in the notification > event. (The supervisor must be careful to avoid a race > condition that can occur when doing this; see the description > of the SECCOMP_IOCTL_NOTIF_ID_VALID ioctl(2) operation below.) > In addition, the supervisor can access other system information > that is visible in user space but which is not accessible from > a seccomp filter. > > 7. Having obtained information as per the previous step, the > supervisor may then choose to perform an action in response to > the target's system call (which, as noted above, is not > executed when the seccomp filter returns the > SECCOMP_RET_USER_NOTIF action value). > > One example use case here relates to containers. The target > may be located inside a container where it does not have > sufficient capabilities to mount a filesystem in the > container's mount namespace. However, the supervisor may be a > more privileged process that does have sufficient capabilities > to perform the mount operation. > > 8. The supervisor then sends a response to the notification. The > information in this response is used by the kernel to construct > a return value for the target's system call and provide a value > that will be assigned to the errno variable of the target. > > The response is sent using the SECCOMP_IOCTL_NOTIF_SEND > ioctl(2) operation, which is used to transmit a > seccomp_notif_resp structure to the kernel. This structure > includes a cookie value that the supervisor obtained in the > seccomp_notif structure returned by the > SECCOMP_IOCTL_NOTIF_RECV operation. This cookie value allows > the kernel to associate the response with the target. Describing where the cookie came from seems like it should live in 6 above. A reader would have to take this new info and figure out where SECCOMP_IOCTL_NOTIF_RECV was described and piece it together. With the suggestion to 6 above, maybe: ... This structure must include the cookie value that the supervisor obtained in the seccomp_notif structure returned by the SECCOMP_IOCTL_NOTIF_RECV operation, which allows the kernel to associate the response with the target. > > 9. Once the notification has been sent, the system call in the > target thread unblocks, returning the information that was > provided by the supervisor in the notification response. > > As a variation on the last two steps, the supervisor can send a > response that tells the kernel that it should execute the target > thread's system call; see the discussion of > SECCOMP_USER_NOTIF_FLAG_CONTINUE, below. > > ioctl(2) operations > The following ioctl(2) operations are provided to support seccomp > user-space notification. For each of these operations, the first > (file descriptor) argument of ioctl(2) is the listening file > descriptor returned by a call to seccomp(2) with the > SECCOMP_FILTER_FLAG_NEW_LISTENER flag. > > SECCOMP_IOCTL_NOTIF_RECV > This operation is used to obtain a user-space notification > event. If no such event is currently pending, the > operation blocks until an event occurs. The third ioctl(2) > argument is a pointer to a structure of the following form > which contains information about the event. This structure > must be zeroed out before the call. > > struct seccomp_notif { > __u64 id; /* Cookie */ > __u32 pid; /* TID of target thread */ Should we rename this variable from pid to tid? Yes it's UAPI, but yay for anonymous unions: struct seccomp_notif { __u64 id; /* Cookie */ union { __u32 pid; __u32 tid; /* TID of target thread */ }; __u32 flags; /* Currently unused (0) */ struct seccomp_data data; /* See seccomp(2) */ }; > __u32 flags; /* Currently unused (0) */ > struct seccomp_data data; /* See seccomp(2) */ > }; > > The fields in this structure are as follows: > > id This is a cookie for the notification. Each such > cookie is guaranteed to be unique for the > corresponding seccomp filter. > > • It can be used with the > SECCOMP_IOCTL_NOTIF_ID_VALID ioctl(2) operation to > verify that the target is still alive. > > • When returning a notification response to the > kernel, the supervisor must include the cookie > value in the seccomp_notif_resp structure that is > specified as the argument of the > SECCOMP_IOCTL_NOTIF_SEND operation. > > pid This is the thread ID of the target thread that > triggered the notification event. > > flags This is a bit mask of flags providing further > information on the event. In the current > implementation, this field is always zero. > > data This is a seccomp_data structure containing > information about the system call that triggered the > notification. This is the same structure that is > passed to the seccomp filter. See seccomp(2) for > details of this structure. > > On success, this operation returns 0; on failure, -1 is > returned, and errno is set to indicate the cause of the > error. This operation can fail with the following errors: > > EINVAL (since Linux 5.5) > The seccomp_notif structure that was passed to the > call contained nonzero fields. > > ENOENT The target thread was killed by a signal as the > notification information was being generated, or the > target's (blocked) system call was interrupted by a > signal handler. > > SECCOMP_IOCTL_NOTIF_ID_VALID > This operation can be used to check that a notification ID > returned by an earlier SECCOMP_IOCTL_NOTIF_RECV operation > is still valid (i.e., that the target still exists). Maybe clarify a bit more, since it's covering more than just "is the target still alive", but also "is that syscall still waiting for a response": is still valid (i.e., that the target still exists and the syscall is still blocked waiting for a response). > > The third ioctl(2) argument is a pointer to the cookie (id) > returned by the SECCOMP_IOCTL_NOTIF_RECV operation. > > This operation is necessary to avoid race conditions that > can occur when the pid returned by the > SECCOMP_IOCTL_NOTIF_RECV operation terminates, and that > process ID is reused by another process. An example of > this kind of race is the following > > 1. A notification is generated on the listening file > descriptor. The returned seccomp_notif contains the TID > of the target thread (in the pid field of the > structure). > > 2. The target terminates. > > 3. Another thread or process is created on the system that > by chance reuses the TID that was freed when the target > terminated. > > 4. The supervisor open(2)s the /proc/[tid]/mem file for the > TID obtained in step 1, with the intention of (say) > inspecting the memory location(s) that containing the > argument(s) of the system call that triggered the > notification in step 1. > > In the above scenario, the risk is that the supervisor may > try to access the memory of a process other than the > target. This race can be avoided by following the call to > open(2) with a SECCOMP_IOCTL_NOTIF_ID_VALID operation to > verify that the process that generated the notification is > still alive. (Note that if the target terminates after the > latter step, a subsequent read(2) from the file descriptor > may return 0, indicating end of file.) > > On success (i.e., the notification ID is still valid), this > operation returns 0. On failure (i.e., the notification ID > is no longer valid), -1 is returned, and errno is set to > ENOENT. > > SECCOMP_IOCTL_NOTIF_SEND > This operation is used to send a notification response back > to the kernel. The third ioctl(2) argument of this > structure is a pointer to a structure of the following > form: > > struct seccomp_notif_resp { > __u64 id; /* Cookie value */ > __s64 val; /* Success return value */ > __s32 error; /* 0 (success) or negative > error number */ > __u32 flags; /* See below */ > }; > > The fields of this structure are as follows: > > id This is the cookie value that was obtained using the > SECCOMP_IOCTL_NOTIF_RECV operation. This cookie > value allows the kernel to correctly associate this > response with the system call that triggered the > user-space notification. > > val This is the value that will be used for a spoofed > success return for the target's system call; see > below. > > error This is the value that will be used as the error > number (errno) for a spoofed error return for the > target's system call; see below. > > flags This is a bit mask that includes zero or more of the > following flags: > > SECCOMP_USER_NOTIF_FLAG_CONTINUE (since Linux 5.5) > Tell the kernel to execute the target's > system call. > > Two kinds of response are possible: > > • A response to the kernel telling it to execute the > target's system call. In this case, the flags field > includes SECCOMP_USER_NOTIF_FLAG_CONTINUE and the error > and val fields must be zero. > > This kind of response can be useful in cases where the > supervisor needs to do deeper analysis of the target's > system call than is possible from a seccomp filter (e.g., > examining the values of pointer arguments), and, having > decided that the system call does not require emulation > by the supervisor, the supervisor wants the system call > to be executed normally in the target. > > The SECCOMP_USER_NOTIF_FLAG_CONTINUE flag should be used > with caution; see NOTES. > > • A spoofed return value for the target's system call. In > this case, the kernel does not execute the target's > system call, instead causing the system call to return a > spoofed value as specified by fields of the > seccomp_notif_resp structure. The supervisor should set > the fields of this structure as follows: > > + flags does not contain > SECCOMP_USER_NOTIF_FLAG_CONTINUE. > > + error is set either to 0 for a spoofed "success" > return or to a negative error number for a spoofed > "failure" return. In the former case, the kernel > causes the target's system call to return the value > specified in the val field. In the later case, the > kernel causes the target's system call to return -1, > and errno is assigned the negated error value. > > + val is set to a value that will be used as the return > value for a spoofed "success" return for the target's > system call. The value in this field is ignored if > the error field contains a nonzero value. Strictly speaking, this is architecture specific, but all architectures do it this way. Should seccomp enforce val == 0 when err != 0 ? > > On success, this operation returns 0; on failure, -1 is > returned, and errno is set to indicate the cause of the > error. This operation can fail with the following errors: > > EINPROGRESS > A response to this notification has already been > sent. > > EINVAL An invalid value was specified in the flags field. > > EINVAL The flags field contained > SECCOMP_USER_NOTIF_FLAG_CONTINUE, and the error or > val field was not zero. > > ENOENT The blocked system call in the target has been > interrupted by a signal handler or the target has > terminated. > > NOTES > select()/poll()/epoll semantics > The file descriptor returned when seccomp(2) is employed with the > SECCOMP_FILTER_FLAG_NEW_LISTENER flag can be monitored using > poll(2), epoll(7), and select(2). These interfaces indicate that > the file descriptor is ready as follows: > > • When a notification is pending, these interfaces indicate that > the file descriptor is readable. Following such an indication, > a subsequent SECCOMP_IOCTL_NOTIF_RECV ioctl(2) will not block, > returning either information about a notification or else > failing with the error EINTR if the target has been killed by a > signal or its system call has been interrupted by a signal > handler. > > • After the notification has been received (i.e., by the > SECCOMP_IOCTL_NOTIF_RECV ioctl(2) operation), these interfaces > indicate that the file descriptor is writable, meaning that a > notification response can be sent using the > SECCOMP_IOCTL_NOTIF_SEND ioctl(2) operation. > > • After the last thread using the filter has terminated and been > reaped using waitpid(2) (or similar), the file descriptor > indicates an end-of-file condition (readable in select(2); > POLLHUP/EPOLLHUP in poll(2)/ epoll_wait(2)). I'll reply separately about the "ioctl() does not terminate when all filters have terminated" case. > > Design goals; use of SECCOMP_USER_NOTIF_FLAG_CONTINUE > The intent of the user-space notification feature is to allow > system calls to be performed on behalf of the target. The > target's system call should either be handled by the supervisor or > allowed to continue normally in the kernel (where standard > security policies will be applied). > > Note well: this mechanism must not be used to make security policy > decisions about the system call, which would be inherently race- > prone for reasons described next. > > The SECCOMP_USER_NOTIF_FLAG_CONTINUE flag must be used with > caution. If set by the supervisor, the target's system call will > continue. However, there is a time-of-check, time-of-use race > here, since an attacker could exploit the interval of time where > the target is blocked waiting on the "continue" response to do > things such as rewriting the system call arguments. > > Note furthermore that a user-space notifier can be bypassed if the > existing filters allow the use of seccomp(2) or prctl(2) to > install a filter that returns an action value with a higher > precedence than SECCOMP_RET_USER_NOTIF (see seccomp(2)). > > It should thus be absolutely clear that the seccomp user-space > notification mechanism can not be used to implement a security > policy! It should only ever be used in scenarios where a more > privileged process supervises the system calls of a lesser > privileged target to get around kernel-enforced security > restrictions when the supervisor deems this safe. In other words, > in order to continue a system call, the supervisor should be sure > that another security mechanism or the kernel itself will > sufficiently block the system call if its arguments are rewritten > to something unsafe. > > Interaction with SA_RESTART signal handlers > Consider the following scenario: > > • The target process has used sigaction(2) to install a signal > handler with the SA_RESTART flag. > > • The target has made a system call that triggered a seccomp user- > space notification and the target is currently blocked until the > supervisor sends a notification response. > > • A signal is delivered to the target and the signal handler is > executed. > > • When (if) the supervisor attempts to send a notification > response, the SECCOMP_IOCTL_NOTIF_SEND ioctl(2)) operation will > fail with the ENOENT error. > > In this scenario, the kernel will restart the target's system > call. Consequently, the supervisor will receive another user- > space notification. Thus, depending on how many times the blocked > system call is interrupted by a signal handler, the supervisor may > receive multiple notifications for the same system call in the maybe "... for the same instance of a system call in the target." for clarity? > target. > > One oddity is that system call restarting as described in this > scenario will occur even for the blocking system calls listed in > signal(7) that would never normally be restarted by the SA_RESTART > flag. Does this need fixing? I imagine the correct behavior for this case would be a response to _SEND of EINPROGRESS and the target would see EINTR normally? I mean, it's not like seccomp doesn't already expose weirdness with syscall restarts. Not even arm64 compat agrees[3] with arm32 in this regard. :( > BUGS > If a SECCOMP_IOCTL_NOTIF_RECV ioctl(2) operation is performed > after the target terminates, then the ioctl(2) call simply blocks > (rather than returning an error to indicate that the target no > longer exists). I want this fixed. It caused me no end of pain when building the selftests, and ended up spawning my implementing a global test timeout in kselftest. :P Before the usage counter refactor, there was no sane way to deal with this, but now I think we're close[2]. I'll reply separately about this. > > EXAMPLES > The (somewhat contrived) program shown below demonstrates the use > of the interfaces described in this page. The program creates a > child process that serves as the "target" process. The child > process installs a seccomp filter that returns the > SECCOMP_RET_USER_NOTIF action value if a call is made to mkdir(2). > The child process then calls mkdir(2) once for each of the > supplied command-line arguments, and reports the result returned > by the call. After processing all arguments, the child process > terminates. > > The parent process acts as the supervisor, listening for the > notifications that are generated when the target process calls > mkdir(2). When such a notification occurs, the supervisor > examines the memory of the target process (using /proc/[pid]/mem) > to discover the pathname argument that was supplied to the > mkdir(2) call, and performs one of the following actions: I like this example! It's simple enough to be understandable and complex enough to show the purpose of user_notif. :) > > • If the pathname begins with the prefix "/tmp/", then the > supervisor attempts to create the specified directory, and then > spoofs a return for the target process based on the return value > of the supervisor's mkdir(2) call. In the event that that call > succeeds, the spoofed success return value is the length of the > pathname. > > • If the pathname begins with "./" (i.e., it is a relative > pathname), the supervisor sends a > SECCOMP_USER_NOTIF_FLAG_CONTINUE response to the kernel to say > that the kernel should execute the target process's mkdir(2) > call. > > • If the pathname begins with some other prefix, the supervisor > spoofs an error return for the target process, so that the > target process's mkdir(2) call appears to fail with the error > EOPNOTSUPP ("Operation not supported"). Additionally, if the > specified pathname is exactly "/bye", then the supervisor > terminates. > > This program can be used to demonstrate various aspects of the > behavior of the seccomp user-space notification mechanism. To > help aid such demonstrations, the program logs various messages to > show the operation of the target process (lines prefixed "T:") and > the supervisor (indented lines prefixed "S:"). > > In the following example, the target attempts to create the > directory /tmp/x. Upon receiving the notification, the supervisor > creates the directory on the target's behalf, and spoofs a success > return to be received by the target process's mkdir(2) call. > > $ ./seccomp_unotify /tmp/x > T: PID = 23168 > > T: about to mkdir("/tmp/x") > S: got notification (ID 0x17445c4a0f4e0e3c) for PID 23168 > S: executing: mkdir("/tmp/x", 0700) > S: success! spoofed return = 6 > S: sending response (flags = 0; val = 6; error = 0) > T: SUCCESS: mkdir(2) returned 6 > > T: terminating > S: target has terminated; bye > > In the above output, note that the spoofed return value seen by > the target process is 6 (the length of the pathname /tmp/x), > whereas a normal mkdir(2) call returns 0 on success. > > In the next example, the target attempts to create a directory > using the relative pathname ./sub. Since this pathname starts > with "./", the supervisor sends a SECCOMP_USER_NOTIF_FLAG_CONTINUE > response to the kernel, and the kernel then (successfully) > executes the target process's mkdir(2) call. > > $ ./seccomp_unotify ./sub > T: PID = 23204 > > T: about to mkdir("./sub") > S: got notification (ID 0xddb16abe25b4c12) for PID 23204 > S: target can execute system call > S: sending response (flags = 0x1; val = 0; error = 0) > T: SUCCESS: mkdir(2) returned 0 > > T: terminating > S: target has terminated; bye > > If the target process attempts to create a directory with a > pathname that doesn't start with "." and doesn't begin with the > prefix "/tmp/", then the supervisor spoofs an error return > (EOPNOTSUPP, "Operation not supported") for the target's mkdir(2) > call (which is not executed): > > $ ./seccomp_unotify /xxx > T: PID = 23178 > > T: about to mkdir("/xxx") > S: got notification (ID 0xe7dc095d1c524e80) for PID 23178 > S: spoofing error response (Operation not supported) > S: sending response (flags = 0; val = 0; error = -95) > T: ERROR: mkdir(2): Operation not supported > > T: terminating > S: target has terminated; bye > > In the next example, the target process attempts to create a > directory with the pathname /tmp/nosuchdir/b. Upon receiving the > notification, the supervisor attempts to create that directory, > but the mkdir(2) call fails because the directory /tmp/nosuchdir > does not exist. Consequently, the supervisor spoofs an error > return that passes the error that it received back to the target > process's mkdir(2) call. > > $ ./seccomp_unotify /tmp/nosuchdir/b > T: PID = 23199 > > T: about to mkdir("/tmp/nosuchdir/b") > S: got notification (ID 0x8744454293506046) for PID 23199 > S: executing: mkdir("/tmp/nosuchdir/b", 0700) > S: failure! (errno = 2; No such file or directory) > S: sending response (flags = 0; val = 0; error = -2) > T: ERROR: mkdir(2): No such file or directory > > T: terminating > S: target has terminated; bye > > If the supervisor receives a notification and sees that the > argument of the target's mkdir(2) is the string "/bye", then (as > well as spoofing an EOPNOTSUPP error), the supervisor terminates. > If the target process subsequently executes another mkdir(2) that > triggers its seccomp filter to return the SECCOMP_RET_USER_NOTIF > action value, then the kernel causes the target process's system > call to fail with the error ENOSYS ("Function not implemented"). > This is demonstrated by the following example: > > $ ./seccomp_unotify /bye /tmp/y > T: PID = 23185 > > T: about to mkdir("/bye") > S: got notification (ID 0xa81236b1d2f7b0f4) for PID 23185 > S: spoofing error response (Operation not supported) > S: sending response (flags = 0; val = 0; error = -95) > S: terminating ********** > T: ERROR: mkdir(2): Operation not supported > > T: about to mkdir("/tmp/y") > T: ERROR: mkdir(2): Function not implemented > > T: terminating > > Program source > #define _GNU_SOURCE > #include <sys/types.h> > #include <sys/prctl.h> > #include <fcntl.h> > #include <limits.h> > #include <signal.h> > #include <stddef.h> > #include <stdint.h> > #include <stdbool.h> > #include <linux/audit.h> > #include <sys/syscall.h> > #include <sys/stat.h> > #include <linux/filter.h> > #include <linux/seccomp.h> > #include <sys/ioctl.h> > #include <stdio.h> > #include <stdlib.h> > #include <unistd.h> > #include <errno.h> > #include <sys/socket.h> > #include <sys/un.h> > > #define errExit(msg) do { perror(msg); exit(EXIT_FAILURE); \ > } while (0) Because I love macros, you can expand this to make it take a format string: #define errExit(fmt, ...) do { \ char __err[64]; \ strerror_r(errno, __err, sizeof(__err)); \ fprintf(stderr, fmt ": %s\n", ##__VA_ARG__, __err); \ exit(EXIT_FAILURE); \ } while (0) > > /* Send the file descriptor 'fd' over the connected UNIX domain socket > 'sockfd'. Returns 0 on success, or -1 on error. */ > > static int > sendfd(int sockfd, int fd) > { > struct msghdr msgh; > struct iovec iov; > int data; > struct cmsghdr *cmsgp; > > /* Allocate a char array of suitable size to hold the ancillary data. > However, since this buffer is in reality a 'struct cmsghdr', use a > union to ensure that it is suitably aligned. */ > union { > char buf[CMSG_SPACE(sizeof(int))]; > /* Space large enough to hold an 'int' */ > struct cmsghdr align; > } controlMsg; > > /* The 'msg_name' field can be used to specify the address of the > destination socket when sending a datagram. However, we do not > need to use this field because 'sockfd' is a connected socket. */ > > msgh.msg_name = NULL; > msgh.msg_namelen = 0; > > /* On Linux, we must transmit at least one byte of real data in > order to send ancillary data. We transmit an arbitrary integer > whose value is ignored by recvfd(). */ > > msgh.msg_iov = &iov; > msgh.msg_iovlen = 1; > iov.iov_base = &data; > iov.iov_len = sizeof(int); > data = 12345; > > /* Set 'msghdr' fields that describe ancillary data */ > > msgh.msg_control = controlMsg.buf; > msgh.msg_controllen = sizeof(controlMsg.buf); > > /* Set up ancillary data describing file descriptor to send */ > > cmsgp = CMSG_FIRSTHDR(&msgh); > cmsgp->cmsg_level = SOL_SOCKET; > cmsgp->cmsg_type = SCM_RIGHTS; > cmsgp->cmsg_len = CMSG_LEN(sizeof(int)); > memcpy(CMSG_DATA(cmsgp), &fd, sizeof(int)); > > /* Send real plus ancillary data */ > > if (sendmsg(sockfd, &msgh, 0) == -1) > return -1; > > return 0; > } > > /* Receive a file descriptor on a connected UNIX domain socket. Returns > the received file descriptor on success, or -1 on error. */ > > static int > recvfd(int sockfd) > { > struct msghdr msgh; > struct iovec iov; > int data, fd; > ssize_t nr; > > /* Allocate a char buffer for the ancillary data. See the comments > in sendfd() */ > union { > char buf[CMSG_SPACE(sizeof(int))]; > struct cmsghdr align; > } controlMsg; > struct cmsghdr *cmsgp; > > /* The 'msg_name' field can be used to obtain the address of the > sending socket. However, we do not need this information. */ > > msgh.msg_name = NULL; > msgh.msg_namelen = 0; > > /* Specify buffer for receiving real data */ > > msgh.msg_iov = &iov; > msgh.msg_iovlen = 1; > iov.iov_base = &data; /* Real data is an 'int' */ > iov.iov_len = sizeof(int); > > /* Set 'msghdr' fields that describe ancillary data */ > > msgh.msg_control = controlMsg.buf; > msgh.msg_controllen = sizeof(controlMsg.buf); > > /* Receive real plus ancillary data; real data is ignored */ > > nr = recvmsg(sockfd, &msgh, 0); > if (nr == -1) > return -1; > > cmsgp = CMSG_FIRSTHDR(&msgh); > > /* Check the validity of the 'cmsghdr' */ > > if (cmsgp == NULL || > cmsgp->cmsg_len != CMSG_LEN(sizeof(int)) || > cmsgp->cmsg_level != SOL_SOCKET || > cmsgp->cmsg_type != SCM_RIGHTS) { > errno = EINVAL; > return -1; > } > > /* Return the received file descriptor to our caller */ > > memcpy(&fd, CMSG_DATA(cmsgp), sizeof(int)); > return fd; > } > > static void > sigchldHandler(int sig) > { > char *msg = "\tS: target has terminated; bye\n"; > > write(STDOUT_FILENO, msg, strlen(msg)); white space nit: extra space before "=" efficiency nit: strlen isn't needed, since it can be done with compile-time constant constants: char msg[] = "\tS: target has terminated; bye\n"; write(STDOUT_FILENO, msg, sizeof(msg) - 1); (some optimization levels may already replace the strlen a sizeof - 1) > _exit(EXIT_SUCCESS); > } > > static int > seccomp(unsigned int operation, unsigned int flags, void *args) > { > return syscall(__NR_seccomp, operation, flags, args); > } > > /* The following is the x86-64-specific BPF boilerplate code for checking > that the BPF program is running on the right architecture + ABI. At > completion of these instructions, the accumulator contains the system > call number. */ > > /* For the x32 ABI, all system call numbers have bit 30 set */ > > #define X32_SYSCALL_BIT 0x40000000 > > #define X86_64_CHECK_ARCH_AND_LOAD_SYSCALL_NR \ > BPF_STMT(BPF_LD | BPF_W | BPF_ABS, \ > (offsetof(struct seccomp_data, arch))), \ > BPF_JUMP(BPF_JMP | BPF_JEQ | BPF_K, AUDIT_ARCH_X86_64, 0, 2), \ > BPF_STMT(BPF_LD | BPF_W | BPF_ABS, \ > (offsetof(struct seccomp_data, nr))), \ > BPF_JUMP(BPF_JMP | BPF_JGE | BPF_K, X32_SYSCALL_BIT, 0, 1), \ > BPF_STMT(BPF_RET | BPF_K, SECCOMP_RET_KILL_PROCESS) > > /* installNotifyFilter() installs a seccomp filter that generates > user-space notifications (SECCOMP_RET_USER_NOTIF) when the process > calls mkdir(2); the filter allows all other system calls. > > The function return value is a file descriptor from which the > user-space notifications can be fetched. */ > > static int > installNotifyFilter(void) > { > struct sock_filter filter[] = { > X86_64_CHECK_ARCH_AND_LOAD_SYSCALL_NR, > > /* mkdir() triggers notification to user-space supervisor */ > > BPF_JUMP(BPF_JMP | BPF_JEQ | BPF_K, __NR_mkdir, 0, 1), > BPF_STMT(BPF_RET + BPF_K, SECCOMP_RET_USER_NOTIF), > > /* Every other system call is allowed */ > > BPF_STMT(BPF_RET | BPF_K, SECCOMP_RET_ALLOW), > }; > > struct sock_fprog prog = { > .len = sizeof(filter) / sizeof(filter[0]), > .filter = filter, > }; > > /* Install the filter with the SECCOMP_FILTER_FLAG_NEW_LISTENER flag; > as a result, seccomp() returns a notification file descriptor. */ > > int notifyFd = seccomp(SECCOMP_SET_MODE_FILTER, > SECCOMP_FILTER_FLAG_NEW_LISTENER, &prog); > if (notifyFd == -1) > errExit("seccomp-install-notify-filter"); > > return notifyFd; > } > > /* Close a pair of sockets created by socketpair() */ > > static void > closeSocketPair(int sockPair[2]) > { > if (close(sockPair[0]) == -1) > errExit("closeSocketPair-close-0"); > if (close(sockPair[1]) == -1) > errExit("closeSocketPair-close-1"); > } > > /* Implementation of the target process; create a child process that: > > (1) installs a seccomp filter with the > SECCOMP_FILTER_FLAG_NEW_LISTENER flag; > (2) writes the seccomp notification file descriptor returned from > the previous step onto the UNIX domain socket, 'sockPair[0]'; > (3) calls mkdir(2) for each element of 'argv'. > > The function return value in the parent is the PID of the child > process; the child does not return from this function. */ > > static pid_t > targetProcess(int sockPair[2], char *argv[]) > { > pid_t targetPid = fork(); > if (targetPid == -1) > errExit("fork"); > > if (targetPid > 0) /* In parent, return PID of child */ > return targetPid; > > /* Child falls through to here */ > > printf("T: PID = %ld\n", (long) getpid()); > > /* Install seccomp filter(s) */ > > if (prctl(PR_SET_NO_NEW_PRIVS, 1, 0, 0, 0)) > errExit("prctl"); > > int notifyFd = installNotifyFilter(); > > /* Pass the notification file descriptor to the tracing process over > a UNIX domain socket */ > > if (sendfd(sockPair[0], notifyFd) == -1) > errExit("sendfd"); > > /* Notification and socket FDs are no longer needed in target */ > > if (close(notifyFd) == -1) > errExit("close-target-notify-fd"); > > closeSocketPair(sockPair); > > /* Perform a mkdir() call for each of the command-line arguments */ > > for (char **ap = argv; *ap != NULL; ap++) { > printf("\nT: about to mkdir(\"%s\")\n", *ap); > > int s = mkdir(*ap, 0700); > if (s == -1) > perror("T: ERROR: mkdir(2)"); > else > printf("T: SUCCESS: mkdir(2) returned %d\n", s); > } > > printf("\nT: terminating\n"); > exit(EXIT_SUCCESS); > } > > /* Check that the notification ID provided by a SECCOMP_IOCTL_NOTIF_RECV > operation is still valid. It will no longer be valid if the process > has terminated. This operation can be used when accessing /proc/PID > files in the target process in order to avoid TOCTOU race conditions > where the PID that is returned by SECCOMP_IOCTL_NOTIF_RECV terminates > and is reused by another process. */ > > static void > checkNotificationIdIsValid(int notifyFd, uint64_t id) > { > if (ioctl(notifyFd, SECCOMP_IOCTL_NOTIF_ID_VALID, &id) == -1) { > fprintf(stderr, "\tS: notification ID check: " > "target has terminated!!!\n"); > > exit(EXIT_FAILURE); And now you can do: errExit("\tS: notification ID check: " "target has terminated! ioctl"); ;) > } > } > > /* Access the memory of the target process in order to discover the > pathname that was given to mkdir() */ > > static bool > getTargetPathname(struct seccomp_notif *req, int notifyFd, > char *path, size_t len) > { > char procMemPath[PATH_MAX]; > > snprintf(procMemPath, sizeof(procMemPath), "/proc/%d/mem", req->pid); > > int procMemFd = open(procMemPath, O_RDONLY); > if (procMemFd == -1) > errExit("Supervisor: open"); > > /* Check that the process whose info we are accessing is still alive. > If the SECCOMP_IOCTL_NOTIF_ID_VALID operation (performed > in checkNotificationIdIsValid()) succeeds, we know that the > /proc/PID/mem file descriptor that we opened corresponds to the > process for which we received a notification. If that process > subsequently terminates, then read() on that file descriptor > will return 0 (EOF). */ > > checkNotificationIdIsValid(notifyFd, req->id); > > /* Read bytes at the location containing the pathname argument > (i.e., the first argument) of the mkdir(2) call */ > > ssize_t nread = pread(procMemFd, path, len, req->data.args[0]); > if (nread == -1) > errExit("pread"); > > if (nread == 0) { > fprintf(stderr, "\tS: pread() of /proc/PID/mem " > "returned 0 (EOF)\n"); > exit(EXIT_FAILURE); > } > > if (close(procMemFd) == -1) > errExit("close-/proc/PID/mem"); > > /* We have no guarantees about what was in the memory of the target > process. We therefore treat the buffer returned by pread() as > untrusted input. The buffer should be terminated by a null byte; > if not, then we will trigger an error for the target process. */ > > for (int j = 0; j < nread; j++) > if (path[j] == ' ') This rendering typo (' ' vs '\0') ends up manifesting badly. ;) The man source shows: if (path[j] == \(aq\0\(aq) I think this needs to be \\0 ? Or it could also be a tested as: if (strnlen(path, nread) < nread) > return true; > > return false; > } > > /* Handle notifications that arrive via the SECCOMP_RET_USER_NOTIF file > descriptor, 'notifyFd'. */ > > static void > handleNotifications(int notifyFd) > { > struct seccomp_notif_sizes sizes; > char path[PATH_MAX]; > > /* Discover the sizes of the structures that are used to receive > notifications and send notification responses, and allocate > buffers of those sizes. */ > > if (seccomp(SECCOMP_GET_NOTIF_SIZES, 0, &sizes) == -1) > errExit("\tS: seccomp-SECCOMP_GET_NOTIF_SIZES"); > > struct seccomp_notif *req = malloc(sizes.seccomp_notif); > if (req == NULL) > errExit("\tS: malloc"); > > /* When allocating the response buffer, we must allow for the fact > that the user-space binary may have been built with user-space > headers where 'struct seccomp_notif_resp' is bigger than the > response buffer expected by the (older) kernel. Therefore, we > allocate a buffer that is the maximum of the two sizes. This > ensures that if the supervisor places bytes into the response > structure that are past the response size that the kernel expects, > then the supervisor is not touching an invalid memory location. */ > > size_t resp_size = sizes.seccomp_notif_resp; > if (sizeof(struct seccomp_notif_resp) > resp_size) > resp_size = sizeof(struct seccomp_notif_resp); > > struct seccomp_notif_resp *resp = malloc(resp_size); > if (resp == NULL) > errExit("\tS: malloc"); > > /* Loop handling notifications */ > > for (;;) { > /* Wait for next notification, returning info in '*req' */ > > memset(req, 0, sizes.seccomp_notif); > if (ioctl(notifyFd, SECCOMP_IOCTL_NOTIF_RECV, req) == -1) { > if (errno == EINTR) > continue; > errExit("Supervisor: ioctl-SECCOMP_IOCTL_NOTIF_RECV"); > } > > printf("\tS: got notification (ID %#llx) for PID %d\n", > req->id, req->pid); > > /* The only system call that can generate a notification event > is mkdir(2). Nevertheless, we check that the notified system > call is indeed mkdir() as kind of future-proofing of this > code in case the seccomp filter is later modified to > generate notifications for other system calls. */ > > if (req->data.nr != __NR_mkdir) { > printf("\tS: notification contained unexpected " > "system call number; bye!!!\n"); > exit(EXIT_FAILURE); > } > > bool pathOK = getTargetPathname(req, notifyFd, path, > sizeof(path)); > > /* Prepopulate some fields of the response */ > > resp->id = req->id; /* Response includes notification ID */ > resp->flags = 0; > resp->val = 0; > > /* If the target pathname was not valid, trigger an EINVAL error; > if the directory is in /tmp, then create it on behalf of the > supervisor; if the pathname starts with '.', tell the kernel > to let the target process execute the mkdir(); otherwise, give > an error for a directory pathname in any other location. */ > > if (!pathOK) { > resp->error = -EINVAL; > printf("\tS: spoofing error for invalid pathname (%s)\n", > strerror(-resp->error)); > } else if (strncmp(path, "/tmp/", strlen("/tmp/")) == 0) { > printf("\tS: executing: mkdir(\"%s\", %#llo)\n", > path, req->data.args[1]); > > if (mkdir(path, req->data.args[1]) == 0) { > resp->error = 0; /* "Success" */ > resp->val = strlen(path); /* Used as return value of > mkdir() in target */ > printf("\tS: success! spoofed return = %lld\n", > resp->val); > } else { > > /* If mkdir() failed in the supervisor, pass the error > back to the target */ > > resp->error = -errno; > printf("\tS: failure! (errno = %d; %s)\n", errno, > strerror(errno)); > } > } else if (strncmp(path, "./", strlen("./")) == 0) { > resp->error = resp->val = 0; > resp->flags = SECCOMP_USER_NOTIF_FLAG_CONTINUE; > printf("\tS: target can execute system call\n"); > } else { > resp->error = -EOPNOTSUPP; > printf("\tS: spoofing error response (%s)\n", > strerror(-resp->error)); > } > > /* Send a response to the notification */ > > printf("\tS: sending response " > "(flags = %#x; val = %lld; error = %d)\n", > resp->flags, resp->val, resp->error); > > if (ioctl(notifyFd, SECCOMP_IOCTL_NOTIF_SEND, resp) == -1) { > if (errno == ENOENT) > printf("\tS: response failed with ENOENT; " > "perhaps target process's syscall was " > "interrupted by a signal?\n"); > else > perror("ioctl-SECCOMP_IOCTL_NOTIF_SEND"); > } > > /* If the pathname is just "/bye", then the supervisor > terminates. This allows us to see what happens if the > target process makes further calls to mkdir(2). */ > > if (strcmp(path, "/bye") == 0) { > printf("\tS: terminating **********\n"); > exit(EXIT_FAILURE); > } > } > } > > /* Implementation of the supervisor process: > > (1) obtains the notification file descriptor from 'sockPair[1]' > (2) handles notifications that arrive on that file descriptor. */ > > static void > supervisor(int sockPair[2]) > { > int notifyFd = recvfd(sockPair[1]); > if (notifyFd == -1) > errExit("recvfd"); > > closeSocketPair(sockPair); /* We no longer need the socket pair */ > > handleNotifications(notifyFd); > } > > int > main(int argc, char *argv[]) > { > int sockPair[2]; > > setbuf(stdout, NULL); > > if (argc < 2) { > fprintf(stderr, "At least one pathname argument is required\n"); > exit(EXIT_FAILURE); > } > > /* Create a UNIX domain socket that is used to pass the seccomp > notification file descriptor from the target process to the > supervisor process. */ > > if (socketpair(AF_UNIX, SOCK_STREAM, 0, sockPair) == -1) > errExit("socketpair"); > > /* Create a child process--the "target"--that installs seccomp > filtering. The target process writes the seccomp notification > file descriptor onto 'sockPair[0]' and then calls mkdir(2) for > each directory in the command-line arguments. */ > > (void) targetProcess(sockPair, &argv[optind]); > > /* Catch SIGCHLD when the target terminates, so that the > supervisor can also terminate. */ > > struct sigaction sa; > sa.sa_handler = sigchldHandler; > sa.sa_flags = 0; > sigemptyset(&sa.sa_mask); > if (sigaction(SIGCHLD, &sa, NULL) == -1) > errExit("sigaction"); > > supervisor(sockPair); > > exit(EXIT_SUCCESS); > } > > SEE ALSO > ioctl(2), seccomp(2) > > A further example program can be found in the kernel source file > samples/seccomp/user-trap.c. > > Linux 2020-10-01 SECCOMP_USER_NOTIF(2) Thank you so much for this documentation and example! :) -Kees [1] https://git.kernel.org/linus/dfe719fef03d752f1682fa8aeddf30ba501c8555 [2] https://lore.kernel.org/lkml/CAG48ez3kpEDO1x_HfvOM2R9M78Ach9O_4+Pjs-vLLfqvZL+13A@xxxxxxxxxxxxxx/ [3] https://lore.kernel.org/lkml/CAGXu5jKzif=vp6gn5ZtrTx-JTN367qFphobnt9s=awbaafwoUw@xxxxxxxxxxxxxx/ -- Kees Cook _______________________________________________ Containers mailing list Containers@xxxxxxxxxxxxxxxxxxxxxxxxxx https://lists.linuxfoundation.org/mailman/listinfo/containers