On Wed, Sep 30, 2020 at 05:53:46PM +0200, Jann Horn via Containers wrote: > On Wed, Sep 30, 2020 at 1:07 PM Michael Kerrisk (man-pages) > <mtk.manpages@xxxxxxxxx> wrote: > > I knew it would be a big ask, but below is kind of the manual page > > I was hoping you might write [1] for the seccomp user-space notification > > mechanism. Since you didn't (and because 5.9 adds various new pieces > > such as SECCOMP_ADDFD_FLAG_SETFD and SECCOMP_IOCTL_NOTIF_ADDFD > > that also will need documenting [2]), I did :-). But of course I may > > have made mistakes... > [...] > > NAME > > seccomp_user_notif - Seccomp user-space notification mechanism > > > > SYNOPSIS > > #include <linux/seccomp.h> > > #include <linux/filter.h> > > #include <linux/audit.h> > > > > int seccomp(unsigned int operation, unsigned int flags, void *args); > > Should the ioctl() calls be listed here, similar to e.g. the SYNOPSIS > of the ioctl_* manpages? > > > DESCRIPTION > > This page describes the user-space notification mechanism pro‐ > > vided by the Secure Computing (seccomp) facility. As well as the > > use of the SECCOMP_FILTER_FLAG_NEW_LISTENER flag, the SEC‐ > > COMP_RET_USER_NOTIF action value, and the SECCOMP_GET_NOTIF_SIZES > > operation described in seccomp(2), this mechanism involves the > > use of a number of related ioctl(2) operations (described below). > > > > Overview > > In conventional usage of a seccomp filter, the decision about how > > to treat a particular system call is made by the filter itself. > > The user-space notification mechanism allows the handling of the > > system call to instead be handed off to a user-space process. > > The advantages of doing this are that, by contrast with the sec‐ > > comp filter, which is running on a virtual machine inside the > > kernel, the user-space process has access to information that is > > unavailable to the seccomp filter and it can perform actions that > > can't be performed from the seccomp filter. > > > > In the discussion that follows, the process that has installed > > the seccomp filter is referred to as the target, and the process > > Technically, this definition of "target" is a bit inaccurate because: > > - seccomp filters are inherited > - seccomp filters apply to threads, not processes > - seccomp filters can be semi-remotely installed via TSYNC > > (I assume that in manpages, we should try to go for the "a task is a > thread and a thread group is a process" definition, right?) > > Perhaps "the threads on which the seccomp filter is installed are > referred to as the target", or something like that would be better? > > > that is notified by the user-space notification mechanism is > > referred to as the supervisor. An overview of the steps per‐ > > formed by these two processes is as follows: > > > > 1. The target process establishes a seccomp filter in the usual > > manner, but with two differences: > > > > · The seccomp(2) flags argument includes the flag SECCOMP_FIL‐ > > TER_FLAG_NEW_LISTENER. Consequently, the return value of > > the (successful) seccomp(2) call is a new "listening" file > > descriptor that can be used to receive notifications. > > > > · In cases where it is appropriate, the seccomp filter returns > > the action value SECCOMP_RET_USER_NOTIF. This return value > > will trigger a notification event. > > > > 2. In order that the supervisor process can obtain notifications > > using the listening file descriptor, (a duplicate of) that > > file descriptor must be passed from the target process to the > > supervisor process. One way in which this could be done is by > > passing the file descriptor over a UNIX domain socket connec‐ > > tion between the two processes (using the SCM_RIGHTS ancillary > > message type described in unix(7)). Another possibility is > > that the supervisor might inherit the file descriptor via > > fork(2). > > With the caveat that if the supervisor inherits the file descriptor > via fork(), that (more or less) implies that the supervisor is subject > to the same filter (although it could bypass the filter using a helper > thread that responds SECCOMP_USER_NOTIF_FLAG_CONTINUE, but I don't > expect any clean software to do that). > > > 3. The supervisor process will receive notification events on the > > listening file descriptor. These events are returned as > > structures of type seccomp_notif. Because this structure and > > its size may evolve over kernel versions, the supervisor must > > first determine the size of this structure using the sec‐ > > comp(2) SECCOMP_GET_NOTIF_SIZES operation, which returns a > > structure of type seccomp_notif_sizes. The supervisor allo‐ > > cates a buffer of size seccomp_notif_sizes.seccomp_notif bytes > > to receive notification events. In addition,the supervisor > > allocates another buffer of size seccomp_notif_sizes.sec‐ > > comp_notif_resp bytes for the response (a struct sec‐ > > comp_notif_resp structure) that it will provide to the kernel > > (and thus the target process). > > > > 4. The target process then performs its workload, which includes > > system calls that will be controlled by the seccomp filter. > > Whenever one of these system calls causes the filter to return > > the SECCOMP_RET_USER_NOTIF action value, the kernel does not > > execute the system call; instead, execution of the target > > process is temporarily blocked inside the kernel and a notifi‐ > > where "blocked" refers to the interruptible, restartable kind - if the > child receives a signal with an SA_RESTART signal handler in the > meantime, it'll leave the syscall, go through the signal handler, then > restart the syscall again and send the same request to the supervisor > again. so the supervisor may see duplicate syscalls. > > What's really gross here is that signal(7) promises that some syscalls > like epoll_wait(2) never restart, but seccomp doesn't know about that; > if userspace installs a filter that uses SECCOMP_RET_USER_NOTIF for a > non-restartable syscall, the result is that UAPI gets broken a little > bit. Luckily normal users of seccomp probably won't use > SECCOMP_RET_USER_NOTIF for restartable syscalls, but if someone does > want to do that, we might have to add some "suppress syscall > restarting" flag into the seccomp action value, or something like > that... yuck. > > > cation event is generated on the listening file descriptor. > > > > 5. The supervisor process can now repeatedly monitor the listen‐ > > ing file descriptor for SECCOMP_RET_USER_NOTIF-triggered > > events. To do this, the supervisor uses the SEC‐ > > COMP_IOCTL_NOTIF_RECV ioctl(2) operation to read information > > about a notification event; this operation blocks until an > > (interruptably - but I guess that maybe doesn't have to be said > explicitly here?) > > > event is available. > > Maybe we should note here that you can use the multi-fd-polling APIs > (select/poll/epoll) instead, and that if the notification goes away > before you call SECCOMP_IOCTL_NOTIF_RECV, the ioctl will return > -ENOENT instead of blocking, and therefore as long as nobody else > reads from the same fd, you can assume that after the fd reports as > readable, you can call SECCOMP_IOCTL_NOTIF_RECV once without blocking. > > Exceeeeept that this part looks broken: > > if (mutex_lock_interruptible(&filter->notify_lock) < 0) > return EPOLLERR; > > which I think means that we can have a race where a signal arrives > while poll() is trying to add itself to the waitqueue of the seccomp > fd, and then we'll get a spurious error condition reported on the fd. > That's a kernel bug, I'd say. > > > The operation returns a seccomp_notif > > structure containing information about the system call that is > > being attempted by the target process. > > > > 6. The seccomp_notif structure returned by the SEC‐ > > COMP_IOCTL_NOTIF_RECV operation includes the same information > > (a seccomp_data structure) that was passed to the seccomp fil‐ > > ter. This information allows the supervisor to discover the > > system call number and the arguments for the target process's > > system call. In addition, the notification event contains the > > PID of the target process. > > That's a PIDTYPE_PID, which the manpages call a "thread ID". > > > The information in the notification can be used to discover > > the values of pointer arguments for the target process's sys‐ > > tem call. (This is something that can't be done from within a > > seccomp filter.) To do this (and assuming it has suitable > > permissions), the supervisor opens the corresponding > > /proc/[pid]/mem file, > > ... which means that here we might have to get into the weeds of how > actually /proc has invisible directories for every TID, even though > only the ones for PIDs are visible, and therefore you can just open > /proc/[tid]/mem and it'll work fine? > > > seeks to the memory location that corre‐ > > sponds to one of the pointer arguments whose value is supplied > > in the notification event, and reads bytes from that location. > > (The supervisor must be careful to avoid a race condition that > > can occur when doing this; see the description of the SEC‐ > > COMP_IOCTL_NOTIF_ID_VALID ioctl(2) operation below.) In addi‐ > > tion, the supervisor can access other system information that > > is visible in user space but which is not accessible from a > > seccomp filter. > > > > ┌─────────────────────────────────────────────────────┐ > > │FIXME │ > > ├─────────────────────────────────────────────────────┤ > > │Suppose we are reading a pathname from /proc/PID/mem │ > > │for a system call such as mkdir(). The pathname can │ > > │be an arbitrary length. How do we know how much (how │ > > │many pages) to read from /proc/PID/mem? │ > > └─────────────────────────────────────────────────────┘ > > It can't be an arbitrary length. While pathnames *returned* from the > kernel in some places can have different limits, strings supplied as > path arguments *to* the kernel AFAIK always have an upper limit of > PATH_MAX, else you get -ENAMETOOLONG. See getname_flags(). > > > 7. Having obtained information as per the previous step, the > > supervisor may then choose to perform an action in response to > > the target process's system call (which, as noted above, is > > not executed when the seccomp filter returns the SEC‐ > > COMP_RET_USER_NOTIF action value). > > (unless SECCOMP_USER_NOTIF_FLAG_CONTINUE is used) > > > One example use case here relates to containers. The target > > process may be located inside a container where it does not > > have sufficient capabilities to mount a filesystem in the con‐ > > tainer's mount namespace. However, the supervisor may be a > > more privileged process that that does have sufficient capa‐ > > nit: s/that that/that/ > > > bilities to perform the mount operation. > > > > 8. The supervisor then sends a response to the notification. The > > information in this response is used by the kernel to con‐ > > struct a return value for the target process's system call and > > provide a value that will be assigned to the errno variable of > > the target process. > > > > The response is sent using the SECCOMP_IOCTL_NOTIF_RECV > > ioctl(2) operation, which is used to transmit a sec‐ > > comp_notif_resp structure to the kernel. This structure > > includes a cookie value that the supervisor obtained in the > > seccomp_notif structure returned by the SEC‐ > > COMP_IOCTL_NOTIF_RECV operation. This cookie value allows the > > kernel to associate the response with the target process. > > (unless if the target thread entered a signal handler or was killed in > the meantime) > > > 9. Once the notification has been sent, the system call in the > > target process unblocks, returning the information that was > > provided by the supervisor in the notification response. > > > > As a variation on the last two steps, the supervisor can send a > > response that tells the kernel that it should execute the target > > process's system call; see the discussion of SEC‐ > > COMP_USER_NOTIF_FLAG_CONTINUE, below. > > > > ioctl(2) operations > > The following ioctl(2) operations are provided to support seccomp > > user-space notification. For each of these operations, the first > > (file descriptor) argument of ioctl(2) is the listening file > > descriptor returned by a call to seccomp(2) with the SECCOMP_FIL‐ > > TER_FLAG_NEW_LISTENER flag. > > > > SECCOMP_IOCTL_NOTIF_RECV > > This operation is used to obtain a user-space notification > > event. If no such event is currently pending, the opera‐ > > tion blocks until an event occurs. > > Not necessarily; for every time a process entered a signal handler or > was killed while a notification was pending, a call to > SECCOMP_IOCTL_NOTIF_RECV will return -ENOENT. > > > The third ioctl(2) > > argument is a pointer to a structure of the following form > > which contains information about the event. This struc‐ > > ture must be zeroed out before the call. > > > > struct seccomp_notif { > > __u64 id; /* Cookie */ > > __u32 pid; /* PID of target process */ > > (TID, not PID) > > > __u32 flags; /* Currently unused (0) */ > > struct seccomp_data data; /* See seccomp(2) */ > > }; > > > > The fields in this structure are as follows: > > > > id This is a cookie for the notification. Each such > > cookie is guaranteed to be unique for the corre‐ > > sponding seccomp filter. In other words, this > > cookie is unique for each notification event from > > the target process. > > That sentence about "target process" looks wrong to me. The cookies > are unique across notifications from the filter, but there can be > multiple filters per thread, and multiple threads per filter. > > > The cookie value has the fol‐ > > lowing uses: > > > > · It can be used with the SEC‐ > > COMP_IOCTL_NOTIF_ID_VALID ioctl(2) operation to > > verify that the target process is still alive. > > > > · When returning a notification response to the > > kernel, the supervisor must include the cookie > > value in the seccomp_notif_resp structure that is > > specified as the argument of the SEC‐ > > COMP_IOCTL_NOTIF_SEND operation. > > > > pid This is the PID of the target process that trig‐ > > gered the notification event. > > > > ┌─────────────────────────────────────────────────────┐ > > │FIXME │ > > ├─────────────────────────────────────────────────────┤ > > │This is a thread ID, rather than a PID, right? │ > > └─────────────────────────────────────────────────────┘ > > Yeah. > > > > > flags This is a bit mask of flags providing further > > information on the event. In the current implemen‐ > > tation, this field is always zero. > > > > data This is a seccomp_data structure containing infor‐ > > mation about the system call that triggered the > > notification. This is the same structure that is > > passed to the seccomp filter. See seccomp(2) for > > details of this structure. > > > > On success, this operation returns 0; on failure, -1 is > > returned, and errno is set to indicate the cause of the > > error. This operation can fail with the following errors: > > > > EINVAL (since Linux 5.5) > > The seccomp_notif structure that was passed to the > > call contained nonzero fields. > > > > ENOENT The target process was killed by a signal as the > > notification information was being generated. > > Not just killed, interruption with a signal handler has the same effect. > > > ┌─────────────────────────────────────────────────────┐ > > │FIXME │ > > ├─────────────────────────────────────────────────────┤ > > │From my experiments, it appears that if a SEC‐ │ > > │COMP_IOCTL_NOTIF_RECV is done after the target │ > > │process terminates, then the ioctl() simply blocks │ > > │(rather than returning an error to indicate that the │ > > │target process no longer exists). │ > > │ │ > > │I found that surprising, and it required some con‐ │ > > │tortions in the example program. It was not possi‐ │ > > │ble to code my SIGCHLD handler (which reaps the zom‐ │ > > │bie when the worker/target process terminates) to │ > > │simply set a flag checked in the main handleNotifi‐ │ > > │cations() loop, since this created an unavoidable │ > > │race where the child might terminate just after I │ > > │had checked the flag, but before I blocked (for‐ │ > > │ever!) in the SECCOMP_IOCTL_NOTIF_RECV operation. │ > > │Instead, I had to code the signal handler to simply │ > > │call _exit(2) in order to terminate the parent │ > > │process (the supervisor). │ > > │ │ > > │Is this expected behavior? It seems to me rather │ > > │desirable that SECCOMP_IOCTL_NOTIF_RECV should give │ > > │an error if the target process has terminated. │ > > └─────────────────────────────────────────────────────┘ > > You could poll() the fd first. But yeah, it'd probably be a good idea > to change that. > > > SECCOMP_IOCTL_NOTIF_ID_VALID > [...] > > In the above scenario, the risk is that the supervisor may > > try to access the memory of a process other than the tar‐ > > get. This race can be avoided by following the call to > > open with a SECCOMP_IOCTL_NOTIF_ID_VALID operation to ver‐ > > ify that the process that generated the notification is > > still alive. (Note that if the target process subse‐ > > quently terminates, its PID won't be reused because there > > That's wrong, the PID can be reused, but the /proc/$pid directory is > internally not associated with the numeric PID, but, conceptually > speaking, with a specific incarnation of the PID, or something like > that. (Actually, it is associated with the "struct pid", which is not > reused, instead of the numeric PID.) > > > remains an open reference to the /proc[pid]/mem file; in > > this case, a subsequent read(2) from the file will return > > 0, indicating end of file.) > > > > On success (i.e., the notification ID is still valid), > > this operation returns 0 On failure (i.e., the notifica‐ > > nit: s/returns 0/returns 0./ > > > tion ID is no longer valid), -1 is returned, and errno is > > set to ENOENT. > > > > SECCOMP_IOCTL_NOTIF_SEND > [...] > > Two kinds of response are possible: > > > > · A response to the kernel telling it to execute the tar‐ > > get process's system call. In this case, the flags > > field includes SECCOMP_USER_NOTIF_FLAG_CONTINUE and the > > error and val fields must be zero. > > > > This kind of response can be useful in cases where the > > supervisor needs to do deeper analysis of the target's > > system call than is possible from a seccomp filter > > (e.g., examining the values of pointer arguments), and, > > having verified that the system call is acceptable, the > > supervisor wants to allow it to proceed. > > "allow" sounds as if this is an access control thing, but this > mechanism should usually not be used for access control (unless the > "seccomp" syscall is blocked). Maybe reword as "having decided that > the system call does not require emulation by the supervisor, the > supervisor wants it to execute normally", or something like that? > > [...] > > On success, this operation returns 0; on failure, -1 is > > returned, and errno is set to indicate the cause of the > > error. This operation can fail with the following errors: > > > > EINPROGRESS > > A response to this notification has already been > > sent. > > > > EINVAL An invalid value was specified in the flags field. > > > > EINVAL The flags field contained SEC‐ > > COMP_USER_NOTIF_FLAG_CONTINUE, and the error or val > > field was not zero. > > > > ENOENT The blocked system call in the target process has > > been interrupted by a signal handler. > > (you could also get this if a response has already been sent, instead > of EINPROGRESS - the only difference is whether the target thread has > picked up the response yet) > > > NOTES > > The file descriptor returned when seccomp(2) is employed with the > > SECCOMP_FILTER_FLAG_NEW_LISTENER flag can be monitored using > > poll(2), epoll(7), and select(2). When a notification is pend‐ > > ing, these interfaces indicate that the file descriptor is read‐ > > able. > > We should probably also point out somewhere that, as > include/uapi/linux/seccomp.h says: > > * Similar precautions should be applied when stacking SECCOMP_RET_USER_NOTIF > * or SECCOMP_RET_TRACE. For SECCOMP_RET_USER_NOTIF filters acting on the > * same syscall, the most recently added filter takes precedence. This means > * that the new SECCOMP_RET_USER_NOTIF filter can override any > * SECCOMP_IOCTL_NOTIF_SEND from earlier filters, essentially allowing all > * such filtered syscalls to be executed by sending the response > * SECCOMP_USER_NOTIF_FLAG_CONTINUE. Note that SECCOMP_RET_TRACE can equally > * be overriden by SECCOMP_USER_NOTIF_FLAG_CONTINUE. > > In other words, from a security perspective, you must assume that the > target process can bypass any SECCOMP_RET_USER_NOTIF (or > SECCOMP_RET_TRACE) filters unless it is completely prohibited from > calling seccomp(). This should also be noted over in the main > seccomp(2) manpage, especially the SECCOMP_RET_TRACE part. So I was actually wondering about this when I skimmed this and a while ago but forgot about this again... Afaict, you can only ever load a single filter with SECCOMP_FILTER_FLAG_NEW_LISTENER set. If there already is a filter with the SECCOMP_FILTER_FLAG_NEW_LISTENER property in the tasks filter hierarchy then the kernel will refuse to load a new one? static struct file *init_listener(struct seccomp_filter *filter) { struct file *ret = ERR_PTR(-EBUSY); struct seccomp_filter *cur; for (cur = current->seccomp.filter; cur; cur = cur->prev) { if (cur->notif) goto out; } shouldn't that be sufficient to guarantee that USER_NOTIF filters can't override each other for the same task simply because there can only ever be a single one? > > > > EXAMPLES > [...] > > This program can used to demonstrate various aspects of the > > nit: "can be used to demonstrate", or alternatively just "demonstrates" > > > behavior of the seccomp user-space notification mechanism. To > > help aid such demonstrations, the program logs various messages > > to show the operation of the target process (lines prefixed "T:") > > and the supervisor (indented lines prefixed "S:"). > [...] > > Program source > [...] > > #define errExit(msg) do { perror(msg); exit(EXIT_FAILURE); \ > > } while (0) > > Don't we have err() for this? > > > /* Send the file descriptor 'fd' over the connected UNIX domain socket > > 'sockfd'. Returns 0 on success, or -1 on error. */ > > > > static int > > sendfd(int sockfd, int fd) > > { > > struct msghdr msgh; > > struct iovec iov; > > int data; > > struct cmsghdr *cmsgp; > > > > /* Allocate a char array of suitable size to hold the ancillary data. > > However, since this buffer is in reality a 'struct cmsghdr', use a > > union to ensure that it is suitable aligned. */ > > nit: suitably > > > union { > > char buf[CMSG_SPACE(sizeof(int))]; > > /* Space large enough to hold an 'int' */ > > struct cmsghdr align; > > } controlMsg; > > > > /* The 'msg_name' field can be used to specify the address of the > > destination socket when sending a datagram. However, we do not > > need to use this field because 'sockfd' is a connected socket. */ > > > > msgh.msg_name = NULL; > > msgh.msg_namelen = 0; > > > > /* On Linux, we must transmit at least one byte of real data in > > order to send ancillary data. We transmit an arbitrary integer > > whose value is ignored by recvfd(). */ > > > > msgh.msg_iov = &iov; > > msgh.msg_iovlen = 1; > > iov.iov_base = &data; > > iov.iov_len = sizeof(int); > > data = 12345; > > > > /* Set 'msghdr' fields that describe ancillary data */ > > > > msgh.msg_control = controlMsg.buf; > > msgh.msg_controllen = sizeof(controlMsg.buf); > > > > /* Set up ancillary data describing file descriptor to send */ > > > > cmsgp = CMSG_FIRSTHDR(&msgh); > > cmsgp->cmsg_level = SOL_SOCKET; > > cmsgp->cmsg_type = SCM_RIGHTS; > > cmsgp->cmsg_len = CMSG_LEN(sizeof(int)); > > memcpy(CMSG_DATA(cmsgp), &fd, sizeof(int)); > > > > /* Send real plus ancillary data */ > > > > if (sendmsg(sockfd, &msgh, 0) == -1) > > return -1; > > > > return 0; > > } > > Instead of using unix domain sockets to send the fd to the parent, I > think you could also use clone3() with flags==CLONE_FILES|SIGCHLD, > dup2() the seccomp fd to an fd that was reserved in the parent, call > unshare(CLONE_FILES) in the child after setting up the seccomp fd, and > wake up the parent with something like pthread_cond_signal()? I'm not > sure whether that'd look better or worse in the end though, so maybe > just ignore this comment. (If the target process exec's (rather fast) then VFORK can be useful.) > > [...] > > /* Access the memory of the target process in order to discover the > > pathname that was given to mkdir() */ > > > > static void > > getTargetPathname(struct seccomp_notif *req, int notifyFd, > > char *path, size_t len) > > { > > char procMemPath[PATH_MAX]; > > snprintf(procMemPath, sizeof(procMemPath), "/proc/%d/mem", req->pid); > > > > int procMemFd = open(procMemPath, O_RDONLY); > > Should example code like this maybe use O_CLOEXEC unless the fd in > question actually has to be inheritable? I know it doesn't actually > matter here, but if this code was used in a multi-threaded context, it > might. Agreed, about the O_CLOEXEC part. > > > if (procMemFd == -1) > > errExit("Supervisor: open"); > > > > /* Check that the process whose info we are accessing is still alive. > > If the SECCOMP_IOCTL_NOTIF_ID_VALID operation (performed > > in checkNotificationIdIsValid()) succeeds, we know that the > > /proc/PID/mem file descriptor that we opened corresponds to the > > process for which we received a notification. If that process > > subsequently terminates, then read() on that file descriptor > > will return 0 (EOF). */ > > > > checkNotificationIdIsValid(notifyFd, req->id); > > > > /* Seek to the location containing the pathname argument (i.e., the > > first argument) of the mkdir(2) call and read that pathname */ > > > > if (lseek(procMemFd, req->data.args[0], SEEK_SET) == -1) > > errExit("Supervisor: lseek"); > > > > ssize_t s = read(procMemFd, path, PATH_MAX); > > if (s == -1) > > errExit("read"); > > Why not pread() instead of lseek()+read()? With multiple arguments to be read process_vm_readv() should also be considered.