Re: For review: seccomp_user_notif(2) manual page

Jann Horn <jannh@xxxxxxxxxx> · Thu, 1 Oct 2020 17:47:54 +0200

On Thu, Oct 1, 2020 at 2:54 PM Christian Brauner
<christian.brauner@xxxxxxxxxxxxx> wrote:
> On Wed, Sep 30, 2020 at 05:53:46PM +0200, Jann Horn via Containers wrote:
> > On Wed, Sep 30, 2020 at 1:07 PM Michael Kerrisk (man-pages)
> > <mtk.manpages@xxxxxxxxx> wrote:
> > > NOTES
> > >        The file descriptor returned when seccomp(2) is employed with the
> > >        SECCOMP_FILTER_FLAG_NEW_LISTENER  flag  can  be  monitored  using
> > >        poll(2), epoll(7), and select(2).  When a notification  is  pend‐
> > >        ing,  these interfaces indicate that the file descriptor is read‐
> > >        able.
> >
> > We should probably also point out somewhere that, as
> > include/uapi/linux/seccomp.h says:
> >
> >  * Similar precautions should be applied when stacking SECCOMP_RET_USER_NOTIF
> >  * or SECCOMP_RET_TRACE. For SECCOMP_RET_USER_NOTIF filters acting on the
> >  * same syscall, the most recently added filter takes precedence. This means
> >  * that the new SECCOMP_RET_USER_NOTIF filter can override any
> >  * SECCOMP_IOCTL_NOTIF_SEND from earlier filters, essentially allowing all
> >  * such filtered syscalls to be executed by sending the response
> >  * SECCOMP_USER_NOTIF_FLAG_CONTINUE. Note that SECCOMP_RET_TRACE can equally
> >  * be overriden by SECCOMP_USER_NOTIF_FLAG_CONTINUE.
> >
> > In other words, from a security perspective, you must assume that the
> > target process can bypass any SECCOMP_RET_USER_NOTIF (or
> > SECCOMP_RET_TRACE) filters unless it is completely prohibited from
> > calling seccomp(). This should also be noted over in the main
> > seccomp(2) manpage, especially the SECCOMP_RET_TRACE part.
>
> So I was actually wondering about this when I skimmed this and a while
> ago but forgot about this again... Afaict, you can only ever load a
> single filter with SECCOMP_FILTER_FLAG_NEW_LISTENER set. If there
> already is a filter with the SECCOMP_FILTER_FLAG_NEW_LISTENER property
> in the tasks filter hierarchy then the kernel will refuse to load a new
> one?
>
> static struct file *init_listener(struct seccomp_filter *filter)
> {
>         struct file *ret = ERR_PTR(-EBUSY);
>         struct seccomp_filter *cur;
>
>         for (cur = current->seccomp.filter; cur; cur = cur->prev) {
>                 if (cur->notif)
>                         goto out;
>         }
>
> shouldn't that be sufficient to guarantee that USER_NOTIF filters can't
> override each other for the same task simply because there can only ever
> be a single one?

Good point. Exceeeept that that check seems ineffective because this
happens before we take the locks that guard against TSYNC, and also
before we decide to which existing filter we want to chain the new
filter. So if two threads race with TSYNC, I think they'll be able to
chain two filters with listeners together.

I don't know whether we want to eternalize this "only one listener
across all the filters" restriction in the manpage though, or whether
the man page should just say that the kernel currently doesn't support
it but that security-wise you should assume that it might at some
point.

[...]
> > >            if (procMemFd == -1)
> > >                errExit("Supervisor: open");
> > >
> > >            /* Check that the process whose info we are accessing is still alive.
> > >               If the SECCOMP_IOCTL_NOTIF_ID_VALID operation (performed
> > >               in checkNotificationIdIsValid()) succeeds, we know that the
> > >               /proc/PID/mem file descriptor that we opened corresponds to the
> > >               process for which we received a notification. If that process
> > >               subsequently terminates, then read() on that file descriptor
> > >               will return 0 (EOF). */
> > >
> > >            checkNotificationIdIsValid(notifyFd, req->id);
> > >
> > >            /* Seek to the location containing the pathname argument (i.e., the
> > >               first argument) of the mkdir(2) call and read that pathname */
> > >
> > >            if (lseek(procMemFd, req->data.args[0], SEEK_SET) == -1)
> > >                errExit("Supervisor: lseek");
> > >
> > >            ssize_t s = read(procMemFd, path, PATH_MAX);
> > >            if (s == -1)
> > >                errExit("read");
> >
> > Why not pread() instead of lseek()+read()?
>
> With multiple arguments to be read process_vm_readv() should also be
> considered.

process_vm_readv() can end up doing each read against a different
process, which is sort of weird semantically. You would end up taking
page faults at random addresses in unrelated processes, blocking on
their mmap locks, potentially triggering their userfaultfd notifiers,
and so on.

Whereas if you first open /proc/$tid/mem, then re-check
SECCOMP_IOCTL_NOTIF_ID_VALID, and then do the read, you know that
you're only taking page faults on the process where you intended to do
it.

So until there is a variant of process_vm_readv() that operates on
pidfds, I would not recommend using that here.