Re: [PATCH] epoll: try to be a _bit_ better about file lifetimes

Al Viro <viro@xxxxxxxxxxxxxxxxxx> · Sun, 5 May 2024 20:46:03 +0100

On Sat, May 04, 2024 at 08:53:47AM -0700, Linus Torvalds wrote:

>   poll_wait
>     -> __pollwait
>      -> get_file (*boom*)
> 
> but the boom is very small because the poll_wait() will be undone by
> poll_freewait(), and normally poll/select has held the file count
> elevated.

Not quite.  It's not that poll_wait() calls __pollwait(); it calls
whatever callback that caller of ->poll() has set for it.

__pollwait users (select(2) and poll(2), currently) must (and do) make
sure that refcount is elevated; others (and epoll is not the only one)
need to do whatever's right for their callbacks.

I've no problem with having epoll grab a reference, but if we make that
a universal requirement ->poll() instances can rely upon, we'd better
verify that *all* vfs_poll() are OK.  And that ought to go into
Documentation/filesystems/porting.rst ("callers of vfs_poll() must
make sure that file is pinned; ->poll() instances may rely upon that,
but they'd better be very careful about grabbing extra references themselves -
it's acceptable for files on internal mounts, but *NOT* for anything on
mountable filesystems.  Any instance that does it needs an explicit
comment telling not to reuse that blindly." or something along those
lines).

Excluding epoll, select/poll and callers that have just done fdget() and will
do fdput() after vfs_poll(), we have this:

drivers/vhost/vhost.c:213:      mask = vfs_poll(file, &poll->table);
	vhost_poll_start().  Might get interesting...  Calls working
with vq->kick as file seem to rely upon vq->mutex, but I'll need to
refresh my memories of that code to check if that's all we need - and
then there's vhost_net_enable_vq(), which also needs an audit.

fs/aio.c:1738:          mask = vfs_poll(req->file, &pt) & req->events;
fs/aio.c:1932:  mask = vfs_poll(req->file, &apt.pt) & req->events;
	aio_poll() and aio_poll_wake() resp.  req->file here is actually ->ki_filp
	of iocb that contains work as iocb->req.work; it get dropped only in
	iocb_destroy(), which also frees iocb.  Any call that might've run into
	req->file not pinned is already in UAF land.

io_uring/poll.c:303:                    req->cqe.res = vfs_poll(req->file, &pt) & req->apoll_events;
io_uring/poll.c:622:    mask = vfs_poll(req->file, &ipt->pt) & poll->events;
	Should have req->file pinned, but I'll need to RTFS a bit for
details.  That, or ask Jens to confirm...

net/9p/trans_fd.c:236:  ret = vfs_poll(ts->rd, pt);
net/9p/trans_fd.c:238:          ret = (ret & ~EPOLLOUT) | (vfs_poll(ts->wr, pt) & ~EPOLLIN);
	p9_fd_poll(); ->rd and ->wr are pinned and won't get dropped until
p9_fd_close(), which frees ts immediately afterwards.  IOW, if we risk
being called with ->rd or ->wr not pinned, we are in UAF land already.
Incidentally, what the hell is this in p9_fd_open()?
         * It's technically possible for userspace or concurrent mounts to
         * modify this flag concurrently, which will likely result in a
         * broken filesystem. However, just having bad flags here should
         * not crash the kernel or cause any other sort of bug, so mark this
         * particular data race as intentional so that tooling (like KCSAN)
         * can allow it and detect further problems.
         */
Why not simply fix the race instead?  As in
	spin_lock(&ts->rd->f_lock);
        ts->rd->f_flags |= O_NONBLOCK;
	spin_unlock(&ts->rd->f_lock);
and similar for ts->wr?  Sigh...