Re: [PATCH] epoll: try to be a _bit_ better about file lifetimes

Christian Brauner <brauner@xxxxxxxxxx> · Mon, 6 May 2024 16:19:06 +0200

On Mon, May 06, 2024 at 11:27:04AM +0200, Christian Brauner wrote:
> On Mon, May 06, 2024 at 10:45:35AM +0200, Christian Brauner wrote:
> > > The fact is, it's not dma-buf that is violating any rules. It's epoll.
> > 
> > I agree that epoll() not taking a reference on the file is at least
> > unexpected and contradicts the usual code patterns for the sake of
> > performance and that it very likely is the case that most callers of
> > f_op->poll() don't know this.
> > 
> > Note, I cleary wrote upthread that I'm ok to do it like you suggested
> > but raised two concerns a) there's currently only one instance of
> > prolonged @file lifetime in f_op->poll() afaict and b) that there's
> > possibly going to be some performance impact on epoll().
> > 
> > So it's at least worth discussing what's more important because epoll()
> > is very widely used and it's not that we haven't favored performance
> > before.
> > 
> > But you've already said that you aren't concerned with performance on
> > epoll() upthread. So afaict then there's really not a lot more to
> > discuss other than take the patch and see whether we get any complaints.
> 
> Two closing thoughts:
> 
> (1) I wonder if this won't cause userspace regressions for the semantics
>     of epoll because dying files are now silently ignored whereas before
>     they'd generated events.
> 
> (2) The other part is that this seems to me that epoll() will now
>     temporarly pin filesystems opening up the possibility for spurious
>     EBUSY errors.
> 
>     If you register a file descriptor in an epoll instance and then
>     close it and umount the filesystem but epoll managed to do an fget()
>     on that fd before that close() call then epoll will pin that
>     filesystem.
> 
>     If the f_op->poll() method does something that can take a while
>     (blocks on a shared mutex of that subsystem) that umount is very
>     likely going to return EBUSY suddenly.
> 
>     Afaict, before that this wouldn't have been an issue at all and is
>     likely more serious than performance.
> 
>     (One option would be to only do epi_fget() for stuff like
>     dma-buf that's never unmounted. That'll cover nearly every
>     driver out there. Only "real" filesystems would have to contend with
>     @file count going to zero but honestly they also deal with dentry
>     lookup under RCU which is way more adventurous than this.)
> 
>     Maybe I'm barking up the wrong tree though.

Sorry, had to step out for an appointment.

Under the assumption that I'm not entirely off with this - and I really
could be ofc - then one possibility would be that we enable persistence
of files from f_op->poll() for SB_NOUSER filesystems.

That'll catch everything that's relying on anonymous inodes (drm and all
drivers) and init_pseudo() so everything that isn't actually unmountable
(pipefs, pidfs, sockfs, etc.).

So something like the _completely untested_ diff on top of your proposal
above:

diff --git a/fs/eventpoll.c b/fs/eventpoll.c
index a3f0f868adc4..95968a462544 100644
--- a/fs/eventpoll.c
+++ b/fs/eventpoll.c
@@ -1018,8 +1018,24 @@ static struct file *epi_fget(const struct epitem *epi)
 static __poll_t ep_item_poll(const struct epitem *epi, poll_table *pt,
                                 int depth)
 {
-       struct file *file = epi_fget(epi);
+       struct file *file = epi->ffd.file;
        __poll_t res;
+       bool unrefd = false;
+
+       /*
+        * Taking a reference for anything that isn't mountable is fine
+        * because we don't have to worry about spurious EBUSY warnings
+        * from umount().
+        *
+        * File count might go to zero in f_op->poll() for mountable
+        * filesystems.
+        */
+       if (file->f_inode->i_sb->s_flags & SB_NOUSER) {
+               unrefd = true;
+               file = epi_fget(epi);
+       } else if (file_count(file) == 0) {
+               file = NULL;
+       }

        /*
         * We could return EPOLLERR | EPOLLHUP or something,
@@ -1034,7 +1050,9 @@ static __poll_t ep_item_poll(const struct epitem *epi, poll_table *pt,
                res = vfs_poll(file, pt);
        else
                res = __ep_eventpoll_poll(file, pt, depth);
-       fput(file);
+
+       if (unrefd)
+               fput(file);
        return res & epi->event.events;
 }

Basically, my worry is that we end up with really annoying to debug
EBUSYs caused by epoll(). I'd really like to avoid that. But again, I
might be wrong and this isn't an issue.