On Mon, 6 May 2024 at 10:46, Stefan Metzmacher <metze@xxxxxxxxx> wrote: > > I think it's a very important detail that epoll does not take > real references. Otherwise an application level 'close()' on a socket > would not trigger a tcp disconnect, when an fd is still registered with > epoll. Yes, exactly. epoll() ends up actually wanting the lifetime of the ep item be the lifetime of the file _descriptor_, not the lifetime of the file itself. We approximate that - badly - with epoll not taking a reference on the file pointer, and then at final fput() it tears things down. But that has two real issues, and one of them is that "oh, now epoll has file pointers that are actually dead" that caused this thread. The other issue is that "approximates" thing, where it means that duplicating the file pointer (dup*() and fork() end unix socket file sending etc) will not mean that the epoll ref is also out of sync with the lifetime of the file descriptor. That's why I suggested that "clean up epoll references at file_close_fd() time instead: https://lore.kernel.org/all/CAHk-=wj6XL9MGCd_nUzRj6SaKeN0TsyTTZDFpGdW34R+zMZaSg@xxxxxxxxxxxxxx/ because it would actually really *fix* the lifetime issue of ep items. In the process, it would make it possible to actually take a f_count reference at EPOLL_CTL_ADD time, since now the lifetime of the EP wouldn't be tied to the lifetime of the 'struct file *' pointer, it would be properly tied to the lifetime of the actual file descriptor that you are adding. So it would be a huge conceptual cleanup too. (Of course - at that point EPOLL_CTL_ADD still doesn't actually _need_ a reference to the file, since the file being open in itself is already that reference - but the point here being that there would *be* a reference that the epoll code would effectively have, and you'd never be in the situation we were in where there would be stale "dead" file pointers that just haven't gone through the cleanup yet). But I'd rather not touch the epoll code more than I have to. Which is why I applied the minimal patch for just "refcount over vfs_poll()", and am just mentioning my suggestion in the hope that some eager beaver would like to see how painful it would do to make the bigger surgery... Linus