On Tue, Sep 01, 2020 at 04:43:09PM +0800, Hillf Danton wrote:

> Below is my 2c for making it safe to access the rbtree entry and the
> file tucked in it by bumping up the file count before adding epi into
> rbtree.

NAK.  epitems, by design, do *NOT* pin files down.  That's what
eventpoll_release() is about - those references are not counting and
epitems can be taken out by the final close.  It's a user-visible
aspect of the API.

And the problem Marc's patch was trying to solve had nothing to do with
that - at the root it's the lack of suitable exclusion and the atrocious
way loop prevention and reverse path count checks are done on watch
insertion.

What happens there is that the files that would have paths added by
EPOLL_CTL_ADD (including those reached via several intermediate epoll
files) are collected into a temporary list and then checked for
excessive reverse paths.  The list is emptied before epoll_ctl()
returns.  And that's _the_ list - if epoll_ctl() decides to go there in
the first place, it grabs a system-wide mutex and holds it for as long
as it's playing around.

Everything would have worked, if not for the fact that

  * the bloody list goes through struct file instances.  That's a bad
    decision, for a lot of reasons.

  * in particular, files can go away while epoll_ctl() is playing
    around.  The same system-wide mutex would have blocked their
    freeing (modulo a narrow race in eventpoll_release()) *IF* they
    remained attached to epitems.  However, an explicit EPOLL_CTL_DEL
    removing such a beast in the middle of the EPOLL_CTL_ADD checks
    will remove those epitems without touching that mutex.

Marc's patch tried to brute-force the protection of files on that
temporary list; what it had missed was that a file on it could have
already been committed to getting killed - f_count already zero, it
just hadn't gotten through __fput() yet.  In such cases we don't want
to do any reverse path checks for that sucker - it *is* going away, so
there's no point considering it.  We can't blindly bump the refcount,
though, for obvious reasons.  That struct file can't get freed right
under the code inserting it into the list - the locks held at that
moment (ep->mtx) are more than enough.  It's what can happen to it
while it sits on the list that is the problem.

Of course, as a long-term solution we want to have that crap with the
temporary list going through struct file instances taken out and moved
to the epitems themselves; the check does scan all epitems for every
file in the set anyway, and we bloody well could gather those into the
list at the same time.  Then we'd only need to protect the list walking
vs. removals (with a spinlock, for all we care).  That's too invasive a
change for -stable, though, and I'm still digging through the
fs/eventpoll.c locking - it's much too convoluted and it's gotten no
attention for a decade ;-/
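
To make the "skip the ones already committed to getting killed" part
concrete, here's a minimal sketch of the kind of test meant above - not
the actual fs/eventpoll.c code and not a proposed patch, and the helper
name is made up.  It only keeps the target file on tfile_check_list if
its f_count can still be bumped from a non-zero value; whoever drains
the list afterwards would have to drop that reference again:

static bool ep_grab_for_check(struct file *file)	/* hypothetical helper */
{
	/* ep->mtx is held here, so the struct file can't be freed under us */
	if (!atomic_long_inc_not_zero(&file->f_count))
		return false;	/* f_count already zero - it's going away, don't check it */
	/* safe to keep it on the temporary reverse path check list */
	list_add(&file->f_tfile_llink, &tfile_check_list);
	return true;
}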