On Tue, 2021-11-16 at 09:24 +1100, Dave Chinner wrote: > On Mon, Nov 15, 2021 at 10:21:03AM +0100, Miklos Szeredi wrote: > > On Mon, 15 Nov 2021 at 00:18, Dave Chinner <david@xxxxxxxxxxxxx> > > wrote: > > > I just can't see how this race condition is XFS specific and why > > > fixing it requires XFS to sepcifically handle it while we ignore > > > similar theoretical issues in other filesystems... > > > > It is XFS specific, because all other filesystems RCU free the in- > > core > > inode after eviction. > > > > XFS is the only one that reuses the in-core inode object and that > > is > > very much different from anything the other filesystems do and what > > the VFS expects. > > Sure, but I was refering to the xfs_ifree issue that the patch > addressed, not the re-use issue that the *first patch addressed*. > > > I don't see how clearing the quick link buffer in > > ext4_evict_inode() > > could do anything bad. The contents are irrelevant, the lookup > > will > > be restarted anyway, the important thing is that the buffer is not > > freed and that it's null terminated, and both hold for the ext4, > > AFAICS. > > You miss the point (which, admittedly, probably wasn't clear). > > I suggested just zeroing the buffer in xfs_ifree instead of zeroing > it, which you seemed to suggest wouldn't work and we should move the > XFS functionality to .free_inode. That's what I was refering to as > "not being XFS specific" - if it is safe for ext4 to zero the link > buffer in .evict while lockless lookups can still be accessing the > link buffer, it is safe for XFS to do the same thing in .destroy > context. I'll need to think about that for a while. Zeroing the buffer while it's being used seems like a problem to me and was what this patch was trying to avoid. I thought all that would be needed for this to happen is for a dentry drop to occur while the link walk was happening after ->get_link() had returned the pointer. What have I got wrong in that thinking? > > If it isn't safe for ext4 to do that, then we have a general > pathwalk problem, not an XFS issue. But, as you say, it is safe to > do this zeroing, so the fix to xfs_ifree() is to zero the link > buffer instead of freeing it, just like ext4 does. > > As a side issue, we really don't want to move what XFS does in > .destroy_inode to .free_inode because that then means we need to add > synchronise_rcu() calls everywhere in XFS that might need to wait on > inodes being inactivated and/or reclaimed. And because inode reclaim > uses lockless rcu lookups, there's substantial danger of adding rcu > callback related deadlocks to XFS here. That's just not a direction > we should be moving in. Another reason I decided to use the ECHILD return instead is that I thought synchronise_rcu() might add an unexpected delay. Since synchronise_rcu() will only wait for processes that currently have the rcu read lock do you think that could actually be a problem in this code path? > > I'll also point out that this would require XFS inodes to pass > through *two* rcu grace periods before the memory they hold could be > freed because, as I mentioned, xfs inode reclaim uses rcu protected > inode lookups and so relies on inodes to be freed by rcu callback... > > > I tend to agree with Brian and Ian at this point: return -ECHILD > > from > > xfs_vn_get_link_inline() until xfs's inode resue vs. rcu walk > > implications are fully dealt with. No way to fix this from VFS > > alone. > > I disagree from a fundamental process POV - this is just sweeping > the issue under the table and leaving it for someone else to solve > because the root cause of the inode re-use issue has not been > identified. But to the person who architected the lockless XFS inode > cache 15 years ago, it's pretty obvious, so let's just solve it now. Sorry, I don't understand what you mean by the root cause not being identified? Until lockless path walking was introduced this wasn't a problem because references were held during walks so there could never be a final dput() to trigger freeing process during a walk. And a lot of code probably still makes that assumption. And code that does make that assumption should return -ECHILD in cases like this so that the VFS can either legitimize the struct path (by taking references) or restart the walk in ref-walk mode. Can you elaborate please? > > With the xfs_ifree() problem solved by zeroing rather than freeing, > then the only other problem is inode reuse *within an rcu grace > period*. Immediate inode reuse tends to be rare, (we can actually > trace occurrences to validate this assertion), and implementation > wise reuse is isolated to a single function: xfs_iget_recycle(). > > xfs_iget_recycle() drops the rcu_read_lock() inode lookup context > that found the inode marks it as being reclaimed (preventing other > lookups from finding it), then re-initialises the inode. This is > what makes .get_link change in the middle of pathwalk - we're > reinitialising the inode without waiting for the RCU grace period to > expire. Ok, good to know that, there's a lot of icache code to look through, ;) At this point I come back to thinking the original patch might be sufficient. But then that's only for xfs and excludes potential problems with other file systems so I'll not go there. > > The obvious thing to do here is that after we drop the RCU read > context, we simply call synchronize_rcu() before we start > re-initialising the inode to wait for the current grace period to > expire. This ensures that any pathwalk that may have found that > inode has seen the sequence number change and droppped out of > lockless mode and is no longer trying to access that inode. Then we > can safely reinitialise the inode as it has passed through a RCU > grace period just like it would have if it was freed and > reallocated. Sounds right to me, as long as it is ok to call synchronize_rcu() here. > > This completely removes the entire class of "reused inodes race with > VFS level RCU walks" bugs from the XFS inode cache implementation, > hence XFS inodes behave the same as all other filesystems w.r.t RCU > grace period expiry needing to occur before a VFS inode is reused. > > So, it looks like three patches to fix this entirely: > > 1. the pathwalk link sequence check fix > 2. zeroing the inline link buffer in xfs_ifree() I'm sorry but I'm really having trouble understanding how this is ok. If some process is using the buffer to walk a link path how can zeroing the contents of the buffer be ok? > 3. adding synchronize_rcu() (or some variant) to xfs_iget_recycle() > > Cheers, > > Dave.