On Tue, Jan 18, 2022 at 07:20:35PM +0000, Al Viro wrote:
> On Tue, Jan 18, 2022 at 01:25:15PM -0500, Brian Foster wrote:
> 
> > If I go back to the inactive -> reclaimable grace period variant and
> > also insert a start_poll_synchronize_rcu() and
> > poll_state_synchronize_rcu() pair across the inactive processing
> > sequence, I start seeing numbers closer to ~36k cycles. IOW, the
> > xfs_inodegc_inactivate() helper looks something like this:
> > 
> > 	if (poll_state_synchronize_rcu(ip->i_destroy_gp))
> > 		xfs_inodegc_set_reclaimable(ip);
> > 	else
> > 		call_rcu(&VFS_I(ip)->i_rcu, xfs_inodegc_set_reclaimable_callback);
> > 
> > ... to skip the rcu grace period if one had already passed while the
> > inode sat on the inactivation queue and was processed.
> > 
> > However my box went haywire shortly after with rcu stall reports or
> > something if I let that continue to run longer, so I'll probably have
> > to look into that lockdep splat (complaining about &pag->pag_ici_lock
> > in rcu context, perhaps needs to become irq safe?) or see if something
> > else is busted..
> 
> Umm... Could you (or Dave) describe where the mainline does the RCU
> delay mentioned several times in these threads, in the case of
> 	* unlink()
> 	* overwriting rename()
> 	* open() + unlink() + close() (that one, obviously, for regular files)
> 

If you're referring to the existing rcu delay in XFS, I suspect that's
xfs_reclaim_inode(). The last bits of that function remove the inode
from the perag tree and then call __xfs_inode_free(), which frees the
inode via rcu callback.

BTW, I think the experiment above is either going to require an audit
to make the various _set_reclaimable() locks irq safe, or something a
bit uglier like putting the inode back on a workqueue after the rcu
delay to make the state transition. Given the incremental improvement
from using start_poll_synchronize_rcu(), I replaced the above destroy
side code with a cond_synchronize_rcu(ip->i_destroy_gp) call on the
iget/recycle side and see similar results (~36k cycles per 60s, but so
far without any explosions). A rough sketch of that variant is appended
after the quoted text below.

Brian

> The thing is, if we already do have an RCU delay in there, there might
> be a better solution - making sure it happens downstream of d_drop()
> (in case the dentry has other references) or of the dentry going
> negative (in case we are holding the sole reference to it).
> 
> If we can do that, RCU'd dcache lookups won't run into inode reuse at
> all. Sure, right now d_delete() comes last *and* we want the target
> inode to stay around past the call of ->unlink(). But if you look at
> do_unlinkat() you'll see an interesting-looking wart with ihold/iput
> around vfs_unlink(). Not sure if you remember the story on that one;
> basically, it's "we'd rather have possible IO on inode freeing happen
> outside of the lock on the parent".
> 
> nfsd and mqueue do the same thing; ksmbd does not. OTOH, ksmbd appears
> to force the "make victim go unhashed, sole reference or not". ecryptfs
> definitely does that forcing (deliberately so).
> 
> That area could use some rethinking, and if we can deal with the inode
> reuse delay while we are at it...
> 
> Overwriting rename() is also interesting in that respect, of course.
> 
> I can go and try to RTFS my way through xfs iget-related code, but I'd
> obviously prefer to do so with at least some overview of that thing
> from the folks familiar with it. Seeing that it's a lockless search
> structure, missing something subtle there is all too easy, especially
> with the lookup-by-fhandle stuff in the mix...
> 
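
(Appendix: a minimal sketch of the destroy/recycle variant mentioned
above, in case it helps. ip->i_destroy_gp is the experimental grace
period cookie field from the earlier patch; the call sites named in the
comments are only where I'd expect this to land, i.e. the
->destroy_inode path on the destroy side and the iget cache-hit recycle
path in xfs_icache.c on the lookup side, not necessarily the final
placement.)

	/*
	 * Destroy side: sample a polled grace period cookie when the
	 * inode heads down the destroy/inactivation path (somewhere
	 * under ->destroy_inode).
	 */
	ip->i_destroy_gp = start_poll_synchronize_rcu();

	/*
	 * iget/recycle side: before recycling the inode (in the iget
	 * cache-hit recycle path), wait for a full grace period only
	 * if one has not already elapsed since the cookie was sampled.
	 */
	cond_synchronize_rcu(ip->i_destroy_gp);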