On Thu, Jan 20, 2022 at 11:03:28AM -0500, Brian Foster wrote:
> On Thu, Jan 20, 2022 at 09:07:47AM +1100, Dave Chinner wrote:
> > On Wed, Jan 19, 2022 at 09:08:22AM -0500, Brian Foster wrote:
> > > On Wed, Jan 19, 2022 at 10:25:47AM +1100, Dave Chinner wrote:
> > > > On Tue, Jan 18, 2022 at 05:58:14AM +0000, Al Viro wrote:
> > > > > On Tue, Jan 18, 2022 at 03:12:53PM +1100, Dave Chinner wrote:
> > > > Unfortunately, the simple fix of adding synchronize_rcu() to
> > > > xfs_iget_recycle() causes significant performance regressions
> > > > because we hit this path quite frequently when workloads use lots of
> > > > temporary files - the on-disk inode allocator policy tends towards
> > > > aggressive re-use of inodes for small sets of temporary files.
> > > > 
> > > > The problem XFS is trying to address is that the VFS inode lifecycle
> > > > does not cater for filesystems that need to both dirty and then
> > > > clean unlinked inodes between iput_final() and ->destroy_inode. It's
> > > > too late to be able to put the inode back on the LRU once we've
> > > > decided to drop the inode if we need to dirty it again. And because
> > > > evict() is part of the non-blocking memory reclaim, we aren't
> > > > supposed to block for arbitrarily long periods of time or create
> > > > unbound memory demand processing inode eviction (both of which XFS
> > > > can do in inactivation).
> > > > 
> > > > IOWs, XFS can't free the inode until its journal releases the
> > > > internal reference on the dirty inode. ext4 doesn't track inodes in
> > > > its journal - it only tracks inode buffers that contain the changes
> > > > made to the inode, so once the transaction is committed in
> > > > ext4_evict_inode() the inode can be immediately freed via either
> > > > ->destroy_inode or ->free_inode. That option does not exist for XFS
> > > > because we have to wait for the journal to finish with the inode
> > > > before it can be freed. Hence all the background reclaim stuff.
> > > > 
> > > > We've recently solved several of the problems we need to solve to
> > > > reduce the mismatch; avoiding blocking on inode writeback in reclaim
> > > > and background inactivation are two of the major pieces of work we
> > > > needed done before we could even consider more closely aligning XFS
> > > > to the VFS inode cache life cycle model.
> > > > 
> > > The background inactivation work facilitates an incremental improvement
> > > by nature because destroyed inodes go directly to a queue instead of
> > > being processed synchronously. My most recent test to stamp the grace
> > > period info at inode destroy time and conditionally sync at reuse time
> > > shows pretty much no major cost because the common case is that a grace
> > > period has already expired by the time the queue populates, is processed
> > > and said inodes become reclaimable and reallocated.
> > 
> > Yup. Remember that I suggested these conditional variants in the
> > first place - I do understand what this code does...
> > 
> > > To go beyond just
> > > the performance result, if I open code the conditional sync for tracking
> > > purposes I only see something like 10-15 rcu waits out of the 36k
> > > allocation cycles. If I increase the background workload 4x, the
> > > allocation rate drops to ~33k cycles (which is still pretty much in line
> > > with baseline) and the rcu sync count increases to 70, which again is
> > > relatively nominal over tens of thousands of cycles.
> > 
> > Yup. But that doesn't mean that the calls that trigger are free from
> > impact. The cost and latency of waiting for an RCU grace period to
> > expire goes up as the CPU count goes up. e.g. it requires that every CPU
> > running a task go through a context switch before it returns.
> > Hence if we end up with situations like, say, the ioend completion
> > scheduling holdoffs, then that will prevent the RCU sync from
> > returning for seconds.
> > 
> 
> Sure... this is part of the reason the tests I've run so far have all
> tried to incorporate background rcuwalk activity, run on a higher cpu
> count box, etc. And from the XFS side of the coin, the iget code can
> invoke xfs_inodegc_queue_all() in the needs_inactive case before
> reclaimable state is a possibility, which queues a work item on every cpu
> with pending inactive inodes. That is probably unlikely in the
> free/alloc case (since needs_inactive inodes are not yet free on disk),
> but the broader points are that the inactive processing work has to
> complete one way or another before reclaimable state is possible and
> that we can probably accommodate a synchronization point here if it's
> reasonably filtered. Otherwise...
> 
> > IOWs, we're effectively adding unpredictable and non-deterministic
> > latency into the recycle path that is visible to userspace
> > applications, and those latencies can be caused by subsystem
> > functionality not related to XFS. Hence we need to carefully
> > consider unexpected side-effects of adding a kernel global
> > synchronisation point into an XFS icache lookup fast path, and these
> > issues may not be immediately obvious from testing...
> > 
> 
> ... agreed. I think at this point we've also discussed various potential
> ways to shift or minimize latency/cost further, so there's probably
> still room for refinement if such unexpected things crop up before...
> > > This all requires some more thorough testing, but I'm sure it won't be
> > > absolutely free for every possible workload or environment. But given
> > > that we know this infrastructure is fundamentally broken (by subtle
> > > incompatibilities between XFS and the VFS that have evolved over time),
> > > will require some thought and time to fix properly in the filesystem,
> > > that users are running into problems very closely related to it, why not
> > > try to address the fundamental breakage if we can do so with an isolated
> > > change with minimal (but probably not zero) performance impact?
> 
> > > I agree that the unconditional synchronize_rcu() on reuse approach is
> > > just not viable, but so far tests using cond_synchronize_rcu() seem
> > > fairly reasonable. Is there some other problem or concern with such an
> > > approach?
> > 
> > Just that the impact of adding RCU sync points means that bad
> > behaviour outside XFS has a new point where it can adversely
> > impact applications doing filesystem operations.
> > 
> > As a temporary mitigation strategy I think it will probably be fine,
> > but I'd much prefer we get rid of the need for such an RCU sync
> > point rather than try to maintain a mitigation like this in fast
> > path code forever.
> > 
> 
> ... we end up here. All in all, this is intended to be a
> practical/temporary step toward functional correctness that minimizes
> performance impact and disruption (i.e. just as easy to remove when made
> unnecessary).
> 
> Al,
> 
> The caveat to this is I think the practicality of a conditional sync in
> the iget recycle code sort of depends on the queueing/batching nature of
> inactive inode processing in XFS. If you look at xfs_fs_destroy_inode()
> for example, you'll see this is all fairly recent feature/infrastructure
> code and that historically we completed most of this transition to
> reclaimable state before ->destroy_inode() returned. Therefore, the
> concern I have is that on older/stable kernels (where users are hitting
> this NULL ->get_link() BUG) the reuse code is far more likely to stall
> and slow down here with this change alone (see my earlier numbers on the
> unconditional sync_rcu() test for prospective worst case). For that
> reason, I'm not sure this is really a backportable solution.
> 
> So getting back to your concern around Ian's patch being a
> stopgap/bandaid type solution, would you be willing to pull something
> like Ian's patch to the vfs if upstream XFS adopts this conditional rcu
> sync in the iget reuse code? I think that would ensure that no further
> bandaid fixes will be required in the vfs due to XFS inode reuse, but
> would also provide an isolated variant of the fix in the VFS that is
> more easily backported to stable kernels. Thoughts?
> 
> Brian
> 

Oh and I meant to throw up a diff of what I was testing for reference.
My test code was significantly more hacked up with debug code and such,
but if I clean it up a bit the change reduces to the diff below.

Brian

--- 8< ---

diff --git a/fs/xfs/xfs_icache.c b/fs/xfs/xfs_icache.c
index d019c98eb839..8a24ced4d73a 100644
--- a/fs/xfs/xfs_icache.c
+++ b/fs/xfs/xfs_icache.c
@@ -349,6 +349,10 @@ xfs_iget_recycle(
 	spin_unlock(&ip->i_flags_lock);
 	rcu_read_unlock();
 
+	ASSERT(ip->i_destroy_gp != 0);
+	cond_synchronize_rcu(ip->i_destroy_gp);
+	ip->i_destroy_gp = 0;
+
 	ASSERT(!rwsem_is_locked(&inode->i_rwsem));
 	error = xfs_reinit_inode(mp, inode);
 	if (error) {
@@ -2019,6 +2023,7 @@ xfs_inodegc_queue(
 	trace_xfs_inode_set_need_inactive(ip);
 	spin_lock(&ip->i_flags_lock);
 	ip->i_flags |= XFS_NEED_INACTIVE;
+	ip->i_destroy_gp = start_poll_synchronize_rcu();
 	spin_unlock(&ip->i_flags_lock);
 
 	gc = get_cpu_ptr(mp->m_inodegc);
diff --git a/fs/xfs/xfs_inode.h b/fs/xfs/xfs_inode.h
index c447bf04205a..a978da2332a0 100644
--- a/fs/xfs/xfs_inode.h
+++ b/fs/xfs/xfs_inode.h
@@ -75,6 +75,8 @@ typedef struct xfs_inode {
 	spinlock_t		i_ioend_lock;
 	struct work_struct	i_ioend_work;
 	struct list_head	i_ioend_list;
+
+	unsigned long		i_destroy_gp;
 } xfs_inode_t;
 
 /* Convert from vfs inode to xfs inode */
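
The diff above reduces to the polled grace-period pattern sketched below:
record a grace-period cookie when the object is destroyed, and only wait
at reuse time if that grace period has not yet elapsed. This is just an
illustrative sketch, not XFS code; cached_obj, cached_obj_destroy() and
cached_obj_reuse() are made-up names, and only start_poll_synchronize_rcu()
and cond_synchronize_rcu() (declared via <linux/rcupdate.h>) are real
kernel APIs.

#include <linux/rcupdate.h>

struct cached_obj {
	unsigned long	destroy_gp;	/* RCU grace-period cookie */
	/* ... */
};

/*
 * Destroy side: cheap and never blocks. Record (and if necessary start)
 * the current grace period so the reuse side knows what to wait for.
 */
static void cached_obj_destroy(struct cached_obj *obj)
{
	obj->destroy_gp = start_poll_synchronize_rcu();
	/* queue obj for background processing / later reuse */
}

/*
 * Reuse side: returns immediately if a full grace period has already
 * elapsed since cached_obj_destroy() ran (the common case); otherwise it
 * blocks until one has, so no rcu_read_lock() walker can still be looking
 * at the old incarnation of the object.
 */
static void cached_obj_reuse(struct cached_obj *obj)
{
	cond_synchronize_rcu(obj->destroy_gp);
	obj->destroy_gp = 0;
	/* reinitialize obj for its new user */
}

The point of this shape is that the latency moves from every reuse (an
unconditional synchronize_rcu()) to only those reuses that race with a
still-running grace period, which is what keeps the common allocation
path fast in the numbers discussed above.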