On Mon, Aug 02, 2021 at 07:49:10AM +1000, Dave Chinner wrote: > On Fri, Jul 30, 2021 at 09:21:12PM -0700, Darrick J. Wong wrote: > > On Fri, Jul 30, 2021 at 02:24:00PM +1000, Dave Chinner wrote: > > > On Thu, Jul 29, 2021 at 11:44:10AM -0700, Darrick J. Wong wrote: > > And since the shrinkers are always a source of amusement, what /is/ up > > with it? I don't really like having to feed it magic numbers just to > > get it to do what I want, which is ... let it free some memory in the > > first round, then we'll kick the background workers when the priority > > bumps (er, decreases), and hope that's enough not to OOM the box. > > Well, the shrinkers are intended as a one-shot memory pressure > notification as you are trying to use them. They are intended to be > told the amount of work that needs to be done to free memory, and > they they calculate how much of that work should be done based on > it's idea of the current level of memory pressure. > > One shot shrinker triggers never tend to work well because they > treat all memory pressure the same - very light memory pressure is > dead with by the same big hammer than deals with OOM levels of > memory pressure. > > As it is, I'm more concerned right now with finding out why there's > such large performance regressions in highly concurrent recursive > chmod/unlink workloads. I spend most of friday looking at it trying > to work out what behaviour was causing the regression, but I haven't > isolated it yet. So I pulled all of my patchsets out and I'm just looking at the deferred inactivation changes now. rm -rf triggers a profile like: - 68.94% 3.24% [kernel] [k] xlog_cil_commit - 65.70% xlog_cil_commit - 55.01% _raw_spin_lock - do_raw_spin_lock 54.14% __pv_queued_spin_lock_slowpath 2.26% memcpy_erms - 1.60% xfs_buf_item_committing - 1.57% xfs_buf_item_release - 0.70% xfs_buf_unlock 0.67% up 0.57% xfs_buf_rele 1.09% xfs_buf_item_format - 0.90% _raw_spin_unlock - 0.80% do_raw_spin_unlock 0.61% __raw_callee_save___pv_queued_spin_unlock - 0.81% xfs_buf_item_size 0.57% xfs_buf_item_size_segment.isra.0 0.67% xfs_inode_item_format And the path into here is split almost exactly 50/50 between xfs_remove() (process context) and xfs_inactive (deferred context). + 40.85% 0.02% [kernel] [k] xfs_remove + 40.61% 0.00% [kernel] [k] xfs_inodegc_worker rm -rf runtime without the patchset is 2m30s, but 3m41s with it. So, essentially, the background inactivation increases the concurrency through the transaction commit path and causes a massive increase in CIL push lock contention (i.e. catastrophic lock breakdown) which makes performance go backwards. Nothing can be done about this except, well, merge the CIL scalability patches in my patch stack that address this exact lock contention problem. > I suspect that it is lockstepping between user > processes and background inactivation for the unlink - I'm seeing > the backlink rhashtable show up in the profiles which indicates the > unlinked list lengths are an issue and we're lockstepping the AGI. > It also may simply be that there is too much parallelism hammering > the transaction subsystem now.... The reason I've been seeing different symptoms is that my CIL scalability patchset entirely eliminates this spinlock contention and something else less obvious becomes the performance limiting factor. i.e. the CPU profile looks like this instead: + 33.33% 0.00% [kernel] [k] xfs_inodegc_worker + 26.45% 5.49% [kernel] [k] xlog_cil_commit + 23.13% 0.04% [kernel] [k] xfs_remove And runtime is a little lower (3m20s) but still not "fast". The fact that CPU time has gone down so much results in idle time and indicates we are now contending on a sleeping lock as the context swtich profile indicates: + 36.23% 0.00% [kernel] [k] xfs_buf_read_map .... + 22.99% 0.00% [kernel] [k] down + 22.99% 0.00% [kernel] [k] __down + 22.99% 0.00% [kernel] [k] xfs_read_agi .... + 11.83% 0.00% [kernel] [k] xfs_imap_to_bp Over a third of the context switches are on locked buffers, 2/3s of tehm on AGI buffers, and the rest are mostly on inode inode cluster buffers likely doing unlinked list modifications. IOWs, we traded CIL push lock contention for AGI and inode cluster buffer lock stepping between the unlink() syscall context and the background inactivation processes. This isn't a new problem - I've known about it for years, and I have been working towards solving it (e.g. the single unlinked list per AGI patches). In effect, the problem is that we always add newly unlinked to the head - the AGI end - of the unlinked lists, so we must always take the AGI lock in unlink() context just to update the unlinked list. On the inactivation side, we always have to take the AGI because we are freeing inodes. With deferred inactivation, the unlinked lists grow large, so we could avoid needing to modify the AGI by adding inodes to the tail of the unlinked list instead of the head. Unfortunately, to do this we currently need to store 64 agino tail pointers per AGI in memory. I might try modifying my patches to do this so I can re-order the unlinked list without needing to change on disk format. It's not very data cache or memory efficient, but it likely will avoid most of the AGI contention on these workloads. However, none of this is ready for prime time, so the next thing I'll look at is why both foreground and background are running at the same time instead of batching effectively. i.e. unlink context runs a bunch of unlinks adding inodes to the unlinked lists, then background runs doing the deferred work while foreground stays out of the way (e.g. is throttled). This will involve looking at traces, so I suspect it's going to take me a day or two to extract repeating patterns and understand them well enough to determine what to do/look at next. Cheers, Dave. -- Dave Chinner david@xxxxxxxxxxxxx