On Fri, Jul 08, 2022 at 09:33:47AM +1000, Dave Chinner wrote:
> Hi Darrick,
>
> Can you please pull the CIL scalability improvements for 5.20 from
> the tag below? This branch is based on the linux-xfs/for-next branch
> as of 2 days ago, so should apply without any merge issues at all.
>
> Cheers,
>
> Dave.
>
> The following changes since commit 7561cea5dbb97fecb952548a0fb74fb105bf4664:
>
>   xfs: prevent a UAF when log IO errors race with unmount (2022-07-01 09:09:52 -0700)
>
> are available in the Git repository at:
>
>   git://git.kernel.org/pub/scm/linux/kernel/git/dgc/linux-xfs tags/xfs-cil-scale-5.20
>
> for you to fetch changes up to 51a117edff133a1ea8cb0fcbc599b8d5a34414e9:
>
>   xfs: expanding delayed logging design with background material (2022-07-07 18:56:09 +1000)
>
> ----------------------------------------------------------------
> xfs: improve CIL scalability
>
> This series aims to improve the scalability of XFS transaction
> commits on large CPU count machines. My 32p machine hits contention
> limits in xlog_cil_commit() at about 700,000 transaction commits a
> second. It hits this at 16 thread workloads, and 32 thread
> workloads go no faster and just burn CPU on the CIL spinlocks.
>
> This patchset gets rid of spinlocks and global serialisation points
> in the xlog_cil_commit() path. It does this by moving to a
> combination of per-cpu counters, unordered per-cpu lists and
> post-ordered per-cpu lists.

FWIW, I've (rather infrequently) seen things like this in the 10 months
or so that this has been in mainline:

run fstests generic/650 at 2023-05-10 19:17:09
XFS (sda3): EXPERIMENTAL Large extent counts feature in use. Use at your own risk!
XFS (sda3): Mounting V5 Filesystem 75c42b12-8a39-4ecd-aac4-6b6ab0e384bd
XFS (sda3): Ending clean mount
smpboot: CPU 1 is now offline
x86: Booting SMP configuration:
smpboot: Booting Node 0 Processor 1 APIC 0x1
smpboot: CPU 1 is now offline
smpboot: CPU 3 is now offline
x86: Booting SMP configuration:
smpboot: Booting Node 0 Processor 1 APIC 0x1
smpboot: Booting Node 0 Processor 3 APIC 0x3
smpboot: CPU 3 is now offline
smpboot: Booting Node 0 Processor 3 APIC 0x3
smpboot: CPU 2 is now offline
smpboot: CPU 3 is now offline
XFS (sda3): ctx ticket reservation ran out. Need to up reservation
XFS (sda3): ticket reservation summary:
XFS (sda3): unit res = 9268 bytes
XFS (sda3): current res = -40 bytes
XFS (sda3): original count = 1
XFS (sda3): remaining count = 1
XFS (sda3): Filesystem has been shut down due to log error (0x2).
XFS (sda3): Please unmount the filesystem and rectify the problem(s).

Not sure what that's about, but given the recent discussions about
percpu counters not quite working correctly when racing with cpu
hotremove, I figured this would be a good time to capture one of the
failures and report it to the list.

--D

> This results in transaction commit rates exceeding 1.4 million
> commits/s under certain unlink workloads, and while the log lock
> contention is largely gone there is still significant lock
> contention in the VFS (dentry cache, inode cache and security layers)
> at >600,000 transactions/s that still limits scalability.
>
> The changes to the CIL accounting and behaviour, combined with the
> structural changes to xlog_write() in prior patchsets make the
> per-cpu restructuring possible and sane. This allows us to move to
> precalculated reservation requirements that allow for reservation
> stealing to be accounted across multiple CPUs accurately.
>
> That is, instead of trying to account for continuation log opheaders
> on a "growth" basis, we pre-calculate how many iclogs we'll need to
> write out a maximally sized CIL checkpoint and steal that reserved
> space one commit at a time until the CIL has a full
> reservation. If we ever run a commit when we are already at the hard
> limit (because post-throttling) we simply take an extra reservation
> from each commit that is run when over the limit. Hence we don't
> need to do space usage math in the fast path and so never need to
> sum the per-cpu counters in this fast path.
>
> Similarly, per-cpu lists have the problem of ordering - we can't
> remove an item from a per-cpu list if we want to move it forward in
> the CIL. We solve this problem by using an atomic counter to give
> every commit a sequence number that is copied into the log items in
> that transaction. Hence relogging items just overwrites the sequence
> number in the log item, and does not move it in the per-cpu lists.
> Once we reaggregate the per-cpu lists back into a single list in the
> CIL push work, we can run it through list_sort() and reorder it back
> into a globally ordered list. This costs a bit of CPU time, but now
> that the CIL can run multiple works and pipelines properly, this is
> not a limiting factor for performance. It does increase fsync
> latency when the CIL is full, but workloads issuing large numbers of
> fsync()s or sync transactions end up with very small CILs and so the
> latency impact of sorting is not measurable for such workloads.
>
> Overall, this pushes the transaction commit bottleneck out to the
> lockless reservation grant head updates. These atomic updates don't
> start to be a limiting factor until > 1.5 million transactions/s are
> being run, at which point the accounting functions start to show up
> in profiles as the highest CPU users.
> Still, this series doubles
> transaction throughput without increasing CPU usage before we get
> to that cacheline contention breakdown point...
>
> Signed-off-by: Dave Chinner <dchinner@xxxxxxxxxx>
>
> ----------------------------------------------------------------
> Dave Chinner (14):
>       xfs: use the CIL space used counter for emptiness checks
>       xfs: lift init CIL reservation out of xc_cil_lock
>       xfs: rework per-iclog header CIL reservation
>       xfs: introduce per-cpu CIL tracking structure
>       xfs: implement percpu cil space used calculation
>       xfs: track CIL ticket reservation in percpu structure
>       xfs: convert CIL busy extents to per-cpu
>       xfs: Add order IDs to log items in CIL
>       xfs: convert CIL to unordered per cpu lists
>       xfs: convert log vector chain to use list heads
>       xfs: move CIL ordering to the logvec chain
>       xfs: avoid cil push lock if possible
>       xfs: xlog_sync() manually adjusts grant head space
>       xfs: expanding delayed logging design with background material
>
>  Documentation/filesystems/xfs-delayed-logging-design.rst | 361 +++++++++++++++++++++++++++++++++++++++++++++++------
>  fs/xfs/xfs_log.c                                         |  55 ++++++---
>  fs/xfs/xfs_log.h                                         |   3 +-
>  fs/xfs/xfs_log_cil.c                                     | 472 +++++++++++++++++++++++++++++++++++++++++++++++++++++----------------
>  fs/xfs/xfs_log_priv.h                                    |  58 ++++++---
>  fs/xfs/xfs_super.c                                       |   1 +
>  fs/xfs/xfs_trans.c                                       |   4 +-
>  fs/xfs/xfs_trans.h                                       |   1 +
>  fs/xfs/xfs_trans_priv.h                                  |   3 +-
>  9 files changed, 768 insertions(+), 190 deletions(-)
>
> --
> Dave Chinner
> david@xxxxxxxxxxxxx
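
The sequence-number ordering described in the quoted cover letter can be
illustrated with a small user-space sketch. This is not the kernel
implementation: struct item, struct pcp_list, cil_commit_sketch() and
cil_push_sketch() are hypothetical names made up for this example, plain
arrays stand in for the per-cpu lists, and qsort() stands in for
list_sort(). It only shows that relogging an item overwrites its order ID
without moving it between lists, and that sorting at push time recovers
the global commit order.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>

#define NR_CPUS   4   /* hypothetical CPU count for this sketch */
#define MAX_ITEMS 16  /* hypothetical per-CPU list capacity */

/* Hypothetical stand-in for a log item; not the kernel's structure. */
struct item {
	int      id;        /* identifies the logged object */
	uint64_t order_id;  /* sequence number of the last commit that logged it */
	int      on_list;   /* non-zero once the item sits on a per-CPU list */
};

/* Hypothetical stand-in for one CPU's unordered list. */
struct pcp_list {
	struct item *items[MAX_ITEMS];
	size_t       count;
};

static struct pcp_list pcp[NR_CPUS];
static atomic_uint_fast64_t commit_seq; /* global commit sequence counter */

/*
 * Commit a transaction on @cpu: take one new sequence number and copy it
 * into every item. Items already on a list are only restamped (relogged),
 * never moved.
 */
static uint64_t cil_commit_sketch(int cpu, struct item **items, size_t n)
{
	uint64_t seq = atomic_fetch_add(&commit_seq, 1) + 1;
	struct pcp_list *list;
	size_t i;

	assert(cpu >= 0 && cpu < NR_CPUS);
	list = &pcp[cpu];
	for (i = 0; i < n; i++) {
		items[i]->order_id = seq;
		if (!items[i]->on_list) {
			assert(list->count < MAX_ITEMS);
			list->items[list->count++] = items[i];
			items[i]->on_list = 1;
		}
	}
	return seq;
}

/* Order by sequence number; break ties by id so the result is deterministic. */
static int cmp_order(const void *pa, const void *pb)
{
	const struct item *a = *(const struct item *const *)pa;
	const struct item *b = *(const struct item *const *)pb;

	if (a->order_id != b->order_id)
		return a->order_id < b->order_id ? -1 : 1;
	return (a->id > b->id) - (a->id < b->id);
}

/*
 * Push: splice every per-CPU list into @out (which must hold at least
 * NR_CPUS * MAX_ITEMS pointers), empty the lists, then sort by order ID.
 * Returns the number of items gathered.
 */
static size_t cil_push_sketch(struct item **out)
{
	size_t n = 0, i;
	int cpu;

	for (cpu = 0; cpu < NR_CPUS; cpu++) {
		for (i = 0; i < pcp[cpu].count; i++) {
			pcp[cpu].items[i]->on_list = 0;
			out[n++] = pcp[cpu].items[i];
		}
		pcp[cpu].count = 0;
	}
	qsort(out, n, sizeof(out[0]), cmp_order);
	return n;
}
```

For example, committing items a and b on CPU 0, then c on CPU 1, then
relogging a from CPU 2 leaves a on CPU 0's list with a newer order ID, and
the push returns b, c, a. Unlike this sketch, the kernel's list_sort() is
a stable sort on linked lists, so it needs no tie-break on id.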