FYI: I've just rebased the git tree branch containing this code on
the V7 version of the xlog_write() rework patch set I just posted.
No changes to this series were made in the rebase.

Cheers,

Dave.

On Tue, Nov 09, 2021 at 12:52:26PM +1100, Dave Chinner wrote:
> Time to try again to get this code merged.
>
> This series aims to improve the scalability of XFS transaction
> commits on large CPU count machines. My 32p machine hits contention
> limits in xlog_cil_commit() at about 700,000 transaction commits a
> second. It hits this at 16 thread workloads, and 32 thread
> workloads go no faster and just burn CPU on the CIL spinlocks.
>
> This patchset gets rid of spinlocks and global serialisation points
> in the xlog_cil_commit() path. It does this by moving to a
> combination of per-cpu counters, unordered per-cpu lists and
> post-ordered per-cpu lists.
>
> This results in transaction commit rates exceeding 1.6 million
> commits/s under certain unlink workloads. While the log lock
> contention is largely gone, there is still significant lock
> contention at the VFS at 600,000 transactions/s:
>
>   19.39%  [kernel]  [k] __pv_queued_spin_lock_slowpath
>    6.40%  [kernel]  [k] do_raw_spin_lock
>    4.07%  [kernel]  [k] __raw_callee_save___pv_queued_spin_unlock
>    3.08%  [kernel]  [k] memcpy_erms
>    1.93%  [kernel]  [k] xfs_buf_find
>    1.69%  [kernel]  [k] xlog_cil_commit
>    1.50%  [kernel]  [k] syscall_exit_to_user_mode
>    1.18%  [kernel]  [k] memset_erms
>
>    - 64.23%  0.22%  [kernel]  [k] path_openat
>       - 64.01% path_openat
>          - 48.69% xfs_vn_create
>             - 48.60% xfs_generic_create
>                - 40.96% xfs_create
>                   - 20.39% xfs_dir_ialloc
>                   - 7.05% xfs_setup_inode
> >>>>>                - 6.87% inode_sb_list_add
>                         - 6.54% _raw_spin_lock
>                            - 6.53% do_raw_spin_lock
>                                 6.08% __pv_queued_spin_lock_slowpath
>                   .....
>                   - 11.27% xfs_trans_commit
>                      - 11.23% __xfs_trans_commit
>                         - 10.85% xlog_cil_commit
>                              2.47% memcpy_erms
>                            - 1.77% xfs_buf_item_committing
>                               - 1.70% xfs_buf_item_release
>                                  - 0.79% xfs_buf_unlock
>                                       0.68% up
>                                    0.61% xfs_buf_rele
>                              0.80% xfs_buf_item_format
>                              0.73% xfs_inode_item_format
>                              0.68% xfs_buf_item_size
>                            - 0.55% kmem_alloc_large
>                               - 0.55% kmem_alloc
>                                    0.52% __kmalloc
>                .....
>                - 7.08% d_instantiate
>                   - 6.66% security_d_instantiate
> >>>>>>               - 6.63% selinux_d_instantiate
>                         - 6.48% inode_doinit_with_dentry
>                            - 6.11% _raw_spin_lock
>                               - 6.09% do_raw_spin_lock
>                                    5.60% __pv_queued_spin_lock_slowpath
>          ....
>          - 1.77% terminate_walk
> >>>>>>      - 1.69% dput
>                - 1.55% _raw_spin_lock
>                   - do_raw_spin_lock
>                        1.19% __pv_queued_spin_lock_slowpath
>
> But when we extend out to 1.5M commits/s we see that the contention
> starts to shift to the atomics in the lockless log reservation path:
>
>   14.81%  [kernel]  [k] __pv_queued_spin_lock_slowpath
>    7.88%  [kernel]  [k] xlog_grant_add_space
>    7.18%  [kernel]  [k] xfs_log_ticket_ungrant
>    4.82%  [kernel]  [k] do_raw_spin_lock
>    3.58%  [kernel]  [k] xlog_space_left
>    3.51%  [kernel]  [k] xlog_cil_commit
>
> There's still substantial spin lock contention occurring at the VFS,
> too, but this indicates that the multiple atomic variable updates
> per transaction reservation/commit pair are starting to reach their
> scalability limits here.
>
> This is largely a re-implementation of past RFC patchsets. While
> those were good enough proofs of concept to perf test, they did not
> preserve transaction order correctly and failed shutdown tests all
> the time. The changes to the CIL accounting and behaviour, combined
> with the structural changes to xlog_write() in prior patchsets, make
> the per-cpu restructuring possible and sane.
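[As a quick illustration for anyone skimming: the per-cpu space
accounting pattern being referred to here boils down to something
like the sketch below. These are made-up names (cil_space_used and
friends are hypothetical), not the actual patch code, and it assumes
the fold in the push work is serialised against concurrent commits
by the CIL context switch.]

	#include <linux/percpu.h>

	static DEFINE_PER_CPU(long, cil_space_used);

	/* commit fast path: CPU-local add - no locks, no shared atomics */
	static void cil_account_space(long bytes)
	{
		this_cpu_add(cil_space_used, bytes);
	}

	/* CIL push slow path: sum and zero the per-cpu counters */
	static long cil_fold_space(void)
	{
		long sum = 0;
		int cpu;

		for_each_possible_cpu(cpu) {
			sum += per_cpu(cil_space_used, cpu);
			per_cpu(cil_space_used, cpu) = 0;
		}
		return sum;
	}

[The point being that the hot path only ever touches a CPU-local
cacheline; the expensive sum happens once per checkpoint push, not
once per commit.]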
>
> Instead of trying to account for continuation log opheaders on a
> "growth" basis, we pre-calculate how many iclogs we'll need to write
> out a maximally sized CIL checkpoint and just reserve that space
> once per commit until the CIL has a full reservation. If we ever run
> a commit when we are already at the hard limit (because of
> post-throttling) we simply take an extra reservation from each
> commit that is run while over the limit. Hence we don't need to do
> space usage math in the fast path and so never need to sum the
> per-cpu counters in this path.
>
> Similarly, per-cpu lists have the problem of ordering - we can't
> remove an item from a per-cpu list if we want to move it forward in
> the CIL. We solve this problem by using an atomic counter to give
> every commit a sequence number that is copied into the log items in
> that transaction. Hence relogging items just overwrites the sequence
> number in the log item, and does not move it in the per-cpu lists.
> Once we reaggregate the per-cpu lists back into a single list in the
> CIL push work, we can run it through list_sort() and reorder it back
> into a globally ordered list. This costs a bit of CPU time, but now
> that the CIL can run multiple push works and pipeline properly, this
> is not a limiting factor for performance. It does increase fsync
> latency when the CIL is full, but workloads issuing large numbers of
> fsync()s or sync transactions end up with very small CILs, and so
> the latency impact of sorting is not measurable for such workloads.
>
> git://git.kernel.org/pub/scm/linux/kernel/git/dgc/linux-xfs.git xfs-cil-scale-3
>
> Version 6:
> - split out from aggregated patchset
> - rebase on linux-xfs/for-next + dgc/xlog-write-rework
>
> Version 5:
> - https://lore.kernel.org/linux-xfs/20210603052240.171998-1-david@xxxxxxxxxxxxx/

-- 
Dave Chinner
david@xxxxxxxxxxxxx
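[Similarly, the sequence number/list_sort() ordering scheme quoted
above looks roughly like the sketch below - again made-up names and
a simplification of the real thing, with item->list assumed to have
been initialised with INIT_LIST_HEAD() and all external serialisation
omitted.]

	#include <linux/types.h>
	#include <linux/list.h>
	#include <linux/list_sort.h>
	#include <linux/atomic.h>

	/* hypothetical stand-in for a CIL log item */
	struct cil_item {
		struct list_head	list;		/* per-cpu CIL list linkage */
		u64			order_id;	/* commit sequence number */
	};

	static atomic64_t cil_order_id = ATOMIC64_INIT(0);

	/*
	 * Commit fast path: relogging an item already on a per-cpu list
	 * just overwrites its sequence number; the item is never moved.
	 */
	static void cil_insert_item(struct cil_item *item,
				    struct list_head *cpu_list)
	{
		item->order_id = atomic64_inc_return(&cil_order_id);
		if (list_empty(&item->list))
			list_add_tail(&item->list, cpu_list);
	}

	static int cil_order_cmp(void *priv, const struct list_head *a,
				 const struct list_head *b)
	{
		struct cil_item *ia = list_entry(a, struct cil_item, list);
		struct cil_item *ib = list_entry(b, struct cil_item, list);

		return ia->order_id < ib->order_id ? -1 : 1;
	}

	/*
	 * CIL push: after splicing all the per-cpu lists onto @merged,
	 * restore global commit order with list_sort().
	 */
	static void cil_reorder_items(struct list_head *merged)
	{
		list_sort(NULL, merged, cil_order_cmp);
	}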