FYI: I've just rebased the git tree branch containing this code on
the V7 version of the xlog_write() rework patch set I just posted.
No changes to this series were made in the rebase.

Cheers,

Dave.

On Tue, Nov 09, 2021 at 12:52:26PM +1100, Dave Chinner wrote:
> Time to try again to get this code merged.
>
> This series aims to improve the scalability of XFS transaction
> commits on large CPU count machines. My 32p machine hits contention
> limits in xlog_cil_commit() at about 700,000 transaction commits a
> second. It hits this at 16 thread workloads, and 32 thread
> workloads go no faster and just burn CPU on the CIL spinlocks.
>
> This patchset gets rid of spinlocks and global serialisation points
> in the xlog_cil_commit() path. It does this by moving to a
> combination of per-cpu counters, unordered per-cpu lists and
> post-ordered per-cpu lists.
>
> This results in transaction commit rates exceeding 1.6 million
> commits/s under certain unlink workloads. While the log lock
> contention is largely gone, there is still significant lock
> contention at the VFS at 600,000 transactions/s:
>
>   19.39%  [kernel]  [k] __pv_queued_spin_lock_slowpath
>    6.40%  [kernel]  [k] do_raw_spin_lock
>    4.07%  [kernel]  [k] __raw_callee_save___pv_queued_spin_unlock
>    3.08%  [kernel]  [k] memcpy_erms
>    1.93%  [kernel]  [k] xfs_buf_find
>    1.69%  [kernel]  [k] xlog_cil_commit
>    1.50%  [kernel]  [k] syscall_exit_to_user_mode
>    1.18%  [kernel]  [k] memset_erms
>
>    - 64.23%  0.22%  [kernel]  [k] path_openat
>       - 64.01% path_openat
>          - 48.69% xfs_vn_create
>             - 48.60% xfs_generic_create
>                - 40.96% xfs_create
>                   - 20.39% xfs_dir_ialloc
>                   - 7.05% xfs_setup_inode
> >>>>>                - 6.87% inode_sb_list_add
>                         - 6.54% _raw_spin_lock
>                            - 6.53% do_raw_spin_lock
>                                 6.08% __pv_queued_spin_lock_slowpath
>                   .....
>                   - 11.27% xfs_trans_commit
>                      - 11.23% __xfs_trans_commit
>                         - 10.85% xlog_cil_commit
>                              2.47% memcpy_erms
>                            - 1.77% xfs_buf_item_committing
>                               - 1.70% xfs_buf_item_release
>                                  - 0.79% xfs_buf_unlock
>                                       0.68% up
>                                    0.61% xfs_buf_rele
>                              0.80% xfs_buf_item_format
>                              0.73% xfs_inode_item_format
>                              0.68% xfs_buf_item_size
>                            - 0.55% kmem_alloc_large
>                               - 0.55% kmem_alloc
>                                    0.52% __kmalloc
>                .....
>                - 7.08% d_instantiate
>                   - 6.66% security_d_instantiate
> >>>>>>               - 6.63% selinux_d_instantiate
>                         - 6.48% inode_doinit_with_dentry
>                            - 6.11% _raw_spin_lock
>                               - 6.09% do_raw_spin_lock
>                                    5.60% __pv_queued_spin_lock_slowpath
>          ....
>          - 1.77% terminate_walk
> >>>>>>      - 1.69% dput
>                - 1.55% _raw_spin_lock
>                   - do_raw_spin_lock
>                        1.19% __pv_queued_spin_lock_slowpath
>
> But when we extend out to 1.5M commits/s we see that the contention
> starts to shift to the atomics in the lockless log reservation path:
>
>   14.81%  [kernel]  [k] __pv_queued_spin_lock_slowpath
>    7.88%  [kernel]  [k] xlog_grant_add_space
>    7.18%  [kernel]  [k] xfs_log_ticket_ungrant
>    4.82%  [kernel]  [k] do_raw_spin_lock
>    3.58%  [kernel]  [k] xlog_space_left
>    3.51%  [kernel]  [k] xlog_cil_commit
>
> There's still substantial spin lock contention occurring at the VFS,
> too, but this indicates that the multiple atomic variable updates
> per transaction reservation/commit pair are starting to reach their
> scalability limits here.
>
> This is largely a re-implementation of past RFC patchsets. While
> those were good enough proofs of concept to perf test, they did not
> preserve transaction order correctly and failed shutdown tests all
> the time. The changes to the CIL accounting and behaviour, combined
> with the structural changes to xlog_write() in prior patchsets, make
> the per-cpu restructuring possible and sane.
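[As a quick illustration for anyone skimming: the per-cpu space
accounting pattern being referred to here boils down to something
like the sketch below. These are made-up names (cil_space_used and
friends are hypothetical), not the actual patch code, and it assumes
the fold in the push work is serialised against concurrent commits
by the CIL context switch.]

	#include <linux/percpu.h>

	static DEFINE_PER_CPU(long, cil_space_used);

	/* commit fast path: CPU-local add - no locks, no shared atomics */
	static void cil_account_space(long bytes)
	{
		this_cpu_add(cil_space_used, bytes);
	}

	/* CIL push slow path: sum and zero the per-cpu counters */
	static long cil_fold_space(void)
	{
		long sum = 0;
		int cpu;

		for_each_possible_cpu(cpu) {
			sum += per_cpu(cil_space_used, cpu);
			per_cpu(cil_space_used, cpu) = 0;
		}
		return sum;
	}

[The point being that the hot path only ever touches a CPU-local
cacheline; the expensive sum happens once per checkpoint push, not
once per commit.]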
>
> Instead of trying to account for continuation log opheaders on a
> "growth" basis, we pre-calculate how many iclogs we'll need to write
> out a maximally sized CIL checkpoint and just reserve that space
> once per commit until the CIL has a full reservation. If we ever run
> a commit when we are already at the hard limit (because of
> post-throttling) we simply take an extra reservation from each
> commit that is run while over the limit. Hence we don't need to do
> space usage math in the fast path and so never need to sum the
> per-cpu counters in this path.
>
> Similarly, per-cpu lists have the problem of ordering - we can't
> remove an item from a per-cpu list if we want to move it forward in
> the CIL. We solve this problem by using an atomic counter to give
> every commit a sequence number that is copied into the log items in
> that transaction. Hence relogging items just overwrites the sequence
> number in the log item, and does not move it in the per-cpu lists.
> Once we reaggregate the per-cpu lists back into a single list in the
> CIL push work, we can run it through list_sort() and reorder it back
> into a globally ordered list. This costs a bit of CPU time, but now
> that the CIL can run multiple push works and pipeline properly, this
> is not a limiting factor for performance. It does increase fsync
> latency when the CIL is full, but workloads issuing large numbers of
> fsync()s or sync transactions end up with very small CILs, and so
> the latency impact of sorting is not measurable for such workloads.
>
> git://git.kernel.org/pub/scm/linux/kernel/git/dgc/linux-xfs.git xfs-cil-scale-3
>
> Version 6:
> - split out from aggregated patchset
> - rebase on linux-xfs/for-next + dgc/xlog-write-rework
>
> Version 5:
> - https://lore.kernel.org/linux-xfs/20210603052240.171998-1-david@xxxxxxxxxxxxx/

-- 
Dave Chinner
david@xxxxxxxxxxxxx
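[Similarly, the sequence number/list_sort() ordering scheme quoted
above looks roughly like the sketch below - again made-up names and
a simplification of the real thing, with item->list assumed to have
been initialised with INIT_LIST_HEAD() and all external serialisation
omitted.]

	#include <linux/types.h>
	#include <linux/list.h>
	#include <linux/list_sort.h>
	#include <linux/atomic.h>

	/* hypothetical stand-in for a CIL log item */
	struct cil_item {
		struct list_head	list;		/* per-cpu CIL list linkage */
		u64			order_id;	/* commit sequence number */
	};

	static atomic64_t cil_order_id = ATOMIC64_INIT(0);

	/*
	 * Commit fast path: relogging an item already on a per-cpu list
	 * just overwrites its sequence number; the item is never moved.
	 */
	static void cil_insert_item(struct cil_item *item,
				    struct list_head *cpu_list)
	{
		item->order_id = atomic64_inc_return(&cil_order_id);
		if (list_empty(&item->list))
			list_add_tail(&item->list, cpu_list);
	}

	static int cil_order_cmp(void *priv, const struct list_head *a,
				 const struct list_head *b)
	{
		struct cil_item *ia = list_entry(a, struct cil_item, list);
		struct cil_item *ib = list_entry(b, struct cil_item, list);

		return ia->order_id < ib->order_id ? -1 : 1;
	}

	/*
	 * CIL push: after splicing all the per-cpu lists onto @merged,
	 * restore global commit order with list_sort().
	 */
	static void cil_reorder_items(struct list_head *merged)
	{
		list_sort(NULL, merged, cil_order_cmp);
	}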