On Tue, May 12, 2020 at 07:28:06PM +1000, Dave Chinner wrote:
> Hi folks,
>
> To follow up on the interesting performance gain I found, there are
> three RFC patches that follow up the two I posted earlier. These get
> rid of the CIL xc_cil_lock entirely by moving the entire CIL list
> and accounting to percpu structures.
>
> The result is that I'm topping out at about 1.12M transactions/s
> and bottlenecking on VFS spinlocks in the dentry cache path walk
> code and the superblock inode list lock. The XFS CIL commit path
> mostly disappears from the profiles when creating about 600,000
> inodes/s:
>
>
> -   73.42%     0.12%  [kernel]            [k] path_openat
>    - 11.29% path_openat
>       - 7.12% xfs_vn_create
>          - 7.18% xfs_vn_mknod
>             - 7.30% xfs_generic_create
>                - 6.73% xfs_create
>                   - 2.69% xfs_dir_ialloc
>                      - 2.98% xfs_ialloc
>                         - 1.26% xfs_dialloc
>                            - 1.04% xfs_dialloc_ag
>                         - 1.02% xfs_setup_inode
>                            - 0.90% inode_sb_list_add
> >>>>>                         - 1.09% _raw_spin_lock
>                                  - 4.47% do_raw_spin_lock
>                                       4.05% __pv_queued_spin_lock_slowpath
>                      - 0.75% xfs_iget
>                   - 2.43% xfs_trans_commit
>                      - 3.47% __xfs_trans_commit
>                         - 7.47% xfs_log_commit_cil
>                              1.60% memcpy_erms
>                            - 1.35% xfs_buf_item_size
>                                 0.99% xfs_buf_item_size_segment.isra.0
>                              1.30% xfs_buf_item_format
>                   - 1.44% xfs_dir_createname
>                      - 1.60% xfs_dir2_node_addname
>                         - 1.08% xfs_dir2_leafn_add
>                              0.79% xfs_dir3_leaf_check_int
>       - 1.09% terminate_walk
>          - 1.09% dput
> >>>>>>       - 1.42% _raw_spin_lock
>                 - 7.75% do_raw_spin_lock
>                      7.19% __pv_queued_spin_lock_slowpath
>       - 0.99% xfs_vn_lookup
>          - 0.96% xfs_lookup
>             - 1.01% xfs_dir_lookup
>                - 1.24% xfs_dir2_node_lookup
>                   - 1.09% xfs_da3_node_lookup_int
>       - 0.90% unlazy_walk
>          - 0.87% legitimize_root
>             - 0.94% __legitimize_path.isra.0
>                - 0.91% lockref_get_not_dead
> >>>>>>>             - 1.28% _raw_spin_lock
>                        - 6.85% do_raw_spin_lock
>                             6.29% __pv_queued_spin_lock_slowpath
>       - 0.82% d_lookup
>            __d_lookup
> .....
> +   39.21%     6.76%  [kernel]            [k] do_raw_spin_lock
> +   35.07%     0.16%  [kernel]            [k] _raw_spin_lock
> +   32.35%    32.13%  [kernel]            [k] __pv_queued_spin_lock_slowpath
>
> So we're going 3-4x faster on this machine than without these
> patches, yet we're still burning about 40% of the CPU consumed by
> the workload on spinlocks. IOWs, the XFS code is running 3-4x
> faster consuming half the CPU, and we're bashing on other locks
> now...

Just as a small follow-up: I started this with my usual 16-way
create/unlink workload, which ran at about 245k creates/s and about
150k unlinks/s. With this patch set, I just ran 492k creates/s
(1m54s) and 420k unlinks/s (2m18s) from the same 16 threads. IOWs, I
didn't need to go to 32 threads to see the perf improvement - as the
above profiles indicate, those extra 16 threads are effectively just
creating heat spinning on VFS locks...

Cheers,

Dave.
-- 
Dave Chinner
david@xxxxxxxxxxxxx
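
As an aside for anyone wanting to experiment with the idea outside
the kernel: a minimal userspace sketch of the pattern described in
the quoted cover letter - per-CPU lists and space accounting on the
commit path, aggregated only at push time - might look like the
following. The names (struct shard, cil_shard_add, cil_push) and the
per-shard pthread mutexes are illustrative stand-ins, not the
kernel's percpu API or the real CIL structures.

/*
 * Sketch of the percpu-list idea: instead of one global list behind
 * one spinlock (the xc_cil_lock pattern), each CPU appends committed
 * items to its own local list, and the pusher splices all the local
 * lists together only when it needs a single aggregate view.
 */
#include <pthread.h>
#include <stdio.h>
#include <stdlib.h>

#define NR_SHARDS 8	/* stand-in for the number of CPUs */

struct item {
	struct item	*next;
	long		bytes;		/* log space this item consumes */
};

struct shard {
	pthread_mutex_t	lock;		/* one owner per CPU: uncontended */
	struct item	*head;
	long		space_used;	/* per-CPU space accounting */
};

static struct shard shards[NR_SHARDS];

/* Commit path: touch only this CPU's shard, never a global lock. */
static void cil_shard_add(unsigned int cpu, struct item *it)
{
	struct shard *s = &shards[cpu % NR_SHARDS];

	pthread_mutex_lock(&s->lock);
	it->next = s->head;
	s->head = it;
	s->space_used += it->bytes;
	pthread_mutex_unlock(&s->lock);
}

/* Push path: splice every shard into one list and total the space. */
static struct item *cil_push(long *total_space)
{
	struct item *all = NULL;

	*total_space = 0;
	for (int i = 0; i < NR_SHARDS; i++) {
		struct shard *s = &shards[i];
		struct item *list;

		pthread_mutex_lock(&s->lock);
		list = s->head;
		s->head = NULL;
		*total_space += s->space_used;
		s->space_used = 0;
		pthread_mutex_unlock(&s->lock);

		while (list) {			/* splice onto the result */
			struct item *next = list->next;

			list->next = all;
			all = list;
			list = next;
		}
	}
	return all;
}

int main(void)
{
	long space;
	int n = 0;

	for (int i = 0; i < NR_SHARDS; i++)
		pthread_mutex_init(&shards[i].lock, NULL);

	for (int i = 0; i < 1000; i++) {
		struct item *it = malloc(sizeof(*it));

		it->bytes = 64;
		cil_shard_add(i, it);	/* i stands in for smp_processor_id() */
	}

	for (struct item *it = cil_push(&space); it; it = it->next)
		n++;
	printf("pushed %d items, %ld bytes\n", n, space);
	return 0;
}

The point of the pattern is that the commit path only ever takes a
lock that is effectively private to its own CPU, so the shared-
cacheline ping-pong that shows up as __pv_queued_spin_lock_slowpath
in the profiles above goes away; the aggregation cost is paid once
per push instead of once per transaction.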