On 5/19/20 3:23 PM, Dave Chinner wrote:
> From: Dave Chinner <dchinner@xxxxxxxxxx>
>
> Seeing massive cpu usage from xfs_agino_range() on one machine;
> instruction level profiles look similar to another machine running
> the same workload, only one machien is consuming 10x as much CPU as

's/machien/machine/', can be done at the time of applying patch.

> the other and going much slower. The only real difference between
> the two machines is core count per socket. Both are running
> identical 16p/16GB virtual machine configurations
>
> Machine A:
>
>      25.83%  [k] xfs_agino_range
>      12.68%  [k] __xfs_dir3_data_check
>       6.95%  [k] xfs_verify_ino
>       6.78%  [k] xfs_dir2_data_entry_tag_p
>       3.56%  [k] xfs_buf_find
>       2.31%  [k] xfs_verify_dir_ino
>       2.02%  [k] xfs_dabuf_map.constprop.0
>       1.65%  [k] xfs_ag_block_count
>
> And takes around 13 minutes to remove 50 million inodes.
>
> Machine B:
>
>      13.90%  [k] __pv_queued_spin_lock_slowpath
>       3.76%  [k] do_raw_spin_lock
>       2.83%  [k] xfs_dir3_leaf_check_int
>       2.75%  [k] xfs_agino_range
>       2.51%  [k] __raw_callee_save___pv_queued_spin_unlock
>       2.18%  [k] __xfs_dir3_data_check
>       2.02%  [k] xfs_log_commit_cil
>
> And takes around 5m30s to remove 50 million inodes.
>
> Suspect is cacheline contention on m_sectbb_log which is used in one
> of the macros in xfs_agino_range. This is a read-only variable but
> shares a cacheline with m_active_trans which is a global atomic that
> gets bounced all around the machine.
>
> The workload is trying to run hundreds of thousands of transactions
> per second and hence cacheline contention will be occuring on this

's/occuring/occurring/', can be done at the time of applying patch.

> atomic counter. Hence xfs_agino_range() is likely just be an
> innocent bystander as the cache coherency protocol fights over the
> cacheline between CPU cores and sockets.
>
> On machine A, this rearrangement of the struct xfs_mount
> results in the profile changing to:
>
>       9.77%  [kernel]  [k] xfs_agino_range
>       6.27%  [kernel]  [k] __xfs_dir3_data_check
>       5.31%  [kernel]  [k] __pv_queued_spin_lock_slowpath
>       4.54%  [kernel]  [k] xfs_buf_find
>       3.79%  [kernel]  [k] do_raw_spin_lock
>       3.39%  [kernel]  [k] xfs_verify_ino
>       2.73%  [kernel]  [k] __raw_callee_save___pv_queued_spin_unlock
>
> Vastly less CPU usage in xfs_agino_range(), but still 3x the amount
> of machine B and still runs substantially slower than it should.
>
> Current rm -rf of 50 million files:
>
>                 vanilla         patched
> machine A       13m20s          6m42s
> machine B        5m30s          5m02s
>
> It's an improvement, hence indicating that separation and further
> optimisation of read-only global filesystem data is worthwhile, but
> it clearly isn't the underlying issue causing this specific
> performance degradation.
>
> Signed-off-by: Dave Chinner <dchinner@xxxxxxxxxx>
> ---
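
For reference, the false-sharing pattern the commit message describes can be
sketched in plain C. The field names below are illustrative only, not the
actual struct xfs_mount layout, and a kernel patch would express the
separation with ____cacheline_aligned_in_smp rather than C11 alignas; this is
just a minimal userspace sketch of why the struct rearrangement helps:

    /*
     * Sketch of the layout problem: a read-mostly geometry field ends
     * up on the same cacheline as a hot atomic, so every transaction
     * start/commit invalidates the line that readers of the geometry
     * field depend on.
     */
    #include <stdalign.h>
    #include <stdatomic.h>
    #include <stdint.h>

    struct mount_before {
            uint8_t     sectbb_log;   /* read-only after mount */
            atomic_long active_trans; /* bumped on every transaction */
    };

    /*
     * Rearranged: the frequently written counter is pushed onto its
     * own cacheline, so reads of sectbb_log are no longer hit by
     * invalidations caused by active_trans bouncing between cores
     * and sockets.
     */
    struct mount_after {
            uint8_t     sectbb_log;   /* read-only after mount */
            /* ... other read-mostly geometry fields ... */
            alignas(64) atomic_long active_trans;
    };

    _Static_assert(alignof(struct mount_after) >= 64,
                   "hot counter gets its own cacheline");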