Hi folks,

Days ago, I received a report of an XFS Unixbench [1] shell1 (high-concurrency)
performance regression from a benchmark comparison between XFS and EXT4:
the XFS result was 15% lower than EXT4 on Linux 6.6.y with a 144-core
aarch64 machine (64K page size). Since Unixbench is a somewhat important
indicator of overall system performance for many end users, it's not a
good result.

The shell1 test [2] basically runs a loop that executes commands to
generate files (sort.$$, od.$$, grep.$$, wc.$$) and then removes them.
The test case lasts for one minute and then reports the total number of
iterations.

While no difference was observed in single-threaded results, a noticeable
gap shows up with `./Run shell1 -c 144 -i 1`.

The original report was on aarch64, but I can still reproduce some
difference on Linux 6.13 with an x86 physical machine:

  Intel(R) Xeon(R) Platinum 8331C CPU @ 2.50GHz * 96 cores
  512 GiB memory

XFS (35649.6) is still 4% lower than EXT4 (37146.0); the kconfig is
attached. However, I don't observe much difference on 5.10.y kernels.

After collecting some off-CPU traces, I found many new AGI buffer lock
waits compared with the corresponding 5.10.y trace, as below:

rm;el0t_64_sync;el0t_64_sync_handler;el0_svc;do_el0_svc;el0_svc_common.constprop.0;__arm64_sys_unlinkat;do_unlinkat;vfs_unlink;xfs_vn_unlink;xfs_remove;xfs_droplink;xfs_iunlink;xfs_read_agi;xfs_trans_read_buf_map;xfs_buf_read_map;xfs_buf_get_map;xfs_buf_lookup;xfs_buf_find_lock;xfs_buf_lock;down;__down;__down_common;___down_common;schedule_timeout;schedule;finish_task_switch.isra.0 2
..
rm;el0t_64_sync;el0t_64_sync_handler;el0_svc;do_el0_svc;el0_svc_common.constprop.0;__arm64_sys_unlinkat;do_unlinkat;vfs_unlink;xfs_vn_unlink;xfs_remove;xfs_droplink;xfs_iunlink;xfs_read_agi;xfs_trans_read_buf_map;xfs_buf_read_map;xfs_buf_get_map;xfs_buf_lookup;xfs_buf_find_lock;xfs_buf_lock;down;__down;__down_common;___down_common;schedule_timeout;schedule;finish_task_switch.isra.0 2
..
kworker/62:1;ret_from_fork;kthread;worker_thread;process_one_work;xfs_inodegc_worker;xfs_inodegc_inactivate;xfs_inactive;xfs_inactive_ifree;xfs_ifree;xfs_difree;xfs_ialloc_read_agi;xfs_read_agi;xfs_trans_read_buf_map;xfs_buf_read_map;xfs_buf_get_map;xfs_buf_lookup;xfs_buf_find_lock;xfs_buf_lock;down;__down;__down_common;___down_common;schedule_timeout;schedule;finish_task_switch.isra.0 5283
..

I tried a hack to disable deferred inode inactivation as below, and the
shell1 result then recovered: XFS (35649.6 -> 37810.9):

diff --git a/fs/xfs/xfs_icache.c b/fs/xfs/xfs_icache.c
index 7b6c026d01a1..d9fb2ef3686a 100644
--- a/fs/xfs/xfs_icache.c
+++ b/fs/xfs/xfs_icache.c
@@ -2059,6 +2059,7 @@ void
 xfs_inodegc_start(
 	struct xfs_mount	*mp)
 {
+	return;
 	if (xfs_set_inodegc_enabled(mp))
 		return;
 
@@ -2180,6 +2181,12 @@ xfs_inodegc_queue(
 	ip->i_flags |= XFS_NEED_INACTIVE;
 	spin_unlock(&ip->i_flags_lock);
 
+	if (1) {
+		xfs_iflags_set(ip, XFS_INACTIVATING);
+		xfs_inodegc_inactivate(ip);
+		return;
+	}
+
 	cpu_nr = get_cpu();
 	gc = this_cpu_ptr(mp->m_inodegc);
 	llist_add(&ip->i_gclist, &gc->list);

I don't have spare cycles to dig into this further for now, but hopefully
this report is useful ;)

Thanks,
Gao Xiang

[1] https://github.com/kdlucas/byte-unixbench
[2] https://github.com/kdlucas/byte-unixbench/blob/master/UnixBench/pgms/tst.sh
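P.S. To make the workload shape clearer: below is a rough, simplified
sketch of what one shell1 iteration per worker looks like, based only on
the description above (the real commands live in pgms/tst.sh [2], so the
exact pipelines and the input file name here are assumptions, not a
verbatim copy):

  # approximate shape of one shell1 iteration (input.txt is an assumed
  # sample input file, not the real tst.sh argument):
  sort < input.txt > sort.$$            # create sort.$$
  od sort.$$       > od.$$              # create od.$$
  grep the sort.$$ > grep.$$            # create grep.$$
  wc sort.$$       > wc.$$              # create wc.$$
  rm -f sort.$$ od.$$ grep.$$ wc.$$     # unlink all four again

With 144 workers running such a loop for a minute, file creation and
unlink are the hot paths, which matches the xfs_iunlink / xfs_inodegc
AGI buffer lock waits in the off-CPU trace above.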
Attachment:
config.gz
Description: GNU Zip compressed data