Hi folks,

Days ago, I received a report of an XFS Unixbench [1] shell1 (high-concurrency)
performance regression from a benchmark comparison between XFS and EXT4:
the XFS result was 15% lower than EXT4 on Linux 6.6.y with a 144-core
aarch64 machine (64K page size). Since Unixbench is a somewhat important
indicator of overall system performance for many end users, it's not a
good result.

The shell1 test [2] basically runs a loop that executes commands to
generate files (sort.$$, od.$$, grep.$$, wc.$$) and then removes them.
The test case lasts for one minute and then reports the total number of
iterations.

While no difference was observed in single-threaded results, a noticeable
gap shows up with `./Run shell1 -c 144 -i 1`.

The original report was on aarch64, but I can still reproduce some
difference on Linux 6.13 with an x86 physical machine:

  Intel(R) Xeon(R) Platinum 8331C CPU @ 2.50GHz * 96 cores
  512 GiB memory

XFS (35649.6) is still 4% lower than EXT4 (37146.0); the kconfig is
attached. However, I don't observe much difference on 5.10.y kernels.

After collecting some off-CPU traces, I found many new AGI buffer lock
waits compared with the corresponding 5.10.y trace, as below:

rm;el0t_64_sync;el0t_64_sync_handler;el0_svc;do_el0_svc;el0_svc_common.constprop.0;__arm64_sys_unlinkat;do_unlinkat;vfs_unlink;xfs_vn_unlink;xfs_remove;xfs_droplink;xfs_iunlink;xfs_read_agi;xfs_trans_read_buf_map;xfs_buf_read_map;xfs_buf_get_map;xfs_buf_lookup;xfs_buf_find_lock;xfs_buf_lock;down;__down;__down_common;___down_common;schedule_timeout;schedule;finish_task_switch.isra.0 2
..
rm;el0t_64_sync;el0t_64_sync_handler;el0_svc;do_el0_svc;el0_svc_common.constprop.0;__arm64_sys_unlinkat;do_unlinkat;vfs_unlink;xfs_vn_unlink;xfs_remove;xfs_droplink;xfs_iunlink;xfs_read_agi;xfs_trans_read_buf_map;xfs_buf_read_map;xfs_buf_get_map;xfs_buf_lookup;xfs_buf_find_lock;xfs_buf_lock;down;__down;__down_common;___down_common;schedule_timeout;schedule;finish_task_switch.isra.0 2
..
kworker/62:1;ret_from_fork;kthread;worker_thread;process_one_work;xfs_inodegc_worker;xfs_inodegc_inactivate;xfs_inactive;xfs_inactive_ifree;xfs_ifree;xfs_difree;xfs_ialloc_read_agi;xfs_read_agi;xfs_trans_read_buf_map;xfs_buf_read_map;xfs_buf_get_map;xfs_buf_lookup;xfs_buf_find_lock;xfs_buf_lock;down;__down;__down_common;___down_common;schedule_timeout;schedule;finish_task_switch.isra.0 5283
..

I tried a hack to disable deferred inode inactivation as below, and the
shell1 result then recovered: XFS (35649.6 -> 37810.9):

diff --git a/fs/xfs/xfs_icache.c b/fs/xfs/xfs_icache.c
index 7b6c026d01a1..d9fb2ef3686a 100644
--- a/fs/xfs/xfs_icache.c
+++ b/fs/xfs/xfs_icache.c
@@ -2059,6 +2059,7 @@ void
 xfs_inodegc_start(
 	struct xfs_mount	*mp)
 {
+	return;
 	if (xfs_set_inodegc_enabled(mp))
 		return;
 
@@ -2180,6 +2181,12 @@ xfs_inodegc_queue(
 	ip->i_flags |= XFS_NEED_INACTIVE;
 	spin_unlock(&ip->i_flags_lock);
 
+	if (1) {
+		xfs_iflags_set(ip, XFS_INACTIVATING);
+		xfs_inodegc_inactivate(ip);
+		return;
+	}
+
 	cpu_nr = get_cpu();
 	gc = this_cpu_ptr(mp->m_inodegc);
 	llist_add(&ip->i_gclist, &gc->list);

I don't have spare cycles to dig into this further for now, but hopefully
this report is useful ;)

Thanks,
Gao Xiang

[1] https://github.com/kdlucas/byte-unixbench
[2] https://github.com/kdlucas/byte-unixbench/blob/master/UnixBench/pgms/tst.sh
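P.S. To make the workload shape clearer: below is a rough, simplified
sketch of what one shell1 iteration per worker looks like, based only on
the description above (the real commands live in pgms/tst.sh [2], so the
exact pipelines and the input file name here are assumptions, not a
verbatim copy):

  # approximate shape of one shell1 iteration (input.txt is an assumed
  # sample input file, not the real tst.sh argument):
  sort < input.txt > sort.$$            # create sort.$$
  od sort.$$       > od.$$              # create od.$$
  grep the sort.$$ > grep.$$            # create grep.$$
  wc sort.$$       > wc.$$              # create wc.$$
  rm -f sort.$$ od.$$ grep.$$ wc.$$     # unlink all four again

With 144 workers running such a loop for a minute, file creation and
unlink are the hot paths, which matches the xfs_iunlink / xfs_inodegc
AGI buffer lock waits in the off-CPU trace above.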
Attachment:
config.gz
Description: GNU Zip compressed data