On Wed, 2024-11-13 at 11:14 -0500, Chuck Lever wrote:
> On Wed, Nov 13, 2024 at 11:10:35AM -0500, Jeff Layton wrote:
> > On Wed, 2024-11-13 at 10:48 -0500, Chuck Lever wrote:
> > > On Thu, Nov 07, 2024 at 06:35:11AM -0500, Jeff Layton wrote:
> > > > On Thu, 2024-11-07 at 12:55 +0800, kernel test robot wrote:
> > > > > hi, Jeff Layton,
> > > > >
> > > > > in the commit message, it is mentioned that the change is expected to solve
> > > > > the "App Overhead" on the fs_mark test we reported in
> > > > > https://lore.kernel.org/oe-lkp/202409161645.d44bced5-oliver.sang@xxxxxxxxx/
> > > > >
> > > > > however, in our tests, there is still a similar regression. at the same
> > > > > time, there is still no performance difference for fsmark.files_per_sec
> > > > >
> > > > > 2015880 ± 3% +92.0% 3870164 fsmark.app_overhead
> > > > > 18.57 +0.0% 18.57 fsmark.files_per_sec
> > > > >
> > > > > another thing is that our bot bisected to this commit in the repo/branch
> > > > > detailed below. if there is a more proper repo/branch to test the patch,
> > > > > could you let us know? thanks a lot!
> > > > >
> > > > >
> > > > > Hello,
> > > > >
> > > > > kernel test robot noticed a 92.0% regression of fsmark.app_overhead on:
> > > > >
> > > > > commit: b4be3ccf1c251cbd3a3cf5391a80fe3a5f6f075e ("nfsd: implement OPEN_ARGS_SHARE_ACCESS_WANT_OPEN_XOR_DELEGATION")
> > > > > https://git.kernel.org/cgit/linux/kernel/git/snitzer/linux.git cel-nfsd-next-6.12-rc5
> > > > >
> > > > > testcase: fsmark
> > > > > config: x86_64-rhel-8.3
> > > > > compiler: gcc-12
> > > > > test machine: 128 threads 2 sockets Intel(R) Xeon(R) Platinum 8358 CPU @ 2.60GHz (Ice Lake) with 128G memory
> > > > > parameters:
> > > > >
> > > > > iterations: 1x
> > > > > nr_threads: 1t
> > > > > disk: 1HDD
> > > > > fs: btrfs
> > > > > fs2: nfsv4
> > > > > filesize: 4K
> > > > > test_size: 40M
> > > > > sync_method: fsyncBeforeClose
> > > > > nr_files_per_directory: 1fpd
> > > > > cpufreq_governor: performance
> > > > >
> > > > > If you fix the issue in a separate patch/commit (i.e. not just a new version of
> > > > > the same patch/commit), kindly add following tags
> > > > > | Reported-by: kernel test robot <oliver.sang@xxxxxxxxx>
> > > > > | Closes: https://lore.kernel.org/oe-lkp/202411071017.ddd9e9e2-oliver.sang@xxxxxxxxx
> > > > >
> > > > > Details are as below:
> > > > > -------------------------------------------------------------------------------------------------->
> > > > >
> > > > > The kernel config and materials to reproduce are available at:
> > > > > https://download.01.org/0day-ci/archive/20241107/202411071017.ddd9e9e2-oliver.sang@xxxxxxxxx
> > > > >
> > > > > =========================================================================================
> > > > > compiler/cpufreq_governor/disk/filesize/fs2/fs/iterations/kconfig/nr_files_per_directory/nr_threads/rootfs/sync_method/tbox_group/test_size/testcase:
> > > > > gcc-12/performance/1HDD/4K/nfsv4/btrfs/1x/x86_64-rhel-8.3/1fpd/1t/debian-12-x86_64-20240206.cgz/fsyncBeforeClose/lkp-icl-2sp6/40M/fsmark
> > > > >
> > > > > commit:
> > > > > 37f27b20cd ("nfsd: add support for FATTR4_OPEN_ARGUMENTS")
> > > > > b4be3ccf1c ("nfsd: implement OPEN_ARGS_SHARE_ACCESS_WANT_OPEN_XOR_DELEGATION")
> > > > >
> > > > > 37f27b20cd64e2e0 b4be3ccf1c251cbd3a3cf5391a8
> > > > > ---------------- ---------------------------
> > > > > %stddev %change %stddev
> > > > > \ | \
> > > > > 97.33 ± 9% -16.3% 81.50 ± 9% perf-c2c.HITM.local
> > > > > 3788 ±101% +147.5% 9377 ± 6% sched_debug.cfs_rq:/.load_avg.max
> > > > > 2936 -6.2% 2755 vmstat.system.cs
> > > > > 2015880 ± 3% +92.0% 3870164 fsmark.app_overhead
> > > > > 18.57 +0.0% 18.57 fsmark.files_per_sec
> > > > > 53420 -17.3% 44185 fsmark.time.voluntary_context_switches
> > > > > 1.50 ± 7% +13.4% 1.70 ± 3% perf-sched.wait_time.avg.ms.devkmsg_read.vfs_read.ksys_read.do_syscall_64
> > > > > 3.00 ± 7% +13.4% 3.40 ± 3% perf-sched.wait_time.max.ms.devkmsg_read.vfs_read.ksys_read.do_syscall_64
> > > > > 1756957 -2.1% 1720536 proc-vmstat.numa_hit
> > > > > 1624496 -2.2% 1588039 proc-vmstat.numa_local
> > > > > 1.28 ± 4% -8.2% 1.17 ± 3% perf-stat.i.MPKI
> > > > > 2916 -6.2% 2735 perf-stat.i.context-switches
> > > > > 1529 ± 4% +8.2% 1655 ± 3% perf-stat.i.cycles-between-cache-misses
> > > > > 2910 -6.2% 2729 perf-stat.ps.context-switches
> > > > > 0.67 ± 15% -0.4 0.27 ±100% perf-profile.calltrace.cycles-pp._Fork
> > > > > 0.95 ± 15% +0.3 1.26 ± 11% perf-profile.calltrace.cycles-pp.__sched_setaffinity.sched_setaffinity.__x64_sys_sched_setaffinity.do_syscall_64.entry_SYSCALL_64_after_hwframe
> > > > > 0.70 ± 47% +0.3 1.04 ± 9% perf-profile.calltrace.cycles-pp._nohz_idle_balance.do_idle.cpu_startup_entry.start_secondary.common_startup_64
> > > > > 0.52 ± 45% +0.3 0.86 ± 15% perf-profile.calltrace.cycles-pp.seq_read_iter.vfs_read.ksys_read.do_syscall_64.entry_SYSCALL_64_after_hwframe
> > > > > 0.72 ± 50% +0.4 1.12 ± 18% perf-profile.calltrace.cycles-pp.link_path_walk.path_openat.do_filp_open.do_sys_openat2.__x64_sys_openat
> > > > > 1.22 ± 26% +0.4 1.67 ± 12% perf-profile.calltrace.cycles-pp.sched_setaffinity.evlist_cpu_iterator__next.read_counters.process_interval.dispatch_events
> > > > > 2.20 ± 11% +0.6 2.78 ± 13% perf-profile.calltrace.cycles-pp.do_syscall_64.entry_SYSCALL_64_after_hwframe.read
> > > > > 2.20 ± 11% +0.6 2.82 ± 12% perf-profile.calltrace.cycles-pp.entry_SYSCALL_64_after_hwframe.read
> > > > > 2.03 ± 13% +0.6 2.67 ± 13% perf-profile.calltrace.cycles-pp.ksys_read.do_syscall_64.entry_SYSCALL_64_after_hwframe.read
> > > > > 0.68 ± 15% -0.2 0.47 ± 19% perf-profile.children.cycles-pp._Fork
> > > > > 0.56 ± 15% -0.1 0.42 ± 16% perf-profile.children.cycles-pp.tcp_v6_do_rcv
> > > > > 0.46 ± 13% -0.1 0.35 ± 16% perf-profile.children.cycles-pp.alloc_pages_mpol_noprof
> > > > > 0.10 ± 75% +0.2 0.29 ± 19% perf-profile.children.cycles-pp.refresh_cpu_vm_stats
> > > > > 0.28 ± 33% +0.2 0.47 ± 16% perf-profile.children.cycles-pp.show_stat
> > > > > 0.34 ± 32% +0.2 0.54 ± 16% perf-profile.children.cycles-pp.fold_vm_numa_events
> > > > > 0.37 ± 11% +0.3 0.63 ± 23% perf-profile.children.cycles-pp.setup_items_for_insert
> > > > > 0.88 ± 15% +0.3 1.16 ± 12% perf-profile.children.cycles-pp.__set_cpus_allowed_ptr
> > > > > 0.37 ± 14% +0.3 0.67 ± 61% perf-profile.children.cycles-pp.btrfs_writepages
> > > > > 0.37 ± 14% +0.3 0.67 ± 61% perf-profile.children.cycles-pp.extent_write_cache_pages
> > > > > 0.64 ± 19% +0.3 0.94 ± 15% perf-profile.children.cycles-pp.btrfs_insert_empty_items
> > > > > 0.38 ± 12% +0.3 0.68 ± 58% perf-profile.children.cycles-pp.btrfs_fdatawrite_range
> > > > > 0.32 ± 23% +0.3 0.63 ± 64% perf-profile.children.cycles-pp.extent_writepage
> > > > > 0.97 ± 14% +0.3 1.31 ± 10% perf-profile.children.cycles-pp.__sched_setaffinity
> > > > > 1.07 ± 16% +0.4 1.44 ± 10% perf-profile.children.cycles-pp.__x64_sys_sched_setaffinity
> > > > > 1.39 ± 18% +0.5 1.90 ± 12% perf-profile.children.cycles-pp.seq_read_iter
> > > > > 0.34 ± 30% +0.2 0.52 ± 16% perf-profile.self.cycles-pp.fold_vm_numa_events
> > > > >
> > > > >
> > > > > Disclaimer:
> > > > > Results have been estimated based on internal Intel analysis and are provided
> > > > > for informational purposes only. Any difference in system hardware or software
> > > > > design or configuration may affect actual performance.
> > > >
> > > > This patch (b4be3ccf1c) is exceedingly simple, so I doubt it's causing
> > > > a performance regression in the server. The only thing I can figure
> > > > here is that this test is causing the server to recall the delegation
> > > > that it hands out, and then the client has to go and establish a new
> > > > open stateid in order to return it. That would likely be slower than
> > > > just handing out both an open and delegation stateid in the first
> > > > place.
> > > >
> > > > I don't think there is anything we can do about that though, since the
> > > > feature seems to be working as designed.
> > >
> > > We seem to have hit this problem before. That makes me wonder
> > > whether it is actually worth supporting the XOR flag at all. After
> > > all, the client sends the CLOSE asynchronously; applications are not
> > > being held up while the unneeded state ID is returned.
> > >
> > > Can XOR support be disabled for now? I don't want to add an
> > > administrative interface for that, but also, "no regressions" is
> > > ringing in my ears, and 92% is a mighty noticeable one.
> >
> > To be clear, this increase is for the "App Overhead", which is all of
> > the operations in between the stuff that is being measured in this
> > test. I did run this test for a bit and got similar results, but was
> > never able to nail down where the overhead came from. My speculation
> > is that it's the recall and reestablishment of an open stateid that
> > slows things down, but I never could fully confirm it.
> >
> > My issue with disabling this is that the decision of whether to set
> > OPEN_XOR_DELEGATION is up to the client. The server in this case is
> > just doing what the client asks.
> > ISTM that if we were going to disable (or throttle) this anywhere,
> > that should be done by the client.
>
> If the client sets the XOR flag, NFSD could simply return only an
> OPEN stateid, for instance.
>

Sure, but then it's disabled for everyone, for every workload. There may
be workloads that benefit too.

What would really be nice is another datapoint. To the HS guys: If you
run this test against the Anvil both with and without OPEN_XOR_DELEGATION
enabled, do you also see an increase in the "App Overhead" with this
enabled? It would be nice to know if this effect is common among servers
that implement this, or whether it's something particular to knfsd.
-- 
Jeff Layton <jlayton@xxxxxxxxxx>
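
For readers following along, below is a minimal, self-contained sketch of
the open-versus-delegation choice being debated above, assuming a toy
server model. Every identifier in it is invented for illustration; this is
not nfsd code, and a real server makes this decision deep inside its OPEN
processing.

/*
 * Toy model only: roughly how a server might act on a client that sets
 * OPEN_ARGS_SHARE_ACCESS_WANT_OPEN_XOR_DELEGATION. All names here are
 * hypothetical and do not come from the nfsd sources.
 */
#include <stdbool.h>
#include <stdio.h>

enum stateid_kind { STATEID_OPEN, STATEID_DELEG };

/* Which single stateid the OPEN reply carries back to the client. */
struct open_result {
	enum stateid_kind kind;
};

/*
 * With the XOR flag set and a delegation available, hand back only the
 * delegation, so there is no open stateid for the client to CLOSE later.
 * Otherwise fall back to a plain open stateid. A server that wanted to
 * sidestep the recall/re-open cost discussed above could simply ignore
 * want_xor and always take the open-stateid path.
 */
static struct open_result do_open(bool want_xor, bool can_delegate)
{
	struct open_result res;

	if (want_xor && can_delegate)
		res.kind = STATEID_DELEG;
	else
		res.kind = STATEID_OPEN;
	return res;
}

int main(void)
{
	struct open_result r = do_open(true, true);

	printf("client gets %s stateid only\n",
	       r.kind == STATEID_DELEG ? "a delegation" : "an open");
	return 0;
}

The trade-off Jeff raises applies to the fallback branch: hardwiring it on
the server disables the optimization for every workload, not just the ones
that trip over delegation recalls.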