Re: [snitzer:cel-nfsd-next-6.12-rc5] [nfsd] b4be3ccf1c: fsmark.app_overhead 92.0% regression

On Wed, 2024-11-13 at 16:37 +0000, Chuck Lever III wrote:
> 
> > On Nov 13, 2024, at 11:20 AM, Jeff Layton <jlayton@xxxxxxxxxx> wrote:
> > 
> > On Wed, 2024-11-13 at 11:14 -0500, Chuck Lever wrote:
> > > On Wed, Nov 13, 2024 at 11:10:35AM -0500, Jeff Layton wrote:
> > > > On Wed, 2024-11-13 at 10:48 -0500, Chuck Lever wrote:
> > > > > On Thu, Nov 07, 2024 at 06:35:11AM -0500, Jeff Layton wrote:
> > > > > > On Thu, 2024-11-07 at 12:55 +0800, kernel test robot wrote:
> > > > > > > hi, Jeff Layton,
> > > > > > > 
> > > > > > > in the commit message, it is mentioned that the change is expected to
> > > > > > > solve the "App Overhead" regression on the fs_mark test we reported in
> > > > > > > https://lore.kernel.org/oe-lkp/202409161645.d44bced5-oliver.sang@xxxxxxxxx/
> > > > > > > 
> > > > > > > however, in our tests, there is still a similar regression. at the same
> > > > > > > time, there is still no performance difference for fsmark.files_per_sec:
> > > > > > > 
> > > > > > >   2015880 ±  3%     +92.0%    3870164        fsmark.app_overhead
> > > > > > >     18.57            +0.0%      18.57        fsmark.files_per_sec
> > > > > > > 
> > > > > > > 
> > > > > > > another thing is that our bot bisected to this commit in the repo/branch
> > > > > > > detailed below. if there is a more proper repo/branch to test the patch,
> > > > > > > could you let us know? thanks a lot!
> > > > > > > 
> > > > > > > 
> > > > > > > 
> > > > > > > Hello,
> > > > > > > 
> > > > > > > kernel test robot noticed a 92.0% regression of fsmark.app_overhead on:
> > > > > > > 
> > > > > > > 
> > > > > > > commit: b4be3ccf1c251cbd3a3cf5391a80fe3a5f6f075e ("nfsd: implement OPEN_ARGS_SHARE_ACCESS_WANT_OPEN_XOR_DELEGATION")
> > > > > > > https://git.kernel.org/cgit/linux/kernel/git/snitzer/linux.git cel-nfsd-next-6.12-rc5
> > > > > > > 
> > > > > > > 
> > > > > > > testcase: fsmark
> > > > > > > config: x86_64-rhel-8.3
> > > > > > > compiler: gcc-12
> > > > > > > test machine: 128 threads 2 sockets Intel(R) Xeon(R) Platinum 8358 CPU @ 2.60GHz (Ice Lake) with 128G memory
> > > > > > > parameters:
> > > > > > > 
> > > > > > > iterations: 1x
> > > > > > > nr_threads: 1t
> > > > > > > disk: 1HDD
> > > > > > > fs: btrfs
> > > > > > > fs2: nfsv4
> > > > > > > filesize: 4K
> > > > > > > test_size: 40M
> > > > > > > sync_method: fsyncBeforeClose
> > > > > > > nr_files_per_directory: 1fpd
> > > > > > > cpufreq_governor: performance
> > > > > > > 
> > > > > > > 
> > > > > > > 
> > > > > > > 
> > > > > > > If you fix the issue in a separate patch/commit (i.e. not just a new version of
> > > > > > > the same patch/commit), kindly add the following tags:
> > > > > > > > Reported-by: kernel test robot <oliver.sang@xxxxxxxxx>
> > > > > > > > Closes: https://lore.kernel.org/oe-lkp/202411071017.ddd9e9e2-oliver.sang@xxxxxxxxx
> > > > > > > 
> > > > > > > 
> > > > > > > Details are as below:
> > > > > > > -------------------------------------------------------------------------------------------------->
> > > > > > > 
> > > > > > > 
> > > > > > > The kernel config and materials to reproduce are available at:
> > > > > > > https://download.01.org/0day-ci/archive/20241107/202411071017.ddd9e9e2-oliver.sang@xxxxxxxxx
> > > > > > > 
> > > > > > > =========================================================================================
> > > > > > > compiler/cpufreq_governor/disk/filesize/fs2/fs/iterations/kconfig/nr_files_per_directory/nr_threads/rootfs/sync_method/tbox_group/test_size/testcase:
> > > > > > >  gcc-12/performance/1HDD/4K/nfsv4/btrfs/1x/x86_64-rhel-8.3/1fpd/1t/debian-12-x86_64-20240206.cgz/fsyncBeforeClose/lkp-icl-2sp6/40M/fsmark
> > > > > > > 
> > > > > > > commit: 
> > > > > > >  37f27b20cd ("nfsd: add support for FATTR4_OPEN_ARGUMENTS")
> > > > > > >  b4be3ccf1c ("nfsd: implement OPEN_ARGS_SHARE_ACCESS_WANT_OPEN_XOR_DELEGATION")
> > > > > > > 
> > > > > > > 37f27b20cd64e2e0 b4be3ccf1c251cbd3a3cf5391a8 
> > > > > > > ---------------- --------------------------- 
> > > > > > >         %stddev     %change         %stddev
> > > > > > >             \          |                \  
> > > > > > >     97.33 ±  9%     -16.3%      81.50 ±  9%  perf-c2c.HITM.local
> > > > > > >      3788 ±101%    +147.5%       9377 ±  6%  sched_debug.cfs_rq:/.load_avg.max
> > > > > > >      2936            -6.2%       2755        vmstat.system.cs
> > > > > > >   2015880 ±  3%     +92.0%    3870164        fsmark.app_overhead
> > > > > > >     18.57            +0.0%      18.57        fsmark.files_per_sec
> > > > > > >     53420           -17.3%      44185        fsmark.time.voluntary_context_switches
> > > > > > >      1.50 ±  7%     +13.4%       1.70 ±  3%  perf-sched.wait_time.avg.ms.devkmsg_read.vfs_read.ksys_read.do_syscall_64
> > > > > > >      3.00 ±  7%     +13.4%       3.40 ±  3%  perf-sched.wait_time.max.ms.devkmsg_read.vfs_read.ksys_read.do_syscall_64
> > > > > > >   1756957            -2.1%    1720536        proc-vmstat.numa_hit
> > > > > > >   1624496            -2.2%    1588039        proc-vmstat.numa_local
> > > > > > >      1.28 ±  4%      -8.2%       1.17 ±  3%  perf-stat.i.MPKI
> > > > > > >      2916            -6.2%       2735        perf-stat.i.context-switches
> > > > > > >      1529 ±  4%      +8.2%       1655 ±  3%  perf-stat.i.cycles-between-cache-misses
> > > > > > >      2910            -6.2%       2729        perf-stat.ps.context-switches
> > > > > > >      0.67 ± 15%      -0.4        0.27 ±100%  perf-profile.calltrace.cycles-pp._Fork
> > > > > > >      0.95 ± 15%      +0.3        1.26 ± 11%  perf-profile.calltrace.cycles-pp.__sched_setaffinity.sched_setaffinity.__x64_sys_sched_setaffinity.do_syscall_64.entry_SYSCALL_64_after_hwframe
> > > > > > >      0.70 ± 47%      +0.3        1.04 ±  9%  perf-profile.calltrace.cycles-pp._nohz_idle_balance.do_idle.cpu_startup_entry.start_secondary.common_startup_64
> > > > > > >      0.52 ± 45%      +0.3        0.86 ± 15%  perf-profile.calltrace.cycles-pp.seq_read_iter.vfs_read.ksys_read.do_syscall_64.entry_SYSCALL_64_after_hwframe
> > > > > > >      0.72 ± 50%      +0.4        1.12 ± 18%  perf-profile.calltrace.cycles-pp.link_path_walk.path_openat.do_filp_open.do_sys_openat2.__x64_sys_openat
> > > > > > >      1.22 ± 26%      +0.4        1.67 ± 12%  perf-profile.calltrace.cycles-pp.sched_setaffinity.evlist_cpu_iterator__next.read_counters.process_interval.dispatch_events
> > > > > > >      2.20 ± 11%      +0.6        2.78 ± 13%  perf-profile.calltrace.cycles-pp.do_syscall_64.entry_SYSCALL_64_after_hwframe.read
> > > > > > >      2.20 ± 11%      +0.6        2.82 ± 12%  perf-profile.calltrace.cycles-pp.entry_SYSCALL_64_after_hwframe.read
> > > > > > >      2.03 ± 13%      +0.6        2.67 ± 13%  perf-profile.calltrace.cycles-pp.ksys_read.do_syscall_64.entry_SYSCALL_64_after_hwframe.read
> > > > > > >      0.68 ± 15%      -0.2        0.47 ± 19%  perf-profile.children.cycles-pp._Fork
> > > > > > >      0.56 ± 15%      -0.1        0.42 ± 16%  perf-profile.children.cycles-pp.tcp_v6_do_rcv
> > > > > > >      0.46 ± 13%      -0.1        0.35 ± 16%  perf-profile.children.cycles-pp.alloc_pages_mpol_noprof
> > > > > > >      0.10 ± 75%      +0.2        0.29 ± 19%  perf-profile.children.cycles-pp.refresh_cpu_vm_stats
> > > > > > >      0.28 ± 33%      +0.2        0.47 ± 16%  perf-profile.children.cycles-pp.show_stat
> > > > > > >      0.34 ± 32%      +0.2        0.54 ± 16%  perf-profile.children.cycles-pp.fold_vm_numa_events
> > > > > > >      0.37 ± 11%      +0.3        0.63 ± 23%  perf-profile.children.cycles-pp.setup_items_for_insert
> > > > > > >      0.88 ± 15%      +0.3        1.16 ± 12%  perf-profile.children.cycles-pp.__set_cpus_allowed_ptr
> > > > > > >      0.37 ± 14%      +0.3        0.67 ± 61%  perf-profile.children.cycles-pp.btrfs_writepages
> > > > > > >      0.37 ± 14%      +0.3        0.67 ± 61%  perf-profile.children.cycles-pp.extent_write_cache_pages
> > > > > > >      0.64 ± 19%      +0.3        0.94 ± 15%  perf-profile.children.cycles-pp.btrfs_insert_empty_items
> > > > > > >      0.38 ± 12%      +0.3        0.68 ± 58%  perf-profile.children.cycles-pp.btrfs_fdatawrite_range
> > > > > > >      0.32 ± 23%      +0.3        0.63 ± 64%  perf-profile.children.cycles-pp.extent_writepage
> > > > > > >      0.97 ± 14%      +0.3        1.31 ± 10%  perf-profile.children.cycles-pp.__sched_setaffinity
> > > > > > >      1.07 ± 16%      +0.4        1.44 ± 10%  perf-profile.children.cycles-pp.__x64_sys_sched_setaffinity
> > > > > > >      1.39 ± 18%      +0.5        1.90 ± 12%  perf-profile.children.cycles-pp.seq_read_iter
> > > > > > >      0.34 ± 30%      +0.2        0.52 ± 16%  perf-profile.self.cycles-pp.fold_vm_numa_events
> > > > > > > 
> > > > > > > 
> > > > > > > 
> > > > > > > 
> > > > > > > Disclaimer:
> > > > > > > Results have been estimated based on internal Intel analysis and are provided
> > > > > > > for informational purposes only. Any difference in system hardware or software
> > > > > > > design or configuration may affect actual performance.
> > > > > > > 
> > > > > > > 
> > > > > > 
> > > > > > This patch (b4be3ccf1c) is exceedingly simple, so I doubt it's causing
> > > > > > a performance regression in the server. The only thing I can figure
> > > > > > here is that this test is causing the server to recall the delegation
> > > > > > that it hands out, and then the client has to go and establish a new
> > > > > > open stateid in order to return it. That would likely be slower than
> > > > > > just handing out both an open and delegation stateid in the first
> > > > > > place.
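> > > > > > 
> > > > > > Roughly, the extra work I have in mind on recall looks like this
> > > > > > (just a hand-wavy sketch of the client side, not actual fs/nfs
> > > > > > code; helpers like have_open_stateid() are made up here for
> > > > > > illustration):
> > > > > > 
> > > > > > 	/* Delegation recall when the server granted a delegation
> > > > > > 	 * *instead of* an open stateid (the XOR case).
> > > > > > 	 */
> > > > > > 	static void handle_cb_recall(struct inode *inode)
> > > > > > 	{
> > > > > > 		if (!have_open_stateid(inode))
> > > > > > 			/*
> > > > > > 			 * No open stateid was ever issued, so first
> > > > > > 			 * do an OPEN with CLAIM_DELEGATE_CUR to get
> > > > > > 			 * one -- an extra round trip that doesn't
> > > > > > 			 * happen when both stateids are handed out
> > > > > > 			 * up front.
> > > > > > 			 */
> > > > > > 			send_open_claim_delegate_cur(inode);
> > > > > > 
> > > > > > 		/* Only then can the delegation be returned. */
> > > > > > 		send_delegreturn(inode);
> > > > > > 	}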
> > > > > > 
> > > > > > I don't think there is anything we can do about that though, since the
> > > > > > feature seems to be working as designed.
> > > > > 
> > > > > We seem to have hit this problem before. That makes me wonder
> > > > > whether it is actually worth supporting the XOR flag at all. After
> > > > > all, the client sends the CLOSE asynchronously; applications are not
> > > > > being held up while the unneeded state ID is returned.
> > > > > 
> > > > > Can XOR support be disabled for now? I don't want to add an
> > > > > administrative interface for that, but also, "no regressions" is
> > > > > ringing in my ears, and 92% is a mighty noticeable one.
> > > > 
> > > > To be clear, this increase is in the "App Overhead", which covers all
> > > > of the operations between the ones actually being measured in this
> > > > test. I did run this test for a bit and got similar results, but was
> > > > never able to nail down where the overhead came from. My speculation is
> > > > that it's the recall and reestablishment of an open stateid that slows
> > > > things down, but I could never fully confirm it.
> > > > 
> > > > My issue with disabling this is that the decision of whether to set
> > > > OPEN_XOR_DELEGATION is up to the client. The server in this case is
> > > > just doing what the client asks. ISTM that if we were going to disable
> > > > (or throttle) this anywhere, that should be done by the client.
> > > 
> > > If the client sets the XOR flag, NFSD could simply return only an
> > > OPEN stateid, for instance.
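> > > 
> > > Something like this, say (an untested one-line sketch; the exact
> > > field and flag names in nfsd may not match what's in the tree):
> > > 
> > > 	/* In the OPEN processing path: mask off the client's
> > > 	 * request so nfsd keeps handing out an open stateid (and
> > > 	 * possibly a delegation) just as it did before.
> > > 	 */
> > > 	open->op_share_access &=
> > > 		~OPEN4_SHARE_ACCESS_WANT_OPEN_XOR_DELEGATION;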
> > 
> > Sure, but then it's disabled for everyone, for every workload. There
> > may be workloads that benefit, too.
> 
> It doesn't impact NFSv4.[01] mounts, nor will it impact any
> workload on a client that does not now support XOR.
> 
> I don't think we need bad press around this feature, nor do we
> want complaints about "why didn't you add a switch to disable
> this?". I'd also like to avoid a bug filed for this regression
> if it gets merged.
> 
> Can you test to see if returning only the OPEN stateid makes the
> regression go away? The behavior would be in place only until
> there is a root cause and a way to avoid the regression.
> 

I don't have a test rig set up for this at the moment, as this sort of
testing requires real hardware. I also don't have anything automated
for doing this.

That will probably take a while, so if you're that concerned, then just
drop the patch. Maybe we'll get a datapoint from the Hammerspace folks
that will help clarify the problem in the meantime.

-- 
Jeff Layton <jlayton@xxxxxxxxxx>




