Hi, Jeff Layton,

On Tue, Oct 01, 2024 at 08:44:43AM -0400, Jeff Layton wrote:
> On Mon, 2024-09-16 at 16:41 +0800, kernel test robot wrote:
> > 
> > Hello,
> > 
> > kernel test robot noticed a 81.6% regression of fsmark.app_overhead on:
> > 
> > 
> > commit: 8cb33389f66441dc4e54b28fe0d9bd4bcd9b796d ("nfsd: implement OPEN_ARGS_SHARE_ACCESS_WANT_OPEN_XOR_DELEGATION")
> > https://git.kernel.org/cgit/linux/kernel/git/next/linux-next.git master
> > 
> > testcase: fsmark
> > test machine: 128 threads 2 sockets Intel(R) Xeon(R) Platinum 8358 CPU @ 2.60GHz (Ice Lake) with 128G memory
> > parameters:
> > 
> > 	iterations: 1x
> > 	nr_threads: 1t
> > 	disk: 1HDD
> > 	fs: btrfs
> > 	fs2: nfsv4
> > 	filesize: 4K
> > 	test_size: 40M
> > 	sync_method: fsyncBeforeClose
> > 	nr_files_per_directory: 1fpd
> > 	cpufreq_governor: performance
> > 
> > 
> > 
> > If you fix the issue in a separate patch/commit (i.e. not just a new version of
> > the same patch/commit), kindly add following tags
> > > Reported-by: kernel test robot <oliver.sang@xxxxxxxxx>
> > > Closes: https://lore.kernel.org/oe-lkp/202409161645.d44bced5-oliver.sang@xxxxxxxxx
> > 
> > 
> > Details are as below:
> > -------------------------------------------------------------------------------------------------->
> > 
> > 
> > The kernel config and materials to reproduce are available at:
> > https://download.01.org/0day-ci/archive/20240916/202409161645.d44bced5-oliver.sang@xxxxxxxxx
> > 
> > =========================================================================================
> > compiler/cpufreq_governor/disk/filesize/fs2/fs/iterations/kconfig/nr_files_per_directory/nr_threads/rootfs/sync_method/tbox_group/test_size/testcase:
> >   gcc-12/performance/1HDD/4K/nfsv4/btrfs/1x/x86_64-rhel-8.3/1fpd/1t/debian-12-x86_64-20240206.cgz/fsyncBeforeClose/lkp-icl-2sp6/40M/fsmark
> > 
> > commit:
> >   e29c78a693 ("nfsd: add support for FATTR4_OPEN_ARGUMENTS")
> >   8cb33389f6 ("nfsd: implement OPEN_ARGS_SHARE_ACCESS_WANT_OPEN_XOR_DELEGATION")
> > 
> > e29c78a6936e7422            8cb33389f66441dc4e54b28fe0d
> > ----------------            ---------------------------
> >          %stddev     %change         %stddev
> >              \          |                \
> >      24388 ± 20%     -32.8%      16400 ± 18%  numa-vmstat.node0.nr_slab_reclaimable
> >      61.50 ±  4%     -10.6%      55.00 ±  6%  perf-c2c.HITM.local
> >       0.20 ±  3%     +23.0%       0.24 ± 13%  perf-sched.sch_delay.max.ms.__cond_resched.__wait_for_common.affine_move_task.__set_cpus_allowed_ptr.__sched_setaffinity
> >       2977            -6.1%       2796        vmstat.system.cs
> >    2132466 ±  2%     +81.6%    3871852        fsmark.app_overhead
> 
> I have been unable to reproduce this result with fs_mark. I've run a
> number of repeated tests, and I can create files just as fast with or
> without this patch (roughly ~46 files/s on my test machine).
> 
> I'm particularly suspicious of the fsmark.app_overhead value above. The
> fsmark output says:
> 
> # App overhead is time in microseconds spent in the test not doing file writing related system calls.
> 
> That seems outside the purview of anything we're altering here, so I
> have to wonder if something else is going on. Oliver, can you rerun
> this test and see if this regression is reproducible?

We rebuilt the kernel and reran the tests for both the parent and this
commit; the data is stable and the regression is still reproducible.
But one thing I want to mention is that there is no diff for
"fsmark.files_per_sec":

=========================================================================================
compiler/cpufreq_governor/disk/filesize/fs2/fs/iterations/kconfig/nr_files_per_directory/nr_threads/rootfs/sync_method/tbox_group/test_size/testcase:
  gcc-12/performance/1HDD/4K/nfsv4/btrfs/1x/x86_64-rhel-8.3/1fpd/1t/debian-12-x86_64-20240206.cgz/fsyncBeforeClose/lkp-icl-2sp6/40M/fsmark

e29c78a6936e7422            8cb33389f66441dc4e54b28fe0d
----------------            ---------------------------
         %stddev     %change         %stddev
             \          |                \
   2080540 ±  2%     +86.6%    3881862        fsmark.app_overhead
     18.56            +0.0%      18.56        fsmark.files_per_sec

We also happened to send another report with similar behavior: a big change
in fsmark.app_overhead but little difference in fsmark.files_per_sec:
https://lore.kernel.org/all/202410072214.11d18a3c-oliver.sang@xxxxxxxxx/

Christian Loehle <christian.loehle@xxxxxxx> shared some information in
https://lore.kernel.org/all/01c3073b-84d0-4986-b6d5-a8877ae8a046@xxxxxxx/:

  "App overhead is time in microseconds spent in the test not doing file
  writing related system calls."

  So the loop is:

  /*
   * MAIN FILE WRITE LOOP:
   * This loop measures the specific steps in creating files:
   *      Step 1: Make up a file name
   *      Step 2: Creat(file_name);
   *      Step 3: write file data
   *      Step 4: fsync() file data (optional)
   *      Step 5: close() file descriptor
   */

  And it gets the timestamps before and after each syscall. It then
  subtracts all those times (spent in syscalls) from the total time.

Not sure whether this information is helpful to you?
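To illustrate how that accounting works, here is a minimal, self-contained
sketch (not fs_mark's actual code; the file count, buffer size, and the
usecs_between() helper are made up for illustration):

/* Illustrative sketch of fs_mark-style "app overhead" accounting;
 * not the actual fs_mark source. */
#include <stdio.h>
#include <fcntl.h>
#include <unistd.h>
#include <sys/time.h>

/* microseconds elapsed between two gettimeofday() samples */
static long long usecs_between(const struct timeval *a, const struct timeval *b)
{
        return (b->tv_sec - a->tv_sec) * 1000000LL + (b->tv_usec - a->tv_usec);
}

int main(void)
{
        char buf[4096] = { 0 };         /* 4K per file, as in this test */
        struct timeval loop_start, loop_end, t0, t1;
        long long syscall_us = 0, total_us, app_overhead_us;
        char name[64];
        int i, fd;

        gettimeofday(&loop_start, NULL);
        for (i = 0; i < 100; i++) {
                /* Step 1: make up a file name (charged to app overhead) */
                snprintf(name, sizeof(name), "testfile.%d", i);

                /* Step 2: creat(); time spent inside each syscall is
                 * accumulated separately and later subtracted */
                gettimeofday(&t0, NULL);
                fd = creat(name, 0644);
                gettimeofday(&t1, NULL);
                syscall_us += usecs_between(&t0, &t1);
                if (fd < 0) {
                        perror("creat");
                        return 1;
                }

                /* Step 3: write the file data */
                gettimeofday(&t0, NULL);
                if (write(fd, buf, sizeof(buf)) != (ssize_t)sizeof(buf))
                        perror("write");
                gettimeofday(&t1, NULL);
                syscall_us += usecs_between(&t0, &t1);

                /* Step 4: fsync before close (sync_method: fsyncBeforeClose) */
                gettimeofday(&t0, NULL);
                fsync(fd);
                gettimeofday(&t1, NULL);
                syscall_us += usecs_between(&t0, &t1);

                /* Step 5: close the file descriptor */
                gettimeofday(&t0, NULL);
                close(fd);
                gettimeofday(&t1, NULL);
                syscall_us += usecs_between(&t0, &t1);
        }
        gettimeofday(&loop_end, NULL);

        /* "app overhead" = total loop time minus time measured inside syscalls */
        total_us = usecs_between(&loop_start, &loop_end);
        app_overhead_us = total_us - syscall_us;
        printf("app overhead: %lld usec\n", app_overhead_us);
        return 0;
}

In this model, anything that adds time outside the timed syscall windows
(name generation, page faults, time spent descheduled between calls) lands
in app overhead, so fsmark.app_overhead could move a lot while
fsmark.files_per_sec stays essentially flat, as in the numbers above.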
> 
> Thanks,
> 
> >      53442           -17.3%      44172        fsmark.time.voluntary_context_switches
> >       2907            -5.7%       2742        perf-stat.i.context-switches
> >       2902            -5.7%       2737        perf-stat.ps.context-switches
> >    1724787            -1.0%    1706808        proc-vmstat.numa_hit
> >    1592345            -1.1%    1574310        proc-vmstat.numa_local
> >      24.87 ± 33%     -38.9%      15.20 ± 12%  sched_debug.cpu.nr_uninterruptible.max
> >       4.36 ±  9%     -17.1%       3.61 ± 10%  sched_debug.cpu.nr_uninterruptible.stddev
> >      97541 ± 20%     -32.7%      65610 ± 18%  numa-meminfo.node0.KReclaimable
> >      97541 ± 20%     -32.7%      65610 ± 18%  numa-meminfo.node0.SReclaimable
> >     256796 ±  9%     -18.7%     208805 ± 13%  numa-meminfo.node0.Slab
> >    2307911 ± 52%     +68.5%    3888971 ±  5%  numa-meminfo.node1.MemUsed
> >     193326 ± 12%     +24.7%     241049 ± 12%  numa-meminfo.node1.Slab
> >       0.90 ± 27%      -0.5        0.36 ±103%  perf-profile.calltrace.cycles-pp.evsel__read_counter.read_counters.process_interval.dispatch_events.cmd_stat
> >       0.36 ± 70%      +0.2        0.58 ±  3%  perf-profile.calltrace.cycles-pp.btrfs_commit_transaction.btrfs_sync_file.btrfs_do_write_iter.do_iter_readv_writev.vfs_iter_write
> >       0.52 ± 47%      +0.3        0.78 ±  8%  perf-profile.calltrace.cycles-pp.vfs_write.ksys_write.do_syscall_64.entry_SYSCALL_64_after_hwframe.write
> >       1.62 ± 12%      +0.3        1.93 ±  9%  perf-profile.calltrace.cycles-pp.__handle_mm_fault.handle_mm_fault.do_user_addr_fault.exc_page_fault.asm_exc_page_fault
> >       1.22 ± 21%      -0.3        0.89 ± 10%  perf-profile.children.cycles-pp.readn
> >       0.46 ± 32%      -0.2        0.24 ± 34%  perf-profile.children.cycles-pp.__close
> >       0.45 ± 32%      -0.2        0.22 ± 15%  perf-profile.children.cycles-pp.__x64_sys_close
> >       0.40 ± 29%      -0.2        0.18 ± 38%  perf-profile.children.cycles-pp.__fput
> >       0.31 ± 23%      -0.2        0.16 ± 33%  perf-profile.children.cycles-pp.irq_work_tick
> >       0.17 ± 51%      -0.1        0.03 ±111%  perf-profile.children.cycles-pp.nfs_file_release
> >       0.16 ± 43%      -0.1        0.03 ±111%  perf-profile.children.cycles-pp.__put_nfs_open_context
> >       0.26 ± 18%      -0.1        0.15 ± 34%  perf-profile.children.cycles-pp.perf_event_task_tick
> >       0.15 ± 41%      -0.1        0.03 ±108%  perf-profile.children.cycles-pp.get_free_pages_noprof
> >       0.18 ± 55%      -0.1        0.06 ± 32%  perf-profile.children.cycles-pp.native_apic_mem_eoi
> >       0.18 ± 32%      -0.1        0.07 ± 81%  perf-profile.children.cycles-pp.flush_end_io
> >       0.17 ± 41%      -0.1        0.07 ± 93%  perf-profile.children.cycles-pp.mas_store_gfp
> >       0.52 ±  5%      +0.1        0.58 ±  3%  perf-profile.children.cycles-pp.btrfs_commit_transaction
> >       0.02 ±141%      +0.1        0.08 ± 42%  perf-profile.children.cycles-pp.uptime_proc_show
> >       0.02 ±141%      +0.1        0.08 ± 44%  perf-profile.children.cycles-pp.get_zeroed_page_noprof
> >       0.02 ±141%      +0.1        0.09 ± 35%  perf-profile.children.cycles-pp.__rmqueue_pcplist
> >       0.14 ± 12%      +0.1        0.28 ± 29%  perf-profile.children.cycles-pp.hrtimer_next_event_without
> >       0.47 ± 27%      +0.2        0.67 ± 19%  perf-profile.children.cycles-pp.__mmap
> >       0.70 ± 21%      +0.2        0.91 ±  7%  perf-profile.children.cycles-pp.vfs_write
> >       0.74 ± 20%      +0.2        0.96 ±  9%  perf-profile.children.cycles-pp.ksys_write
> >       0.73 ± 21%      +0.3        1.00 ±  7%  perf-profile.children.cycles-pp.copy_process
> >       1.05 ± 13%      +0.3        1.38 ± 10%  perf-profile.children.cycles-pp.kernel_clone
> >       0.28 ± 22%      -0.1        0.13 ± 35%  perf-profile.self.cycles-pp.irq_work_tick
> >       0.18 ± 55%      -0.1        0.06 ± 32%  perf-profile.self.cycles-pp.native_apic_mem_eoi
> > 
> > 
> > 
> > Disclaimer:
> > Results have been estimated based on internal Intel analysis and are provided
> > for informational purposes only. Any difference in system hardware or software
> > design or configuration may affect actual performance.
> > 
> 
> -- 
> Jeff Layton <jlayton@xxxxxxxxxx>