On Mon, 2024-09-16 at 16:41 +0800, kernel test robot wrote:
> Hello,
> 
> kernel test robot noticed a 81.6% regression of fsmark.app_overhead on:
> 
> commit: 8cb33389f66441dc4e54b28fe0d9bd4bcd9b796d ("nfsd: implement OPEN_ARGS_SHARE_ACCESS_WANT_OPEN_XOR_DELEGATION")
> https://git.kernel.org/cgit/linux/kernel/git/next/linux-next.git master
> 
> testcase: fsmark
> test machine: 128 threads 2 sockets Intel(R) Xeon(R) Platinum 8358 CPU @ 2.60GHz (Ice Lake) with 128G memory
> parameters:
> 
> 	iterations: 1x
> 	nr_threads: 1t
> 	disk: 1HDD
> 	fs: btrfs
> 	fs2: nfsv4
> 	filesize: 4K
> 	test_size: 40M
> 	sync_method: fsyncBeforeClose
> 	nr_files_per_directory: 1fpd
> 	cpufreq_governor: performance
> 
> 
> If you fix the issue in a separate patch/commit (i.e. not just a new version of
> the same patch/commit), kindly add following tags
> 
> Reported-by: kernel test robot <oliver.sang@xxxxxxxxx>
> Closes: https://lore.kernel.org/oe-lkp/202409161645.d44bced5-oliver.sang@xxxxxxxxx
> 
> 
> Details are as below:
> -------------------------------------------------------------------------------------------------->
> 
> 
> The kernel config and materials to reproduce are available at:
> https://download.01.org/0day-ci/archive/20240916/202409161645.d44bced5-oliver.sang@xxxxxxxxx
> 
> =========================================================================================
> compiler/cpufreq_governor/disk/filesize/fs2/fs/iterations/kconfig/nr_files_per_directory/nr_threads/rootfs/sync_method/tbox_group/test_size/testcase:
>   gcc-12/performance/1HDD/4K/nfsv4/btrfs/1x/x86_64-rhel-8.3/1fpd/1t/debian-12-x86_64-20240206.cgz/fsyncBeforeClose/lkp-icl-2sp6/40M/fsmark
> 
> commit:
>   e29c78a693 ("nfsd: add support for FATTR4_OPEN_ARGUMENTS")
>   8cb33389f6 ("nfsd: implement OPEN_ARGS_SHARE_ACCESS_WANT_OPEN_XOR_DELEGATION")
> 
> e29c78a6936e7422 8cb33389f66441dc4e54b28fe0d
> ---------------- ---------------------------
>          %stddev     %change         %stddev
>              \          |                \
>      24388 ± 20%     -32.8%      16400 ± 18%  numa-vmstat.node0.nr_slab_reclaimable
>      61.50 ±  4%     -10.6%      55.00 ±  6%  perf-c2c.HITM.local
>       0.20 ±  3%     +23.0%       0.24 ± 13%  perf-sched.sch_delay.max.ms.__cond_resched.__wait_for_common.affine_move_task.__set_cpus_allowed_ptr.__sched_setaffinity
>       2977            -6.1%       2796        vmstat.system.cs
>    2132466 ±  2%     +81.6%    3871852        fsmark.app_overhead

I have been unable to reproduce this result with fs_mark. I've run a
number of repeated tests, and I can create files just as fast with or
without this patch (roughly 46 files/s on my test machine).

I'm particularly suspicious of the fsmark.app_overhead value above. The
fsmark output says:

# App overhead is time in microseconds spent in the test not doing file writing related system calls.

That seems outside the purview of anything we're altering here, so I
have to wonder if something else is going on.

Oliver, can you rerun this test and see if this regression is
reproducible?
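To spell out how I'm reading that "app overhead" definition, here's a
minimal, purely illustrative sketch (not fs_mark's actual code): time
the whole create/write/fsync/close loop, time just the file-writing
syscalls, and call the difference "app overhead" in microseconds. The
file names and the file count below are made up for the example; only
the 4K size and fsync-before-close come from the test parameters above.

/*
 * Purely illustrative sketch -- not fs_mark's actual code. "App
 * overhead" here is total wall-clock time for the run minus the time
 * spent inside the file-writing syscalls, reported in microseconds.
 */
#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <time.h>
#include <unistd.h>

#define NFILES		100	/* made-up count for the example */
#define FILESIZE	4096	/* 4K, as in the test parameters */

static long long delta_us(const struct timespec *a, const struct timespec *b)
{
	return (b->tv_sec - a->tv_sec) * 1000000LL +
	       (b->tv_nsec - a->tv_nsec) / 1000;
}

int main(void)
{
	struct timespec run_start, run_end, t0, t1;
	long long io_us = 0;
	char buf[FILESIZE];
	char name[64];
	int i, fd;

	memset(buf, 'x', sizeof(buf));
	clock_gettime(CLOCK_MONOTONIC, &run_start);

	for (i = 0; i < NFILES; i++) {
		/* name generation, bookkeeping, etc. count as overhead */
		snprintf(name, sizeof(name), "file.%d", i);

		/* only the file-writing syscalls are timed separately */
		clock_gettime(CLOCK_MONOTONIC, &t0);
		fd = open(name, O_CREAT | O_WRONLY | O_TRUNC, 0644);
		if (fd < 0) {
			perror("open");
			exit(1);
		}
		if (write(fd, buf, sizeof(buf)) < 0)
			perror("write");
		fsync(fd);	/* sync_method: fsyncBeforeClose */
		close(fd);
		clock_gettime(CLOCK_MONOTONIC, &t1);
		io_us += delta_us(&t0, &t1);
	}

	clock_gettime(CLOCK_MONOTONIC, &run_end);
	printf("app overhead: %lld usec\n",
	       delta_us(&run_start, &run_end) - io_us);
	return 0;
}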
Thanks,

>      53442           -17.3%      44172        fsmark.time.voluntary_context_switches
>       2907            -5.7%       2742        perf-stat.i.context-switches
>       2902            -5.7%       2737        perf-stat.ps.context-switches
>    1724787            -1.0%    1706808        proc-vmstat.numa_hit
>    1592345            -1.1%    1574310        proc-vmstat.numa_local
>      24.87 ± 33%     -38.9%      15.20 ± 12%  sched_debug.cpu.nr_uninterruptible.max
>       4.36 ±  9%     -17.1%       3.61 ± 10%  sched_debug.cpu.nr_uninterruptible.stddev
>      97541 ± 20%     -32.7%      65610 ± 18%  numa-meminfo.node0.KReclaimable
>      97541 ± 20%     -32.7%      65610 ± 18%  numa-meminfo.node0.SReclaimable
>     256796 ±  9%     -18.7%     208805 ± 13%  numa-meminfo.node0.Slab
>    2307911 ± 52%     +68.5%    3888971 ±  5%  numa-meminfo.node1.MemUsed
>     193326 ± 12%     +24.7%     241049 ± 12%  numa-meminfo.node1.Slab
>       0.90 ± 27%      -0.5        0.36 ±103%  perf-profile.calltrace.cycles-pp.evsel__read_counter.read_counters.process_interval.dispatch_events.cmd_stat
>       0.36 ± 70%      +0.2        0.58 ±  3%  perf-profile.calltrace.cycles-pp.btrfs_commit_transaction.btrfs_sync_file.btrfs_do_write_iter.do_iter_readv_writev.vfs_iter_write
>       0.52 ± 47%      +0.3        0.78 ±  8%  perf-profile.calltrace.cycles-pp.vfs_write.ksys_write.do_syscall_64.entry_SYSCALL_64_after_hwframe.write
>       1.62 ± 12%      +0.3        1.93 ±  9%  perf-profile.calltrace.cycles-pp.__handle_mm_fault.handle_mm_fault.do_user_addr_fault.exc_page_fault.asm_exc_page_fault
>       1.22 ± 21%      -0.3        0.89 ± 10%  perf-profile.children.cycles-pp.readn
>       0.46 ± 32%      -0.2        0.24 ± 34%  perf-profile.children.cycles-pp.__close
>       0.45 ± 32%      -0.2        0.22 ± 15%  perf-profile.children.cycles-pp.__x64_sys_close
>       0.40 ± 29%      -0.2        0.18 ± 38%  perf-profile.children.cycles-pp.__fput
>       0.31 ± 23%      -0.2        0.16 ± 33%  perf-profile.children.cycles-pp.irq_work_tick
>       0.17 ± 51%      -0.1        0.03 ±111%  perf-profile.children.cycles-pp.nfs_file_release
>       0.16 ± 43%      -0.1        0.03 ±111%  perf-profile.children.cycles-pp.__put_nfs_open_context
>       0.26 ± 18%      -0.1        0.15 ± 34%  perf-profile.children.cycles-pp.perf_event_task_tick
>       0.15 ± 41%      -0.1        0.03 ±108%  perf-profile.children.cycles-pp.get_free_pages_noprof
>       0.18 ± 55%      -0.1        0.06 ± 32%  perf-profile.children.cycles-pp.native_apic_mem_eoi
>       0.18 ± 32%      -0.1        0.07 ± 81%  perf-profile.children.cycles-pp.flush_end_io
>       0.17 ± 41%      -0.1        0.07 ± 93%  perf-profile.children.cycles-pp.mas_store_gfp
>       0.52 ±  5%      +0.1        0.58 ±  3%  perf-profile.children.cycles-pp.btrfs_commit_transaction
>       0.02 ±141%      +0.1        0.08 ± 42%  perf-profile.children.cycles-pp.uptime_proc_show
>       0.02 ±141%      +0.1        0.08 ± 44%  perf-profile.children.cycles-pp.get_zeroed_page_noprof
>       0.02 ±141%      +0.1        0.09 ± 35%  perf-profile.children.cycles-pp.__rmqueue_pcplist
>       0.14 ± 12%      +0.1        0.28 ± 29%  perf-profile.children.cycles-pp.hrtimer_next_event_without
>       0.47 ± 27%      +0.2        0.67 ± 19%  perf-profile.children.cycles-pp.__mmap
>       0.70 ± 21%      +0.2        0.91 ±  7%  perf-profile.children.cycles-pp.vfs_write
>       0.74 ± 20%      +0.2        0.96 ±  9%  perf-profile.children.cycles-pp.ksys_write
>       0.73 ± 21%      +0.3        1.00 ±  7%  perf-profile.children.cycles-pp.copy_process
>       1.05 ± 13%      +0.3        1.38 ± 10%  perf-profile.children.cycles-pp.kernel_clone
>       0.28 ± 22%      -0.1        0.13 ± 35%  perf-profile.self.cycles-pp.irq_work_tick
>       0.18 ± 55%      -0.1        0.06 ± 32%  perf-profile.self.cycles-pp.native_apic_mem_eoi
> 
> 
> Disclaimer:
> Results have been estimated based on internal Intel analysis and are provided
> for informational purposes only. Any difference in system hardware or software
> design or configuration may affect actual performance.

-- 
Jeff Layton <jlayton@xxxxxxxxxx>