Re: [cel:nfsd-testing] [nfsd] ab62726202: fsmark.app_overhead 186.4% regression

On Sat, 2024-12-28 at 12:32 -0500, Chuck Lever wrote:
> On 12/27/24 4:13 AM, kernel test robot wrote:
> > 
> > 
> > Hello,
> > 
> > kernel test robot noticed a 186.4% regression of fsmark.app_overhead on:
> > (but no diff for fsmark.files_per_sec as below (a))
> 
> Hello Oliver -
> 
> I have questions about this test result.
> 
> Is this https://github.com/josefbacik/fs_mark ?
> 
> I don't understand what "app_overhead" is measuring. Is this "think
> time"?
> 
> A more concerning regression might be:
> 
>         13.03 ±170%    +566.0%      86.78 ± 77%  perf-sched.wait_and_delay.avg.ms.schedule_preempt_disabled.__mutex_lock.constprop.0.svc_tcp_sendto
> 
> But these metrics look like they improved:
> 
>          0.03 ± 56%     -73.4%       0.01 ±149%  perf-sched.sch_delay.avg.ms.btrfs_commit_transaction.btrfs_sync_file.nfsd_commit.nfsd4_commit
>          0.05 ± 60%     -72.1%       0.02 ±165%  perf-sched.sch_delay.max.ms.btrfs_commit_transaction.btrfs_sync_file.nfsd_commit.nfsd4_commit
> 
> This is quite a mixed result, IMO -- I'm not convinced it's actionable.
> Can someone help explain/analyze the metrics?
> 

I've spent some time staring at fs_mark [1] because of the similar
regression reported against the OPEN_XOR_DELEGATION patches.

It forks a number of processes, each of which does a series of
create/write/fsync operations, etc. in a directory hierarchy, based on
the command-line options.

"App overhead" in that program consists of this:

        total_file_ops =
            creat_usec + total_write_usec + fsync_usec + avg_sync_usec +
            close_usec;
        app_overhead_usec = loop_usecs - total_file_ops;

IOW, it times the various calls into the kernel and tallies how long
they took. It then subtracts that total from the overall time spent in
the loop, and the remainder is the "App overhead".
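
To make that concrete, here is a minimal sketch of the timing pattern
(simplified and paraphrased, not the actual fs_mark source;
tv_delta_usec() and the single-file flow are mine, for illustration).
Only the bracketed intervals count as "file ops"; everything else in
the loop becomes "App overhead":

        #include <fcntl.h>
        #include <stdio.h>
        #include <sys/time.h>
        #include <unistd.h>

        /* Microseconds elapsed between two gettimeofday() samples. */
        static unsigned long long
        tv_delta_usec(const struct timeval *a, const struct timeval *b)
        {
                return (b->tv_sec - a->tv_sec) * 1000000ULL +
                       (b->tv_usec - a->tv_usec);
        }

        int main(void)
        {
                struct timeval loop_start, loop_stop, start, stop;
                unsigned long long creat_usec = 0, total_write_usec = 0;
                unsigned long long fsync_usec = 0, close_usec = 0;
                char buf[4096] = { 0 };
                int fd;

                gettimeofday(&loop_start, NULL);

                /* Each file op is bracketed with gettimeofday(). */
                gettimeofday(&start, NULL);
                fd = open("testfile", O_CREAT | O_RDWR | O_TRUNC, 0644);
                gettimeofday(&stop, NULL);
                creat_usec += tv_delta_usec(&start, &stop);

                gettimeofday(&start, NULL);
                write(fd, buf, sizeof(buf));
                gettimeofday(&stop, NULL);
                total_write_usec += tv_delta_usec(&start, &stop);

                gettimeofday(&start, NULL);
                fsync(fd);
                gettimeofday(&stop, NULL);
                fsync_usec += tv_delta_usec(&start, &stop);

                gettimeofday(&start, NULL);
                close(fd);
                gettimeofday(&stop, NULL);
                close_usec += tv_delta_usec(&start, &stop);

                gettimeofday(&loop_stop, NULL);

                /* Whatever the loop spent outside the brackets is
                 * "App overhead". */
                unsigned long long loop_usecs =
                        tv_delta_usec(&loop_start, &loop_stop);
                unsigned long long total_file_ops = creat_usec +
                        total_write_usec + fsync_usec + close_usec;

                printf("app overhead: %llu usec\n",
                       loop_usecs - total_file_ops);
                return 0;
        }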

The problem I ran into was that the "App overhead" covers quite a bit
of code (some of which makes syscalls of its own!). My attempt at
bracketing regions of that code to find where the increase came from
didn't give me reliable results (a sketch of the sort of
instrumentation I was adding is below). I meant to get back to it, but
haven't had the time recently.
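
For anyone who wants to pick this up, the bracketing amounts to
hypothetical extra instrumentation (not something fs_mark does today,
and the region names here are made up), reusing tv_delta_usec() from
the sketch above:

        /*
         * Subdivide the code that currently lands in "App overhead"
         * with extra timestamps, so an increase can be attributed to
         * a specific region of the per-file loop.
         */
        struct timeval t0, t1, t2;
        unsigned long long name_gen_usec = 0, bookkeeping_usec = 0;

        gettimeofday(&t0, NULL);
        /* ... filename generation / directory selection ... */
        gettimeofday(&t1, NULL);
        /* ... per-file stat collection and logging ... */
        gettimeofday(&t2, NULL);

        name_gen_usec += tv_delta_usec(&t0, &t1);
        bookkeeping_usec += tv_delta_usec(&t1, &t2);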

[1]: https://github.com/josefbacik/fs_mark

> 
> > commit: ab627262022ed8c6a68e619ed03a14e47acf2e39 ("nfsd: allocate new session-based DRC slots on demand.")
> > https://git.kernel.org/cgit/linux/kernel/git/cel/linux nfsd-testing
> > 
> > testcase: fsmark
> > config: x86_64-rhel-9.4
> > compiler: gcc-12
> > test machine: 64 threads 2 sockets Intel(R) Xeon(R) Gold 6346 CPU @ 3.10GHz (Ice Lake) with 256G memory
> > parameters:
> > 
> > 	iterations: 1x
> > 	nr_threads: 32t
> > 	disk: 1HDD
> > 	fs: btrfs
> > 	fs2: nfsv4
> > 	filesize: 16MB
> > 	test_size: 15G
> > 	sync_method: NoSync
> > 	nr_directories: 16d
> > 	nr_files_per_directory: 256fpd
> > 	cpufreq_governor: performance
> > 
> > 
> > 
> > 
> > If you fix the issue in a separate patch/commit (i.e. not just a new version of
> > the same patch/commit), kindly add the following tags
> > > Reported-by: kernel test robot <oliver.sang@xxxxxxxxx>
> > > Closes: https://lore.kernel.org/oe-lkp/202412271641.cfba5666-lkp@xxxxxxxxx
> > 
> > 
> > Details are as below:
> > -------------------------------------------------------------------------------------------------->
> > 
> > 
> > The kernel config and materials to reproduce are available at:
> > https://download.01.org/0day-ci/archive/20241227/202412271641.cfba5666-lkp@xxxxxxxxx
> > 
> > =========================================================================================
> > compiler/cpufreq_governor/disk/filesize/fs2/fs/iterations/kconfig/nr_directories/nr_files_per_directory/nr_threads/rootfs/sync_method/tbox_group/test_size/testcase:
> >    gcc-12/performance/1HDD/16MB/nfsv4/btrfs/1x/x86_64-rhel-9.4/16d/256fpd/32t/debian-12-x86_64-20240206.cgz/NoSync/lkp-icl-2sp7/15G/fsmark
> > 
> > commit:
> >    ccd01c7601 ("nfsd: add session slot count to /proc/fs/nfsd/clients/*/info")
> >    ab62726202 ("nfsd: allocate new session-based DRC slots on demand.")
> > 
> > ccd01c76017847d1 ab627262022ed8c6a68e619ed03
> > ---------------- ---------------------------
> >           %stddev     %change         %stddev
> >               \          |                \
> >        5.48 ±  9%     +24.9%       6.85 ± 14%  sched_debug.cpu.nr_uninterruptible.stddev
> >       12489           +11.1%      13876        uptime.idle
> >   3.393e+08 ± 16%    +186.4%  9.717e+08 ±  9%  fsmark.app_overhead
> >        6.40            +0.0%       6.40        fsmark.files_per_sec     <-------- (a)
> >        6.00           +27.8%       7.67 ±  6%  fsmark.time.percent_of_cpu_this_job_got
> >       72.33           +15.8%      83.79        iostat.cpu.idle
> >       25.91 ±  3%     -44.3%      14.42 ± 11%  iostat.cpu.iowait
> >       72.08           +11.6       83.64        mpstat.cpu.all.idle%
> >       26.18 ±  3%     -11.6       14.58 ± 11%  mpstat.cpu.all.iowait%
> >      153772 ±  5%     +19.1%     183126 ±  8%  meminfo.DirectMap4k
> >      156099           +19.5%     186594        meminfo.Dirty
> >      467358           -12.9%     406910 ±  2%  meminfo.Writeback
> >       72.35           +15.8%      83.79        vmstat.cpu.id
> >       25.90 ±  3%     -44.3%      14.41 ± 11%  vmstat.cpu.wa
> >       17.61 ±  3%     -45.8%       9.55 ± 10%  vmstat.procs.b
> >        5909 ±  2%      -6.2%       5545        vmstat.system.in
> >        0.03 ± 56%     -73.4%       0.01 ±149%  perf-sched.sch_delay.avg.ms.btrfs_commit_transaction.btrfs_sync_file.nfsd_commit.nfsd4_commit
> >        0.05 ± 60%     -72.1%       0.02 ±165%  perf-sched.sch_delay.max.ms.btrfs_commit_transaction.btrfs_sync_file.nfsd_commit.nfsd4_commit
> >        0.07 ± 41%     +36.1%       0.10 ±  8%  perf-sched.sch_delay.max.ms.schedule_preempt_disabled.rwsem_down_read_slowpath.down_read.btrfs_tree_read_lock_nested
> >       13.03 ±170%    +566.0%      86.78 ± 77%  perf-sched.wait_and_delay.avg.ms.schedule_preempt_disabled.__mutex_lock.constprop.0.svc_tcp_sendto
> >      206.83 ± 14%     -31.5%     141.67 ±  6%  perf-sched.wait_and_delay.count.rpc_wait_bit_killable.__wait_on_bit.out_of_line_wait_on_bit.__rpc_execute
> >        0.30 ± 62%     -82.1%       0.05 ±110%  perf-sched.wait_time.avg.ms.handle_reserve_ticket.__reserve_bytes.btrfs_reserve_data_bytes.btrfs_check_data_free_space
> >        7.37 ±  4%     -15.8%       6.20 ±  4%  perf-stat.i.MPKI
> >       44.13 ±  2%      -2.9       41.25 ±  2%  perf-stat.i.cache-miss-rate%
> >      103.65 ±  2%     +17.9%     122.17 ±  8%  perf-stat.i.cpu-migrations
> >      627.67 ±  3%     +25.4%     787.18 ±  6%  perf-stat.i.cycles-between-cache-misses
> >        0.67            +3.7%       0.70        perf-stat.i.ipc
> >        1.35            +2.2%       1.38        perf-stat.overall.cpi
> >      373.39            +4.1%     388.79        perf-stat.overall.cycles-between-cache-misses
> >        0.74            -2.1%       0.73        perf-stat.overall.ipc
> >      102.89 ±  2%     +17.9%     121.32 ±  8%  perf-stat.ps.cpu-migrations
> >       39054           +19.0%      46460 ±  2%  proc-vmstat.nr_dirty
> >       15139            +2.2%      15476        proc-vmstat.nr_kernel_stack
> >       45710            +1.9%      46570        proc-vmstat.nr_slab_unreclaimable
> >      116900           -13.5%     101162        proc-vmstat.nr_writeback
> >       87038           -18.2%      71185 ±  2%  proc-vmstat.nr_zone_write_pending
> >     6949807            -3.8%    6688660        proc-vmstat.numa_hit
> >     6882153            -3.8%    6622312        proc-vmstat.numa_local
> >    13471776            -2.0%   13204489        proc-vmstat.pgalloc_normal
> >      584292            +3.8%     606391 ±  3%  proc-vmstat.pgfault
> >       25859            +9.8%      28392 ±  9%  proc-vmstat.pgreuse
> >        2.02 ±  8%      -0.3        1.71 ±  5%  perf-profile.calltrace.cycles-pp.asm_sysvec_apic_timer_interrupt.cpuidle_enter_state.cpuidle_enter.cpuidle_idle_call.do_idle
> >        1.86 ±  8%      -0.3        1.58 ±  6%  perf-profile.calltrace.cycles-pp.sysvec_apic_timer_interrupt.asm_sysvec_apic_timer_interrupt.cpuidle_enter_state.cpuidle_enter.cpuidle_idle_call
> >        3.42 ±  5%      -0.6        2.87 ±  5%  perf-profile.children.cycles-pp.asm_sysvec_apic_timer_interrupt
> >        2.96 ±  4%      -0.4        2.55 ±  5%  perf-profile.children.cycles-pp.sysvec_apic_timer_interrupt
> >        0.35 ± 45%      -0.2        0.14 ± 71%  perf-profile.children.cycles-pp.khugepaged
> >        0.34 ± 46%      -0.2        0.14 ± 71%  perf-profile.children.cycles-pp.hpage_collapse_scan_pmd
> >        0.34 ± 46%      -0.2        0.14 ± 71%  perf-profile.children.cycles-pp.khugepaged_scan_mm_slot
> >        0.34 ± 47%      -0.2        0.14 ± 72%  perf-profile.children.cycles-pp.collapse_huge_page
> >        1.21 ± 10%      -0.2        1.01 ±  8%  perf-profile.children.cycles-pp.__hrtimer_run_queues
> >        0.82 ±  9%      -0.1        0.68 ± 10%  perf-profile.children.cycles-pp.update_process_times
> >        0.41 ±  8%      -0.1        0.29 ± 22%  perf-profile.children.cycles-pp.btrfs_check_data_free_space
> >        0.21 ±  7%      -0.1        0.11 ± 73%  perf-profile.children.cycles-pp.copy_mc_enhanced_fast_string
> >        0.55 ± 11%      -0.1        0.46 ± 14%  perf-profile.children.cycles-pp.__set_extent_bit
> >        0.33 ±  9%      -0.1        0.28 ±  8%  perf-profile.children.cycles-pp.nfs_request_add_commit_list
> >        0.17 ±  9%      -0.0        0.13 ± 16%  perf-profile.children.cycles-pp.readn
> >        0.08 ± 13%      -0.0        0.06 ± 14%  perf-profile.children.cycles-pp.load_elf_interp
> >        1.00 ± 16%      +1.2        2.18 ± 53%  perf-profile.children.cycles-pp.folio_batch_move_lru
> >        0.21 ±  8%      -0.1        0.11 ± 73%  perf-profile.self.cycles-pp.copy_mc_enhanced_fast_string
> >        0.05 ± 49%      +0.1        0.15 ± 61%  perf-profile.self.cycles-pp.nfs_update_folio
> >        0.94 ±  5%      +0.2        1.11 ±  4%  perf-profile.self.cycles-pp._raw_spin_lock_irqsave
> >        0.25 ± 17%      +0.4        0.63 ± 61%  perf-profile.self.cycles-pp.nfs_page_async_flush
> > 
> > 
> > 
> > 
> > Disclaimer:
> > Results have been estimated based on internal Intel analysis and are provided
> > for informational purposes only. Any difference in system hardware or software
> > design or configuration may affect actual performance.
> > 
> > 
> 
> 

-- 
Jeff Layton <jlayton@xxxxxxxxxx>
