Re: utime accounting regression since 4.6 (was: Re: [PACTH v2 0/3] Implement /proc/<pid>/totmaps)

Michal Hocko <mhocko@xxxxxxxxxx> · Fri, 30 Sep 2016 11:49:05 +0200

[CC Mike and Mel as they have seen some accounting oddities
 when doing performance testing. They can share details but
 essentially the system time just gets too high]

For your reference the email thread started
http://lkml.kernel.org/r/20160823143330.GL23577@xxxxxxxxxxxxxx

I suspect this is mainly for short lived processes - like kernel compile
$ /usr/bin/time -v make mm/mmap.o
[...]
        User time (seconds): 0.45
        System time (seconds): 0.82
        Percent of CPU this job got: 111%
        Elapsed (wall clock) time (h:mm:ss or m:ss): 0:01.14
$ rm mm/mmap.o
$ /usr/bin/time -v make mm/mmap.o
[...]
        User time (seconds): 0.47
        System time (seconds): 1.55
        Percent of CPU this job got: 107%
        Elapsed (wall clock) time (h:mm:ss or m:ss): 0:01.88

This is quite unexpected for a cache hot compile. I would expect most of
the time being spent in userspace.

$ perf report | grep kernel.vmlinux
     2.01%  as        [kernel.vmlinux]         [k] page_fault
     0.59%  cc1       [kernel.vmlinux]         [k] page_fault
     0.15%  git       [kernel.vmlinux]         [k] page_fault
     0.12%  bash      [kernel.vmlinux]         [k] page_fault
     0.11%  sh        [kernel.vmlinux]         [k] page_fault
     0.08%  gcc       [kernel.vmlinux]         [k] page_fault
     0.06%  make      [kernel.vmlinux]         [k] page_fault
     0.04%  rm        [kernel.vmlinux]         [k] page_fault
     0.03%  ld        [kernel.vmlinux]         [k] page_fault
     0.02%  bash      [kernel.vmlinux]         [k] entry_SYSCALL_64
     0.01%  git       [kernel.vmlinux]         [k] entry_SYSCALL_64
     0.01%  cat       [kernel.vmlinux]         [k] page_fault
     0.01%  collect2  [kernel.vmlinux]         [k] page_fault
     0.00%  sh        [kernel.vmlinux]         [k] entry_SYSCALL_64
     0.00%  rm        [kernel.vmlinux]         [k] entry_SYSCALL_64
     0.00%  grep      [kernel.vmlinux]         [k] page_fault

doesn't show anything unexpected.

Original Rik's reply follows:

On Tue 23-08-16 17:46:11, Rik van Riel wrote:
> On Tue, 2016-08-23 at 16:33 +0200, Michal Hocko wrote:
[...]
> > OK, so it seems I found it. I was quite lucky because
> > account_user_time
> > is not all that popular function and there were basically no changes
> > besides Riks ff9a9b4c4334 ("sched, time: Switch
> > VIRT_CPU_ACCOUNTING_GEN
> > to jiffy granularity") and that seems to cause the regression.
> > Reverting
> > the commit on top of the current mmotm seems to fix the issue for me.
> > 
> > And just to give Rik more context. While debugging overhead of the
> > /proc/<pid>/smaps I am getting a misleading output from /usr/bin/time
> > -v
> > (source for ./max_mmap is [1])
> > 
> > root@test1:~# uname -r
> > 4.5.0-rc6-bisect1-00025-gff9a9b4c4334
> > root@test1:~# ./max_map 
> > pid:2990 maps:65515
> > root@test1:~# /usr/bin/time -v awk '/^Rss/{rss+=$2} /^Pss/{pss+=$2}
> > END {printf "rss:%d pss:%d\n", rss, pss}' /proc/2990/smaps
> > rss:263368 pss:262203
> >         Command being timed: "awk /^Rss/{rss+=$2} /^Pss/{pss+=$2} END
> > {printf "rss:%d pss:%d\n", rss, pss} /proc/2990/smaps"
> >         User time (seconds): 0.00
> >         System time (seconds): 0.45
> >         Percent of CPU this job got: 98%
> > 
> 
> > root@test1:~# /usr/bin/time -v awk '/^Rss/{rss+=$2} /^Pss/{pss+=$2}
> > END {printf "rss:%d pss:%d\n", rss, pss}' /proc/3015/smaps
> > rss:263316 pss:262199
> >         Command being timed: "awk /^Rss/{rss+=$2} /^Pss/{pss+=$2} END
> > {printf "rss:%d pss:%d\n", rss, pss} /proc/3015/smaps"
> >         User time (seconds): 0.18
> >         System time (seconds): 0.29
> >         Percent of CPU this job got: 97%
> 
> The patch in question makes user and system
> time accounting essentially tick-based. If
> jiffies changes while the task is in user
> mode, time gets accounted as user time, if
> jiffies changes while the task is in system
> mode, time gets accounted as system time.
> 
> If you get "unlucky", with a job like the
> above, it is possible all time gets accounted
> to system time.
> 
> This would be true both with the system running
> with a periodic timer tick (before and after my
> patch is applied), and in nohz_idle mode (after
> my patch).
> 
> However, it does seem quite unlikely that you
> get zero user time, since you have 125 timer
> ticks in half a second. Furthermore, you do not
> even have NO_HZ_FULL enabled...
> 
> Does the workload consistently get zero user
> time?
> 
> If so, we need to dig further to see under
> what precise circumstances that happens.
> 
> On my laptop, with kernel 4.6.3-300.fc24.x86_64
> I get this:
> 
> $ /usr/bin/time -v awk '/^Rss/{rss+=$2} /^Pss/{pss+=$2} END {printf
> "rss:%d pss:%d\n", rss, pss}' /proc/19825/smaps
> rss:263368 pss:262145
> 	Command being timed: "awk /^Rss/{rss+=$2} /^Pss/{pss+=$2} END
> {printf "rss:%d pss:%d\n", rss, pss} /proc/19825/smaps"
> 	User time (seconds): 0.64
> 	System time (seconds): 0.19
> 	Percent of CPU this job got: 99%
> 	Elapsed (wall clock) time (h:mm:ss or m:ss): 0:00.83
> 
> The main difference between your and my
> NO_HZ config seems to be that NO_HZ_FULL
> is set here. However, it is not enabled
> at run time, so both of our systems
> should only really get NO_HZ_IDLE
> effectively.
> 
> Running tasks should get sampled with the
> regular timer tick, while they are running.
> 
> In other words, vtime accounting should be
> disabled in both of our tests, for everything
> except the idle task.
> 
> Do I need to do anything special to reproduce
> your bug, besides running the max mmap program
> and the awk script?
> 
> 

-- 
Michal Hocko
SUSE Labs
--
To unsubscribe from this list: send the line "unsubscribe linux-doc" in
the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at  http://vger.kernel.org/majordomo-info.html