[CC Mike and Mel as they have seen some accounting oddities when doing performance testing. They can share details but essentially the system time just gets too high] For your reference the email thread started http://lkml.kernel.org/r/20160823143330.GL23577@xxxxxxxxxxxxxx I suspect this is mainly for short lived processes - like kernel compile $ /usr/bin/time -v make mm/mmap.o [...] User time (seconds): 0.45 System time (seconds): 0.82 Percent of CPU this job got: 111% Elapsed (wall clock) time (h:mm:ss or m:ss): 0:01.14 $ rm mm/mmap.o $ /usr/bin/time -v make mm/mmap.o [...] User time (seconds): 0.47 System time (seconds): 1.55 Percent of CPU this job got: 107% Elapsed (wall clock) time (h:mm:ss or m:ss): 0:01.88 This is quite unexpected for a cache hot compile. I would expect most of the time being spent in userspace. $ perf report | grep kernel.vmlinux 2.01% as [kernel.vmlinux] [k] page_fault 0.59% cc1 [kernel.vmlinux] [k] page_fault 0.15% git [kernel.vmlinux] [k] page_fault 0.12% bash [kernel.vmlinux] [k] page_fault 0.11% sh [kernel.vmlinux] [k] page_fault 0.08% gcc [kernel.vmlinux] [k] page_fault 0.06% make [kernel.vmlinux] [k] page_fault 0.04% rm [kernel.vmlinux] [k] page_fault 0.03% ld [kernel.vmlinux] [k] page_fault 0.02% bash [kernel.vmlinux] [k] entry_SYSCALL_64 0.01% git [kernel.vmlinux] [k] entry_SYSCALL_64 0.01% cat [kernel.vmlinux] [k] page_fault 0.01% collect2 [kernel.vmlinux] [k] page_fault 0.00% sh [kernel.vmlinux] [k] entry_SYSCALL_64 0.00% rm [kernel.vmlinux] [k] entry_SYSCALL_64 0.00% grep [kernel.vmlinux] [k] page_fault doesn't show anything unexpected. Original Rik's reply follows: On Tue 23-08-16 17:46:11, Rik van Riel wrote: > On Tue, 2016-08-23 at 16:33 +0200, Michal Hocko wrote: [...] > > OK, so it seems I found it. I was quite lucky because > > account_user_time > > is not all that popular function and there were basically no changes > > besides Riks ff9a9b4c4334 ("sched, time: Switch > > VIRT_CPU_ACCOUNTING_GEN > > to jiffy granularity") and that seems to cause the regression. > > Reverting > > the commit on top of the current mmotm seems to fix the issue for me. > > > > And just to give Rik more context. While debugging overhead of the > > /proc/<pid>/smaps I am getting a misleading output from /usr/bin/time > > -v > > (source for ./max_mmap is [1]) > > > > root@test1:~# uname -r > > 4.5.0-rc6-bisect1-00025-gff9a9b4c4334 > > root@test1:~# ./max_map > > pid:2990 maps:65515 > > root@test1:~# /usr/bin/time -v awk '/^Rss/{rss+=$2} /^Pss/{pss+=$2} > > END {printf "rss:%d pss:%d\n", rss, pss}' /proc/2990/smaps > > rss:263368 pss:262203 > > Command being timed: "awk /^Rss/{rss+=$2} /^Pss/{pss+=$2} END > > {printf "rss:%d pss:%d\n", rss, pss} /proc/2990/smaps" > > User time (seconds): 0.00 > > System time (seconds): 0.45 > > Percent of CPU this job got: 98% > > > > > root@test1:~# /usr/bin/time -v awk '/^Rss/{rss+=$2} /^Pss/{pss+=$2} > > END {printf "rss:%d pss:%d\n", rss, pss}' /proc/3015/smaps > > rss:263316 pss:262199 > > Command being timed: "awk /^Rss/{rss+=$2} /^Pss/{pss+=$2} END > > {printf "rss:%d pss:%d\n", rss, pss} /proc/3015/smaps" > > User time (seconds): 0.18 > > System time (seconds): 0.29 > > Percent of CPU this job got: 97% > > The patch in question makes user and system > time accounting essentially tick-based. If > jiffies changes while the task is in user > mode, time gets accounted as user time, if > jiffies changes while the task is in system > mode, time gets accounted as system time. > > If you get "unlucky", with a job like the > above, it is possible all time gets accounted > to system time. > > This would be true both with the system running > with a periodic timer tick (before and after my > patch is applied), and in nohz_idle mode (after > my patch). > > However, it does seem quite unlikely that you > get zero user time, since you have 125 timer > ticks in half a second. Furthermore, you do not > even have NO_HZ_FULL enabled... > > Does the workload consistently get zero user > time? > > If so, we need to dig further to see under > what precise circumstances that happens. > > On my laptop, with kernel 4.6.3-300.fc24.x86_64 > I get this: > > $ /usr/bin/time -v awk '/^Rss/{rss+=$2} /^Pss/{pss+=$2} END {printf > "rss:%d pss:%d\n", rss, pss}' /proc/19825/smaps > rss:263368 pss:262145 > Command being timed: "awk /^Rss/{rss+=$2} /^Pss/{pss+=$2} END > {printf "rss:%d pss:%d\n", rss, pss} /proc/19825/smaps" > User time (seconds): 0.64 > System time (seconds): 0.19 > Percent of CPU this job got: 99% > Elapsed (wall clock) time (h:mm:ss or m:ss): 0:00.83 > > The main difference between your and my > NO_HZ config seems to be that NO_HZ_FULL > is set here. However, it is not enabled > at run time, so both of our systems > should only really get NO_HZ_IDLE > effectively. > > Running tasks should get sampled with the > regular timer tick, while they are running. > > In other words, vtime accounting should be > disabled in both of our tests, for everything > except the idle task. > > Do I need to do anything special to reproduce > your bug, besides running the max mmap program > and the awk script? > > -- Michal Hocko SUSE Labs -- To unsubscribe from this list: send the line "unsubscribe linux-doc" in the body of a message to majordomo@xxxxxxxxxxxxxxx More majordomo info at http://vger.kernel.org/majordomo-info.html