On 7/23/18 5:21 PM, Peter Zijlstra wrote:
> On Tue, Jul 17, 2018 at 12:08:36PM +0800, Xunlei Pang wrote:
>> The trace data corresponds to the last sample period:
>> trace entry 1:
>>          cat-20755 [022] d... 1370.106496: cputime_adjust: task
>> tick-based utime 362560000000 stime 2551000000, scheduler rtime 333060702626
>>          cat-20755 [022] d... 1370.106497: cputime_adjust: result:
>> old utime 330729718142 stime 2306983867, new utime 330733635372 stime
>> 2327067254
>>
>> trace entry 2:
>>          cat-20773 [005] d... 1371.109825: cputime_adjust: task
>> tick-based utime 362567000000 stime 3547000000, scheduler rtime 334063718912
>>          cat-20773 [005] d... 1371.109826: cputime_adjust: result:
>> old utime 330733635372 stime 2327067254, new utime 330827229702 stime
>> 3236489210
>>
>> 1) expected behaviour
>> Let's compare the last two trace entries (all the data below is in ns):
>> task tick-based utime: 362560000000->362567000000, increased 7000000
>> task tick-based stime: 2551000000 ->3547000000, increased 996000000
>> scheduler rtime:       333060702626->334063718912, increased 1003016286
>>
>> The application actually runs at almost 100% sys at the moment; we can
>> use the increases in the task tick-based utime and stime to double-check:
>> 996000000/(7000000+996000000) > 99% sys
>>
>> 2) the current cputime_adjust() inaccurate result
>> But with the current cputime_adjust(), we get the following adjusted
>> utime and stime increases in this sample period:
>> adjusted utime: 330733635372->330827229702, increased 93594330
>> adjusted stime: 2327067254 ->3236489210, increased 909421956
>>
>> so 909421956/(93594330+909421956) = 91% sys, as the shell script shows
>> above.
>>
>> 3) root cause
>> The root cause of the issue is that the current cputime_adjust() always
>> passes the whole times to scale_stime() to split the whole utime and
>> stime.
>> In this patch, we instead pass all the deltas accrued within the user's
>> sample period (as computed in 1) above) to scale_stime(), and accumulate
>> the results onto the previously saved adjusted utime and stime, thereby
>> guaranteeing accurate usr and sys increases within the user's sample
>> period.
>
> But why is this a problem?
>
> Since it's sample based there's really nothing much you can guarantee.
> What if your test program were to run in userspace for 50% of the time
> but is so constructed to always be in kernel space when the tick
> happens?
>
> Then you would 'expect' it to be 50% user and 50% sys, but you're also
> not getting that.
>
> This stuff cannot be perfect, and the current code provides 'sensible'
> numbers over the long run for most programs. Why muck with it?
>

Basically I am OK with the current implementation, except for one
scenario we've hit: when the kernel goes wrong for some reason and
suddenly runs 100% sys for seconds (even triggering a softlockup), the
statistics monitor does not reflect that fact, which confuses people.

One example with our per-cgroup top: we once saw "20% usr, 80% sys"
displayed while the kernel was actually in a busy loop (100% sys) at
that moment; the tick-based samples in such a case are of course all
sys.
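For what it's worth, the arithmetic in 1) and 2) above can be reproduced
with a quick sketch (Python here for brevity; scale() below is a
simplified stand-in for the kernel's scale_stime() — the same
proportional split, ignoring the kernel's overflow-avoiding division
path):

```python
# All values in ns, taken from the two trace entries above.

def scale(rtime, utime, stime):
    """Split rtime proportionally to the tick-based utime/stime ratio,
    as scale_stime() does: stime_scaled = rtime * stime / (utime + stime)."""
    stime_scaled = rtime * stime // (utime + stime)
    return rtime - stime_scaled, stime_scaled

# trace entry 1: tick-based utime/stime and scheduler rtime
u1, s1, r1 = 362560000000, 2551000000, 333060702626
# trace entry 2, one sample period later
u2, s2, r2 = 362567000000, 3547000000, 334063718912

# Current behaviour: scale the whole accumulated times on each call,
# then diff the results across the sample period.
_, st1 = scale(r1, u1, s1)
_, st2 = scale(r2, u2, s2)
sys_whole = (st2 - st1) / (r2 - r1)   # ~0.91, the "91% sys" above

# Patched behaviour: scale only this sample period's deltas.
_, st_delta = scale(r2 - r1, u2 - u1, s2 - s1)
sys_delta = st_delta / (r2 - r1)      # ~0.99, matching the tick samples

print(f"whole-times split: {sys_whole:.0%} sys")
print(f"delta-based split: {sys_delta:.0%} sys")
```

This also shows where the dilution comes from: splitting the whole
accumulated times drags in the long history where stime was near zero,
so a short burst of pure sys time is partly attributed to usr.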