On Tue, Jul 17, 2018 at 12:08:36PM +0800, Xunlei Pang wrote:
> The trace data corresponds to the last sample period:
>
> trace entry 1:
>          cat-20755 [022] d...  1370.106496: cputime_adjust: task
> tick-based utime 362560000000 stime 2551000000, scheduler rtime 333060702626
>          cat-20755 [022] d...  1370.106497: cputime_adjust: result:
> old utime 330729718142 stime 2306983867, new utime 330733635372 stime 2327067254
>
> trace entry 2:
>          cat-20773 [005] d...  1371.109825: cputime_adjust: task
> tick-based utime 362567000000 stime 3547000000, scheduler rtime 334063718912
>          cat-20773 [005] d...  1371.109826: cputime_adjust: result:
> old utime 330733635372 stime 2327067254, new utime 330827229702 stime 3236489210
>
> 1) expected behaviour
> Let's compare the last two trace entries (all the data below is in ns):
> task tick-based utime: 362560000000 -> 362567000000, increased 7000000
> task tick-based stime: 2551000000   -> 3547000000,   increased 996000000
> scheduler rtime:       333060702626 -> 334063718912, increased 1003016286
>
> The application actually runs at almost 100% sys at the moment; we can
> use the increase of the task tick-based utime and stime to double check:
> 996000000/(7000000+996000000) > 99% sys
>
> 2) the current cputime_adjust() inaccurate result
> But with the current cputime_adjust(), we get the following adjusted
> utime and stime increases in this sample period:
> adjusted utime: 330733635372 -> 330827229702, increased 93594330
> adjusted stime: 2327067254   -> 3236489210,   increased 909421956
>
> so 909421956/(93594330+909421956) = 91% sys, as the shell script shows above.
>
> 3) root cause
> The root cause of the issue is that the current cputime_adjust() always
> passes the whole times to scale_stime() to split the whole utime and
> stime. In this patch, we instead pass the deltas accumulated within the
> user's sample period, as computed in 1), to scale_stime(), and add the
> results to the previously saved adjusted utime and stime, which
> guarantees an accurate usr/sys split within the user's sample period.

But why is this a problem? Since it's sample-based there's really
nothing much you can guarantee.

What if your test program were to run in userspace for 50% of the time
but were so constructed as to always be in kernel space when the tick
happens? Then you would 'expect' it to be 50% user and 50% sys, but
you're also not getting that.

This stuff cannot be perfect, and the current code provides 'sensible'
numbers over the long run for most programs. Why muck with it?
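[For context, the discrepancy in the report can be reproduced from the trace
numbers alone. The sketch below is a minimal user-space program, not the
kernel's scale_stime() (which has different overflow handling on 32-bit); it
only mimics the proportional split stime = rtime * stime / (utime + stime),
first on the whole accumulated times (current behaviour) and then on the
per-sample-period deltas (the patch's proposal).]

/* cputime_split_demo.c - reproduce the 91% vs 99% sys split from the trace.
 * Build: gcc -O2 cputime_split_demo.c -o cputime_split_demo
 */
#include <stdio.h>
#include <stdint.h>

/* Proportional split in the spirit of scale_stime():
 * stime = rtime * stime / (utime + stime).
 * A 128-bit intermediate avoids overflow for ns-scale values.
 */
static uint64_t split_stime(uint64_t stime, uint64_t rtime, uint64_t total)
{
	return (uint64_t)(((unsigned __int128)stime * rtime) / total);
}

int main(void)
{
	/* Tick-based utime/stime and scheduler rtime from the two trace entries (ns). */
	uint64_t utime1 = 362560000000ULL, stime1 = 2551000000ULL, rtime1 = 333060702626ULL;
	uint64_t utime2 = 362567000000ULL, stime2 = 3547000000ULL, rtime2 = 334063718912ULL;

	/* Current behaviour: split the whole accumulated times every time. */
	uint64_t s1 = split_stime(stime1, rtime1, utime1 + stime1);
	uint64_t s2 = split_stime(stime2, rtime2, utime2 + stime2);
	uint64_t u1 = rtime1 - s1, u2 = rtime2 - s2;
	printf("whole-value split: sys share of period = %.0f%%\n",
	       100.0 * (s2 - s1) / ((s2 - s1) + (u2 - u1)));

	/* Proposed behaviour: split only the deltas of this sample period. */
	uint64_t du_tick = utime2 - utime1, ds_tick = stime2 - stime1;
	uint64_t dr = rtime2 - rtime1;
	uint64_t ds = split_stime(ds_tick, dr, du_tick + ds_tick);
	printf("delta-based split: sys share of period = %.0f%%\n",
	       100.0 * ds / dr);
	return 0;
}

[This prints roughly 91% for the whole-value split and 99% for the
delta-based split, matching the figures in the report; the intermediate
values also match the "new utime/stime" pairs in the trace entries.]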