On 5/22/24 5:47 AM, David Woodhouse wrote: > From: David Woodhouse <dwmw@xxxxxxxxxxxx> > > In steal_account_process_time(), a delta is calculated between the value > returned by paravirt_steal_clock(), and this_rq()->prev_steal_time which > is assumed to be the *previous* value returned by paravirt_steal_clock(). > > However, instead of just assigning the newly-read value directly into > ->prev_steal_time for use in the next iteration, ->prev_steal_time is > *incremented* by the calculated delta. Does this happen because ULONG_MAX and u64 aren't of same size? If so, would using the u64 variant of MAX would be a simpler fix? > > This used to be roughly the same, modulo conversion to jiffies and back, > until commit 807e5b80687c0 ("sched/cputime: Add steal time support to > full dynticks CPU time accounting") started clamping that delta to a > maximum of the actual time elapsed. So now, if the value returned by > paravirt_steal_clock() jumps by a large amount, instead of a *single* > period of reporting 100% steal time, the system will report 100% steal > time for as long as it takes to "catch up" with the reported value. > Which is up to 584 years. > > But there is a benefit to advancing ->prev_steal_time only by the time > which was *accounted* as having been stolen. It means that any extra > time truncated by the clamping will be accounted in the next sample > period rather than lost. Given the stochastic nature of the sampling, > that is more accurate overall. > > So, continue to advance ->prev_steal_time by the accounted value as > long as the delta isn't egregiously large (for which, use maxtime * 2). > If the delta is more than that, just set ->prev_steal_time directly to > the value returned by paravirt_steal_clock(). > > Fixes: 807e5b80687c0 ("sched/cputime: Add steal time support to full dynticks CPU time accounting") > Signed-off-by: David Woodhouse <dwmw@xxxxxxxxxxxx> > --- > kernel/sched/cputime.c | 20 ++++++++++++++------ > 1 file changed, 14 insertions(+), 6 deletions(-) > > diff --git a/kernel/sched/cputime.c b/kernel/sched/cputime.c > index af7952f12e6c..3a8a8b38966d 100644 > --- a/kernel/sched/cputime.c > +++ b/kernel/sched/cputime.c > @@ -254,13 +254,21 @@ static __always_inline u64 steal_account_process_time(u64 maxtime) > { > #ifdef CONFIG_PARAVIRT > if (static_key_false(¶virt_steal_enabled)) { > - u64 steal; > - > - steal = paravirt_steal_clock(smp_processor_id()); > - steal -= this_rq()->prev_steal_time; > - steal = min(steal, maxtime); > + u64 steal, abs_steal; > + > + abs_steal = paravirt_steal_clock(smp_processor_id()); > + steal = abs_steal - this_rq()->prev_steal_time; > + if (unlikely(steal > maxtime)) { > + /* > + * If the delta isn't egregious, it can be counted > + * in the next time period. Only advance by maxtime. > + */ > + if (steal < maxtime * 2) > + abs_steal = this_rq()->prev_steal_time + maxtime; > + steal = maxtime; > + } > account_steal_time(steal); > - this_rq()->prev_steal_time += steal; > + this_rq()->prev_steal_time = abs_steal; > > return steal; > }