On Tue, Aug 06, 2024 at 06:51:36PM -0400, Joel Fernandes wrote:
> On Tue, Aug 6, 2024 at 7:13 AM Suleiman Souhlal <suleiman@xxxxxxxxxx> wrote:
> >
> > When steal time exceeds the measured delta when updating clock_task, we
> > currently try to catch up the excess in future updates.
> > However, this results in inaccurate run times for the future clock_task
> > measurements, as they end up getting additional steal time that did not
> > actually happen, from the previous excess steal time being paid back.
> >
> > For example, suppose a task in a VM runs for 10ms and had 15ms of steal
> > time reported while it ran. clock_task rightly doesn't advance. Then, a
> > different task runs on the same rq for 10ms without any time stolen.
> > Because of the current catch up mechanism, clock_sched inaccurately ends
> > up advancing by only 5ms instead of 10ms even though there wasn't any
> > actual time stolen. The second task is getting charged for less time
> > than it ran, even though it didn't deserve it.
> > In other words, tasks can end up getting more run time than they should
> > actually get.
> >
> > So, we instead don't make future updates pay back past excess stolen time.
> >
> > Signed-off-by: Suleiman Souhlal <suleiman@xxxxxxxxxx>
> > ---
> >  kernel/sched/core.c | 6 ++++--
> >  1 file changed, 4 insertions(+), 2 deletions(-)
> >
> > diff --git a/kernel/sched/core.c b/kernel/sched/core.c
> > index bcf2c4cc0522..42b37da2bda6 100644
> > --- a/kernel/sched/core.c
> > +++ b/kernel/sched/core.c
> > @@ -728,13 +728,15 @@ static void update_rq_clock_task(struct rq *rq, s64 delta)
> >  #endif
> >  #ifdef CONFIG_PARAVIRT_TIME_ACCOUNTING
> >  	if (static_key_false((&paravirt_steal_rq_enabled))) {
> > -		steal = paravirt_steal_clock(cpu_of(rq));
> > +		u64 prev_steal;
> > +
> > +		steal = prev_steal = paravirt_steal_clock(cpu_of(rq));
> >  		steal -= rq->prev_steal_time_rq;
> >
> >  		if (unlikely(steal > delta))
> >  			steal = delta;
> >
> > -		rq->prev_steal_time_rq += steal;
> > +		rq->prev_steal_time_rq = prev_steal;
> >  		delta -= steal;
>
> Makes sense, but wouldn't this patch also do the following: If vCPU
> task is the only one running and has a large steal time, then
> sched_tick() will only freeze the clock for a shorter period, and not
> give future credits to the vCPU task itself? Maybe it does not matter
> (and I probably don't understand the code enough) but thought I would
> mention.

The patch should still be doing the right thing in that situation:
The clock will be frozen for the whole duration that it ran, and delta
will be 0.
The current excess amount is not relevant to the future, as far as I
can tell.
The pre-patch code is giving the rq extra time that it hadn't measured.
I don't really see why it should be getting that extra time.

>
> I am also not sure if the purpose of stealtime is to credit individual
> tasks, or rather all tasks on the runqueue because the "whole
> runqueue" had time stolen.. No where in this function is it dealing
> with individual tasks but rather the rq itself.

This function is used to update clock_task, which *is* relevant to
individual tasks. It is used to calculate how long tasks ran for (and
for load averages).
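
To make the 10ms/15ms example from the commit message concrete, here is
a minimal userspace sketch. It is not kernel code: struct rq_model,
advance_old(), advance_new() and run_example() are hypothetical names
made up for illustration. They only mirror the steal arithmetic in the
hunk above, ignore the irq-time part of update_rq_clock_task(), and use
milliseconds instead of nanoseconds for readability.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical stand-in for the two rq fields that matter here. */
struct rq_model {
	uint64_t prev_steal_time_rq;	/* steal clock value already accounted */
	uint64_t clock_task;		/* time charged to tasks on this rq */
};

/*
 * Pre-patch logic: only the clamped amount is added to
 * prev_steal_time_rq, so any excess steal is carried into later updates.
 * Returns how far clock_task advanced.
 */
int64_t advance_old(struct rq_model *rq, uint64_t steal_clock, int64_t delta)
{
	int64_t steal = (int64_t)(steal_clock - rq->prev_steal_time_rq);

	if (steal > delta)
		steal = delta;

	rq->prev_steal_time_rq += steal;
	delta -= steal;
	rq->clock_task += delta;
	return delta;
}

/*
 * Post-patch logic: prev_steal_time_rq is set to the raw steal clock
 * reading, so excess steal is dropped instead of being paid back later.
 */
int64_t advance_new(struct rq_model *rq, uint64_t steal_clock, int64_t delta)
{
	uint64_t prev_steal = steal_clock;
	int64_t steal = (int64_t)(prev_steal - rq->prev_steal_time_rq);

	if (steal > delta)
		steal = delta;

	rq->prev_steal_time_rq = prev_steal;
	delta -= steal;
	rq->clock_task += delta;
	return delta;
}

/*
 * Replays the commit message scenario on both models:
 * update 1: 10ms elapsed, cumulative steal clock reads 15ms.
 * update 2: 10ms elapsed, steal clock still reads 15ms (nothing stolen).
 */
void run_example(void)
{
	struct rq_model old_rq = { 0, 0 };
	struct rq_model new_rq = { 0, 0 };

	assert(advance_old(&old_rq, 15, 10) == 0);
	assert(advance_old(&old_rq, 15, 10) == 5);	/* 5ms paid back */

	assert(advance_new(&new_rq, 15, 10) == 0);
	assert(advance_new(&new_rq, 15, 10) == 10);	/* full 10ms charged */
}
```

Under these assumptions, the first update leaves clock_task frozen in
both versions. On the second update the old logic advances clock_task
by only 5ms, while the new logic advances it by the full 10ms that the
second task actually ran.
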
-- Suleiman