On Mon, Nov 18, 2024 at 01:37:45PM +0900, Suleiman Souhlal wrote: > When steal time exceeds the measured delta when updating clock_task, we > currently try to catch up the excess in future updates. > However, this results in inaccurate run times for the future things using > clock_task, in some situations, as they end up getting additional steal > time that did not actually happen. > This is because there is a window between reading the elapsed time in > update_rq_clock() and sampling the steal time in update_rq_clock_task(). > If the VCPU gets preempted between those two points, any additional > steal time is accounted to the outgoing task even though the calculated > delta did not actually contain any of that "stolen" time. > When this race happens, we can end up with steal time that exceeds the > calculated delta, and the previous code would try to catch up that excess > steal time in future clock updates, which is given to the next, > incoming task, even though it did not actually have any time stolen. > > This behavior is particularly bad when steal time can be very long, > which we've seen when trying to extend steal time to contain the duration > that the host was suspended [0]. When this happens, clock_task stays > frozen, during which the running task stays running for the whole > duration, since its run time doesn't increase. > However the race can happen even under normal operation. > > Ideally we would read the elapsed cpu time and the steal time atomically, > to prevent this race from happening in the first place, but doing so > is non-trivial. > > Since the time between those two points isn't otherwise accounted anywhere, > neither to the outgoing task nor the incoming task (because the "end of > outgoing task" and "start of incoming task" timestamps are the same), > I would argue that the right thing to do is to simply drop any excess steal > time, in order to prevent these issues. > > [0] https://lore.kernel.org/kvm/20240820043543.837914-1-suleiman@xxxxxxxxxx/ > > Signed-off-by: Suleiman Souhlal <suleiman@xxxxxxxxxx> Right.. uhm.. I don't particularly care much either way. Are other people with virt clue okay with this? > --- > kernel/sched/core.c | 6 ++++-- > 1 file changed, 4 insertions(+), 2 deletions(-) > > diff --git a/kernel/sched/core.c b/kernel/sched/core.c > index a1c353a62c56..13f70316ef39 100644 > --- a/kernel/sched/core.c > +++ b/kernel/sched/core.c > @@ -766,13 +766,15 @@ static void update_rq_clock_task(struct rq *rq, s64 delta) > #endif > #ifdef CONFIG_PARAVIRT_TIME_ACCOUNTING > if (static_key_false((¶virt_steal_rq_enabled))) { > - steal = paravirt_steal_clock(cpu_of(rq)); > + u64 prev_steal; > + > + steal = prev_steal = paravirt_steal_clock(cpu_of(rq)); > steal -= rq->prev_steal_time_rq; > > if (unlikely(steal > delta)) > steal = delta; > > - rq->prev_steal_time_rq += steal; > + rq->prev_steal_time_rq = prev_steal; > delta -= steal; > } > #endif > -- > 2.47.0.338.g60cca15819-goog >