On Mon, 2024-11-18 at 10:34 +0100, Peter Zijlstra wrote:
> On Mon, Nov 18, 2024 at 01:37:45PM +0900, Suleiman Souhlal wrote:
> > When steal time exceeds the measured delta when updating clock_task, we
> > currently try to catch up the excess in future updates.
> > However, this results in inaccurate run times for the future things using
> > clock_task, in some situations, as they end up getting additional steal
> > time that did not actually happen.
> > This is because there is a window between reading the elapsed time in
> > update_rq_clock() and sampling the steal time in update_rq_clock_task().
> > If the VCPU gets preempted between those two points, any additional
> > steal time is accounted to the outgoing task even though the calculated
> > delta did not actually contain any of that "stolen" time.
> > When this race happens, we can end up with steal time that exceeds the
> > calculated delta, and the previous code would try to catch up that excess
> > steal time in future clock updates, which is given to the next,
> > incoming task, even though it did not actually have any time stolen.
> >
> > This behavior is particularly bad when steal time can be very long,
> > which we've seen when trying to extend steal time to contain the duration
> > that the host was suspended [0]. When this happens, clock_task stays
> > frozen, during which the running task stays running for the whole
> > duration, since its run time doesn't increase.
> > However the race can happen even under normal operation.
> >
> > Ideally we would read the elapsed cpu time and the steal time atomically,
> > to prevent this race from happening in the first place, but doing so
> > is non-trivial.
> >
> > Since the time between those two points isn't otherwise accounted anywhere,
> > neither to the outgoing task nor the incoming task (because the "end of
> > outgoing task" and "start of incoming task" timestamps are the same),
> > I would argue that the right thing to do is to simply drop any excess steal
> > time, in order to prevent these issues.
> >
> > [0] https://lore.kernel.org/kvm/20240820043543.837914-1-suleiman@xxxxxxxxxx/
> >
> > Signed-off-by: Suleiman Souhlal <suleiman@xxxxxxxxxx>
>
> Right.. uhm.. I don't particularly care much either way. Are other
> people with virt clue okay with this?

I'm slightly dubious, because now we may systematically lose accounted
steal time where before it was all at least accounted *somewhere*, and
we might reasonably have expected the slight inaccuracies to balance
out over time.

But this *does* fix the main problem I was seeing: the kernel will
currently just keep attributing steal time to processes *forever* if a
buggy hypervisor lets it step backwards. So I can live with it.

Acked-by: David Woodhouse <dwmw@xxxxxxxxxxxx>
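
[Editor's note: for readers following the thread, below is a minimal
standalone sketch of the "drop the excess" idea being acked above. It
is not the actual patch; the structure only loosely mirrors
update_rq_clock_task(), and toy_rq, toy_update_clock_task(), prev_steal
and the nanosecond values in main() are hypothetical stand-ins for the
kernel's per-rq state.]

#include <stdint.h>
#include <stdio.h>

/*
 * Simplified model of the steal-time handling discussed above.
 * 'steal_clock' is the hypervisor's cumulative steal counter,
 * 'prev_steal' is how much of it we have already consumed, and
 * 'delta' is the elapsed time measured earlier, as in update_rq_clock().
 */
struct toy_rq {
	uint64_t clock_task;	/* task clock, excludes steal time */
	uint64_t prev_steal;	/* steal time already accounted    */
};

static void toy_update_clock_task(struct toy_rq *rq,
				  uint64_t steal_clock, uint64_t delta)
{
	uint64_t steal = steal_clock - rq->prev_steal;

	/* Clamp: never subtract more steal time than actually elapsed. */
	if (steal > delta)
		steal = delta;

	/*
	 * Drop the excess instead of carrying it over: advance prev_steal
	 * all the way to the hypervisor's counter, so a later update does
	 * not try to "catch up" steal time that raced with the measurement.
	 */
	rq->prev_steal = steal_clock;

	rq->clock_task += delta - steal;
}

int main(void)
{
	struct toy_rq rq = { 0, 0 };

	/* 10 ms elapsed, but 15 ms of steal reported due to the race. */
	toy_update_clock_task(&rq, 15000000, 10000000);

	/* Next update: another 10 ms elapsed, no new steal. */
	toy_update_clock_task(&rq, 15000000, 10000000);

	/* Prints 10000000: the second update's delta is not eaten. */
	printf("clock_task = %llu ns\n", (unsigned long long)rq.clock_task);
	return 0;
}

[With the pre-patch catch-up behaviour described in the quoted changelog,
the 5 ms of excess from the first update would instead have been
subtracted from the second update's delta, charging phantom steal time
to whichever task was running by then.]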