On Wed, 2024-09-11 at 20:15 +0900, Suleiman Souhlal wrote: > When steal time exceeds the measured delta when updating clock_task, > we > currently try to catch up the excess in future updates. > However, this results in inaccurate run times for the future things > using > clock_task, as they end up getting additional steal time that did not > actually happen. > > For example, suppose a task in a VM runs for 10ms and had 15ms of > steal > time reported while it ran. clock_task rightly doesn't advance. Then, > a > different taks runs on the same rq for 10ms without any time stolen > in > the host. > Because of the current catch up mechanism, clock_sched inaccurately > ends > up advancing by only 5ms instead of 10ms even though there wasn't any > actual time stolen. The second task is getting charged for less time > than it ran, even though it didn't deserve it. > This can result in tasks getting more run time than they should > actually > get. > > So, we instead don't make future updates pay back past excess stolen > time. > > Signed-off-by: Suleiman Souhlal <suleiman@xxxxxxxxxx> > --- > v2: > - Slightly changed to simply moving one line up instead of adding > new variable. > > v1: > https://lore.kernel.org/lkml/20240806111157.1336532-1-suleiman@xxxxxxxxxx > --- > kernel/sched/core.c | 2 +- > 1 file changed, 1 insertion(+), 1 deletion(-) > > diff --git a/kernel/sched/core.c b/kernel/sched/core.c > index f3951e4a55e5..6c34de8b3fbb 100644 > --- a/kernel/sched/core.c > +++ b/kernel/sched/core.c > @@ -730,11 +730,11 @@ static void update_rq_clock_task(struct rq *rq, > s64 delta) > if (static_key_false((¶virt_steal_rq_enabled))) { > steal = paravirt_steal_clock(cpu_of(rq)); > steal -= rq->prev_steal_time_rq; > + rq->prev_steal_time_rq += steal; The above two lines are essentially: steal -= prev; prev += steal; It's like one of those clever ways of exchanging two variables with three XOR operations. I don't like it :) Ultimately, you're just setting rq->prev_steal_time_rq to the latest value that you just read from paravirt_steal_clock(). And then setting 'steal' to the delta between the new reading and the previous one. The code above is *far* from obvious. At the very least it wants a comment, but I'd rather see it refactored so that it doesn't need one. u64 abs_steal_now = paravirt_steal_clock(cpu_of(rq)); steal = abs_steal_now - rq->prev_steal_time_rq; rq->prev_steal_time_rq = abs_steal_now; I'm still not utterly convinced this is the right thing to do, though. It means you will constantly undermeasure the accounting of steal time as you discard the excess each time. The underlying bug here is that we are sampling the steal time and the time slice at *different* times. This update_rq_clock_task() function could be called with a calculated 'delta' argument... and then experience a large amount of steal time *before* calling paravirt_steal_clock(), which is how we end up in the situation where the calculated steal time exceeds the running time of the previous task. Which task *should* that steal time be accounted to? At the moment it ends up being accounted to the next task to run — which seems to make sense to me. In the situation I just described, we can consider the time stolen in update_rq_clock_task() itself to have been stolen from the *incoming* task, not the *outgoing* one. But that seems to be what you're objecting to? In https://lore.kernel.org/all/20240522001817.619072-22-dwmw2@xxxxxxxxxxxxx/ I put a limit on the amount of steal time carried forward from one timeslice to the next, as it was misbehaving when a bad hypervisor reported negative steal time. But I don't think the limit should be zero. Of course, *ideally* we'd be able to sample the time and steal time *simultaneously*, with a single sched_clock_cpu_and_steal() function so that we don't have to deal with this slop between readings. Then none of this would be necessary. But that seems hard.
Attachment:
smime.p7s
Description: S/MIME cryptographic signature