On Wed, Sep 25, 2024 at 12:45:55PM +0100, David Woodhouse wrote: > On Wed, 2024-09-11 at 20:15 +0900, Suleiman Souhlal wrote: > > When steal time exceeds the measured delta when updating clock_task, > > we > > currently try to catch up the excess in future updates. > > However, this results in inaccurate run times for the future things > > using > > clock_task, as they end up getting additional steal time that did not > > actually happen. > > > > For example, suppose a task in a VM runs for 10ms and had 15ms of > > steal > > time reported while it ran. clock_task rightly doesn't advance. Then, > > a > > different taks runs on the same rq for 10ms without any time stolen > > in > > the host. > > Because of the current catch up mechanism, clock_sched inaccurately > > ends > > up advancing by only 5ms instead of 10ms even though there wasn't any > > actual time stolen. The second task is getting charged for less time > > than it ran, even though it didn't deserve it. > > This can result in tasks getting more run time than they should > > actually > > get. > > > > So, we instead don't make future updates pay back past excess stolen > > time. > > > > Signed-off-by: Suleiman Souhlal <suleiman@xxxxxxxxxx> > > --- > > v2: > > - Slightly changed to simply moving one line up instead of adding > > new variable. > > > > v1: > > https://lore.kernel.org/lkml/20240806111157.1336532-1-suleiman@xxxxxxxxxx > > --- > > kernel/sched/core.c | 2 +- > > 1 file changed, 1 insertion(+), 1 deletion(-) > > > > diff --git a/kernel/sched/core.c b/kernel/sched/core.c > > index f3951e4a55e5..6c34de8b3fbb 100644 > > --- a/kernel/sched/core.c > > +++ b/kernel/sched/core.c > > @@ -730,11 +730,11 @@ static void update_rq_clock_task(struct rq *rq, > > s64 delta) > > if (static_key_false((¶virt_steal_rq_enabled))) { > > steal = paravirt_steal_clock(cpu_of(rq)); > > steal -= rq->prev_steal_time_rq; > > + rq->prev_steal_time_rq += steal; > > The above two lines are essentially: > > steal -= prev; > prev += steal; > > It's like one of those clever ways of exchanging two variables with > three XOR operations. I don't like it :) > > Ultimately, you're just setting rq->prev_steal_time_rq to the latest > value that you just read from paravirt_steal_clock(). And then setting > 'steal' to the delta between the new reading and the previous one. > > The code above is *far* from obvious. At the very least it wants a > comment, but I'd rather see it refactored so that it doesn't need one. > > u64 abs_steal_now = paravirt_steal_clock(cpu_of(rq)); > steal = abs_steal_now - rq->prev_steal_time_rq; > rq->prev_steal_time_rq = abs_steal_now; That is what v1 did: https://lore.kernel.org/lkml/20240806111157.1336532-1-suleiman@xxxxxxxxxx/ It is also more obvious to me, but the feedback I received was that the way in the current iteration is better. I don't feel strongly about it, and I'd be ok with either version applied. > > I'm still not utterly convinced this is the right thing to do, though. > It means you will constantly undermeasure the accounting of steal time > as you discard the excess each time. > > The underlying bug here is that we are sampling the steal time and the > time slice at *different* times. This update_rq_clock_task() function > could be called with a calculated 'delta' argument... and then > experience a large amount of steal time *before* calling > paravirt_steal_clock(), which is how we end up in the situation where > the calculated steal time exceeds the running time of the previous > task. > > Which task *should* that steal time be accounted to? At the moment it > ends up being accounted to the next task to run — which seems to make > sense to me. In the situation I just described, we can consider the > time stolen in update_rq_clock_task() itself to have been stolen from > the *incoming* task, not the *outgoing* one. But that seems to be what > you're objecting to? This is a good description of the problem, except that the time stolen in update_rq_clock_task() itself is actually being stolen from the outgoing task. This is because we are still trying to calculate how long it ran for (update_curr()), and time hasn't started ticking for the incoming task yet. We haven't set the incoming task's exec_start with the new clock_task time yet. So, in my opinion, it's wrong to give that time to the incoming task. > > In > https://lore.kernel.org/all/20240522001817.619072-22-dwmw2@xxxxxxxxxxxxx/ > I put a limit on the amount of steal time carried forward from one > timeslice to the next, as it was misbehaving when a bad hypervisor > reported negative steal time. But I don't think the limit should be > zero. > > Of course, *ideally* we'd be able to sample the time and steal time > *simultaneously*, with a single sched_clock_cpu_and_steal() function so > that we don't have to deal with this slop between readings. Then none > of this would be necessary. But that seems hard. I agree that that would be ideal. Thanks, -- Suleiman