On Wed, 2024-09-25 at 13:25 +0000, Suleiman Souhlal wrote: > On Wed, Sep 25, 2024 at 12:45:55PM +0100, David Woodhouse wrote: > > On Wed, 2024-09-11 at 20:15 +0900, Suleiman Souhlal wrote: > > > When steal time exceeds the measured delta when updating clock_task, > > > we > > > currently try to catch up the excess in future updates. > > > However, this results in inaccurate run times for the future things > > > using > > > clock_task, as they end up getting additional steal time that did not > > > actually happen. > > > > > > For example, suppose a task in a VM runs for 10ms and had 15ms of > > > steal > > > time reported while it ran. clock_task rightly doesn't advance. Then, > > > a > > > different taks runs on the same rq for 10ms without any time stolen > > > in > > > the host. > > > Because of the current catch up mechanism, clock_sched inaccurately > > > ends > > > up advancing by only 5ms instead of 10ms even though there wasn't any > > > actual time stolen. The second task is getting charged for less time > > > than it ran, even though it didn't deserve it. > > > This can result in tasks getting more run time than they should > > > actually > > > get. > > > > > > So, we instead don't make future updates pay back past excess stolen > > > time. > > > > > > Signed-off-by: Suleiman Souhlal <suleiman@xxxxxxxxxx> > > > --- > > > v2: > > > - Slightly changed to simply moving one line up instead of adding > > > new variable. > > > > > > v1: > > > https://lore.kernel.org/lkml/20240806111157.1336532-1-suleiman@xxxxxxxxxx > > > --- > > > kernel/sched/core.c | 2 +- > > > 1 file changed, 1 insertion(+), 1 deletion(-) > > > > > > diff --git a/kernel/sched/core.c b/kernel/sched/core.c > > > index f3951e4a55e5..6c34de8b3fbb 100644 > > > --- a/kernel/sched/core.c > > > +++ b/kernel/sched/core.c > > > @@ -730,11 +730,11 @@ static void update_rq_clock_task(struct rq *rq, > > > s64 delta) > > > if (static_key_false((¶virt_steal_rq_enabled))) { > > > steal = paravirt_steal_clock(cpu_of(rq)); > > > steal -= rq->prev_steal_time_rq; > > > + rq->prev_steal_time_rq += steal; > > > > The above two lines are essentially: > > > > steal -= prev; > > prev += steal; > > > > It's like one of those clever ways of exchanging two variables with > > three XOR operations. I don't like it :) > > > > Ultimately, you're just setting rq->prev_steal_time_rq to the latest > > value that you just read from paravirt_steal_clock(). And then setting > > 'steal' to the delta between the new reading and the previous one. > > > > The code above is *far* from obvious. At the very least it wants a > > comment, but I'd rather see it refactored so that it doesn't need one. > > > > u64 abs_steal_now = paravirt_steal_clock(cpu_of(rq)); > > steal = abs_steal_now - rq->prev_steal_time_rq; > > rq->prev_steal_time_rq = abs_steal_now; > > That is what v1 did: > https://lore.kernel.org/lkml/20240806111157.1336532-1-suleiman@xxxxxxxxxx/ > > It is also more obvious to me, but the feedback I received was that > the way in the current iteration is better. > > I don't feel strongly about it, and I'd be ok with either version applied. Fair enough. Not really a hill anyone should choose to die on, I suppose. > > > > I'm still not utterly convinced this is the right thing to do, though. > > It means you will constantly undermeasure the accounting of steal time > > as you discard the excess each time. > > > > The underlying bug here is that we are sampling the steal time and the > > time slice at *different* times. This update_rq_clock_task() function > > could be called with a calculated 'delta' argument... and then > > experience a large amount of steal time *before* calling > > paravirt_steal_clock(), which is how we end up in the situation where > > the calculated steal time exceeds the running time of the previous > > task. > > > > Which task *should* that steal time be accounted to? At the moment it > > ends up being accounted to the next task to run — which seems to make > > sense to me. In the situation I just described, we can consider the > > time stolen in update_rq_clock_task() itself to have been stolen from > > the *incoming* task, not the *outgoing* one. But that seems to be what > > you're objecting to? > > This is a good description of the problem, except that the time stolen > in update_rq_clock_task() itself is actually being stolen from the > outgoing task. This is because we are still trying to calculate how long > it ran for (update_curr()), and time hasn't started ticking for the > incoming task yet. We haven't set the incoming task's exec_start with the > new clock_task time yet. > > So, in my opinion, it's wrong to give that time to the incoming task. That makes sense. That steal time is actually stolen from *neither* task, since it's after the 'end' timestamp of the outgoing task, and before the 'start' timestamp of the incoming task. So where *should* it be accounted? Or is it actually correct to drop it completely? If you can make a coherent case for the fact that dropping it is really the right thing to do (not *just* that it doesn't belong to the outgoing task, which is the case you make in your existing commit message), then I suppose I'm OK with your patch as-is. >
Attachment:
smime.p7s
Description: S/MIME cryptographic signature