Re: [PATCH v2] sched: Don't try to catch up excess steal time.

David Woodhouse <dwmw2@xxxxxxxxxxxxx> · Wed, 25 Sep 2024 12:45:55 +0100

On Wed, 2024-09-11 at 20:15 +0900, Suleiman Souhlal wrote:
> When steal time exceeds the measured delta when updating clock_task,
> we
> currently try to catch up the excess in future updates.
> However, this results in inaccurate run times for the future things
> using
> clock_task, as they end up getting additional steal time that did not
> actually happen.
> 
> For example, suppose a task in a VM runs for 10ms and had 15ms of
> steal
> time reported while it ran. clock_task rightly doesn't advance. Then,
> a
> different taks runs on the same rq for 10ms without any time stolen
> in
> the host.
> Because of the current catch up mechanism, clock_sched inaccurately
> ends
> up advancing by only 5ms instead of 10ms even though there wasn't any
> actual time stolen. The second task is getting charged for less time
> than it ran, even though it didn't deserve it.
> This can result in tasks getting more run time than they should
> actually
> get.
> 
> So, we instead don't make future updates pay back past excess stolen
> time.
> 
> Signed-off-by: Suleiman Souhlal <suleiman@xxxxxxxxxx>
> ---
> v2:
> - Slightly changed to simply moving one line up instead of adding
>   new variable.
> 
> v1:
> https://lore.kernel.org/lkml/20240806111157.1336532-1-suleiman@xxxxxxxxxx
> ---
>  kernel/sched/core.c | 2 +-
>  1 file changed, 1 insertion(+), 1 deletion(-)
> 
> diff --git a/kernel/sched/core.c b/kernel/sched/core.c
> index f3951e4a55e5..6c34de8b3fbb 100644
> --- a/kernel/sched/core.c
> +++ b/kernel/sched/core.c
> @@ -730,11 +730,11 @@ static void update_rq_clock_task(struct rq *rq,
> s64 delta)
>         if (static_key_false((&paravirt_steal_rq_enabled))) {
>                 steal = paravirt_steal_clock(cpu_of(rq));
>                 steal -= rq->prev_steal_time_rq;
> +               rq->prev_steal_time_rq += steal;

The above two lines are essentially:

	steal -= prev;
	prev += steal;

It's like one of those clever ways of exchanging two variables with
three XOR operations. I don't like it :)

Ultimately, you're just setting rq->prev_steal_time_rq to the latest
value that you just read from paravirt_steal_clock(). And then setting
'steal' to the delta between the new reading and the previous one.

The code above is *far* from obvious. At the very least it wants a
comment, but I'd rather see it refactored so that it doesn't need one. 

    u64 abs_steal_now = paravirt_steal_clock(cpu_of(rq));
    steal = abs_steal_now - rq->prev_steal_time_rq;
    rq->prev_steal_time_rq = abs_steal_now;

I'm still not utterly convinced this is the right thing to do, though.
It means you will constantly undermeasure the accounting of steal time
as you discard the excess each time.

The underlying bug here is that we are sampling the steal time and the
time slice at *different* times. This update_rq_clock_task() function
could be called with a calculated 'delta' argument... and then
experience a large amount of steal time *before* calling
paravirt_steal_clock(), which is how we end up in the situation where
the calculated steal time exceeds the running time of the previous
task.

Which task *should* that steal time be accounted to? At the moment it
ends up being accounted to the next task to run — which seems to make
sense to me. In the situation I just described, we can consider the
time stolen in update_rq_clock_task() itself to have been stolen from
the *incoming* task, not the *outgoing* one. But that seems to be what
you're objecting to?

In
https://lore.kernel.org/all/20240522001817.619072-22-dwmw2@xxxxxxxxxxxxx/
I put a limit on the amount of steal time carried forward from one
timeslice to the next, as it was misbehaving when a bad hypervisor
reported negative steal time. But I don't think the limit should be
zero.

Of course, *ideally* we'd be able to sample the time and steal time
*simultaneously*, with a single sched_clock_cpu_and_steal() function so
that we don't have to deal with this slop between readings. Then none
of this would be necessary. But that seems hard.
Attachment:
smime.p7s

Description: S/MIME cryptographic signature