Re: [RFC PATCH 3/3] sched, x86: Make the scheduler guest unhalted aware

"Sieber, Fernand" <sieberf@xxxxxxxxxx> · Thu, 27 Feb 2025 08:27:00 +0000

On Thu, 2025-02-27 at 08:34 +0100, Vincent Guittot wrote:
> On Tue, 18 Feb 2025 at 21:27, Fernand Sieber <sieberf@xxxxxxxxxx>
> wrote:
> > 
> > With guest hlt/mwait/pause pass through, the scheduler has no visibility
> > into real vCPU activity as it sees them all 100% active. As such, load
> > balancing cannot make informed decisions on where it is preferable to
> > collocate tasks when necessary. I.e. as far as the load balancer is
> > concerned, a halted vCPU and an idle polling vCPU look exactly the same
> > so it may decide that either should be preempted when in reality it
> > would be preferable to preempt the idle one.
> > 
> > This commit enlightens the scheduler to real guest activity in this
> > situation. Leveraging gtime unhalted, it adds a hook for kvm to
> > communicate to the scheduler the duration that a vCPU spends halted.
> > This is then used in PELT accounting to discount it from real activity.
> > This results in better placement and overall steal time reduction.
> 
> NAK, PELT accounts for time spent by se on the CPU.

I was essentially aiming to adjust this concept to "PELT accounts for
the time spent by se *unhalted* on the CPU". Would such an adjustment
of the definition cause problems?
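
To make the adjusted definition concrete, here is a rough userspace
sketch of what I mean. It is not the patch code: guest_unhalted_ns(),
toy_util_update() and TOY_SCALE are made-up names, and the 3/4 decay is
a stand-in for the real PELT geometric series, chosen only to keep the
arithmetic easy to follow. It also assumes the signal is updated once
per tick.

```c
#include <assert.h>
#include <stdint.h>

#define TOY_SCALE 1024ULL	/* hypothetical full-utilization value */

/* Time in the window that would still count as "running" (hypothetical). */
static uint64_t guest_unhalted_ns(uint64_t delta_ns, uint64_t halted_ns)
{
	/* Halted time cannot exceed the window; clamp defensively. */
	if (halted_ns > delta_ns)
		halted_ns = delta_ns;
	return delta_ns - halted_ns;
}

/*
 * Toy decaying average, NOT the real PELT math:
 *   new = 3/4 * old + 1/4 * (unhalted share of the window)
 */
static uint32_t toy_util_update(uint32_t util, uint64_t delta_ns,
				uint64_t halted_ns)
{
	uint64_t active = guest_unhalted_ns(delta_ns, halted_ns);
	uint64_t sample = delta_ns ? (active * TOY_SCALE) / delta_ns : 0;

	assert(sample <= TOY_SCALE);
	return (uint32_t)((3 * (uint64_t)util + sample) / 4);
}
```

With this toy model a vCPU that is halted for the whole window decays
towards 0, while one that never halts converges to TOY_SCALE, which is
the distinction the load balancer cannot see today.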

> If your thread/vcpu doesn't do anything but burn cycles, find another
> way to report that to the host

The main advantage of hooking into PELT is that it means that load
balancing will just work out of the box as it immediately adjusts the
sched_group util/load/runnable values.

It may be possible to scope down my change to load balancing without
touching PELT if that is not viable. For example, instead of using PELT,
we could potentially adjust the calculation of sgs->avg_load in
update_sg_lb_stats for overloaded groups to include a correcting factor
based on recent halted cycles of the CPU. The comparison of two
overloaded groups would then favor pulling tasks on the one that has
the most halted cycles. This approach is more scoped down as it doesn't
change the classification of scheduling groups; it only changes how
overloaded groups are compared. I would need to prototype it to see if
it works.
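
As a rough illustration of the idea (untested against the real code,
userspace only), the correcting factor could look something like the
sketch below. struct toy_sg_stats, halted_ns, window_ns and
toy_avg_load() are hypothetical names, and TOY_CAPACITY_SCALE stands in
for SCHED_CAPACITY_SCALE. The assumption here is that a group with more
recent halted time should end up with a lower avg_load, so that it is
less likely to be treated as the busier of two overloaded groups.

```c
#include <assert.h>
#include <stdint.h>

#define TOY_CAPACITY_SCALE 1024ULL	/* stands in for SCHED_CAPACITY_SCALE */

/* Hypothetical subset of the per-group load balancing stats. */
struct toy_sg_stats {
	uint64_t group_load;
	uint64_t group_capacity;
	uint64_t halted_ns;	/* assumed: recent halted time of the group */
	uint64_t window_ns;	/* assumed: window over which it was sampled */
};

/* avg_load scaled by the unhalted share of the sampling window. */
static uint64_t toy_avg_load(const struct toy_sg_stats *sgs)
{
	uint64_t avg, halted;

	assert(sgs);
	if (!sgs->group_capacity)
		return 0;

	avg = sgs->group_load * TOY_CAPACITY_SCALE / sgs->group_capacity;
	if (!sgs->window_ns)
		return avg;	/* no halted data: leave avg_load as is */

	halted = sgs->halted_ns < sgs->window_ns ? sgs->halted_ns
						 : sgs->window_ns;
	/* Toy code: no overflow protection for very large values. */
	return avg * (sgs->window_ns - halted) / sgs->window_ns;
}
```

Two groups with identical load and capacity would then compare
differently if one of them spent, say, 3ms of a 4ms window halted: its
corrected avg_load is a quarter of the other one's.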

Let me know whether this goes in the right direction, or if you have
other ideas for alternative options.

> Furthermore this breaks all the hierarchy dependency

I don't understand this comment. Could you please provide more
details?

> 
> > 
> > This initial implementation assumes that non-idle CPUs are ticking as it
> > hooks the unhalted time into the PELT decaying load accounting. As such
> > it doesn't work well if PELT is updated infrequently with large chunks of
> > halted time. This is not a fundamental limitation but more complex
> > accounting is needed to generalize the use case to nohz full.
