On Thu, 2025-02-27 at 08:34 +0100, Vincent Guittot wrote:
> On Tue, 18 Feb 2025 at 21:27, Fernand Sieber <sieberf@xxxxxxxxxx> wrote:
> >
> > With guest hlt/mwait/pause pass through, the scheduler has no
> > visibility into real vCPU activity as it sees them all 100% active.
> > As such, load balancing cannot make informed decisions on where it
> > is preferable to collocate tasks when necessary. I.e., as far as the
> > load balancer is concerned, a halted vCPU and an idle polling vCPU
> > look exactly the same so it may decide that either should be
> > preempted when in reality it would be preferable to preempt the
> > idle one.
> >
> > This commit enlightens the scheduler to real guest activity in this
> > situation. Leveraging gtime unhalted, it adds a hook for kvm to
> > communicate to the scheduler the duration that a vCPU spends halted.
> > This is then used in PELT accounting to discount it from real
> > activity. This results in better placement and overall steal time
> > reduction.
>
> NAK, PELT accounts for time spent by se on the CPU.

I was essentially aiming to adjust this concept to "PELT accounts for the
time spent by se *unhalted* on the CPU". Would such an adjustment of the
definition cause problems?

> If your thread/vcpu doesn't do anything but burn cycles, find another
> way to report that to the host

The main advantage of hooking into PELT is that load balancing just works
out of the box, because the sched_group util/load/runnable values are
adjusted immediately.

If touching PELT is not viable, it may be possible to scope my change down
to load balancing only. For example, instead of using PELT we could
potentially adjust the calculation of sgs->avg_load in update_sg_lb_stats
for overloaded groups to include a correcting factor based on the recent
halted cycles of the CPU. The comparison of two overloaded groups would
then favor pulling tasks onto the one that has the most halted cycles.
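To make the idea concrete, here is a rough user-space sketch of such a
correcting factor. It is not a patch and has not been tested against the
kernel: struct sg_stats, the halted_ns and window_ns fields and
update_avg_load() are hypothetical names made up for illustration. Only the
base formula (group_load * SCHED_CAPACITY_SCALE / group_capacity) is meant
to mirror how avg_load is derived for overloaded groups, and the linear
scaling by the unhalted fraction is just one possible choice of factor.

```c
#include <assert.h>
#include <stdint.h>

#define SCHED_CAPACITY_SCALE 1024ULL

/* Hypothetical, simplified stand-in for the load-balancer group stats. */
struct sg_stats {
	uint64_t group_load;     /* summed load of the CPUs in the group */
	uint64_t group_capacity; /* summed capacity of the CPUs in the group */
	uint64_t halted_ns;      /* assumption: halted time in the recent window */
	uint64_t window_ns;      /* assumption: length of that window */
	uint64_t avg_load;       /* result: corrected average load */
};

/*
 * Compute avg_load for an overloaded group, then scale it by the fraction
 * of the recent window during which the group's vCPUs were not halted.
 */
static void update_avg_load(struct sg_stats *sgs)
{
	uint64_t unhalted_ns;

	assert(sgs->group_capacity != 0);

	sgs->avg_load = sgs->group_load * SCHED_CAPACITY_SCALE /
			sgs->group_capacity;

	/* No halted-time information available: leave avg_load untouched. */
	if (sgs->window_ns == 0)
		return;

	/* Clamp so that halted time can never exceed the window. */
	unhalted_ns = sgs->halted_ns >= sgs->window_ns ?
		      0 : sgs->window_ns - sgs->halted_ns;

	sgs->avg_load = sgs->avg_load * unhalted_ns / sgs->window_ns;
}
```

With two groups of identical load and capacity, the one whose vCPUs were
halted for half of the window ends up with half the avg_load, so it looks
less busy when two overloaded groups are compared.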
This approach is more scoped down as it doesn't change the classification
of scheduling groups; instead, it only changes how overloaded groups are
compared. I would need to prototype it to see if it works.

Let me know if this would go in the right direction, or if you have ideas
for alternative options.

> Furthermore this breaks all the hierarchy dependency

I don't understand this comment. Could you please provide more details?

> >
> > This initial implementation assumes that non-idle CPUs are ticking
> > as it hooks the unhalted time into the PELT decaying load accounting.
> > As such it doesn't work well if PELT is updated infrequently with
> > large chunks of halted time. This is not a fundamental limitation but
> > more complex accounting is needed to generalize the use case to nohz
> > full.