On Tue, Feb 18, 2025, Fernand Sieber wrote:
> With guest hlt, pause, and mwait pass-through, the hypervisor loses
> visibility into real guest CPU activity. From the point of view of the
> host, such vCPUs are always 100% active even when the guest is
> completely halted.
>
> Typically, hlt, pause, and mwait pass-through is only implemented on
> non-timeshared pCPUs. However, there are cases where this assumption
> cannot be strictly met, as some occasional housekeeping work needs to be

What housekeeping work?

> scheduled on such CPUs while we generally want to preserve the
> pass-through performance gains. This applies to systems which don't
> have dedicated CPUs for housekeeping purposes.
>
> In such cases, the hypervisor's lack of visibility is problematic from
> a load balancing point of view. In the absence of a better signal, it
> will preempt vCPUs at random. For example, it could decide to interrupt
> a vCPU doing critical idle-poll work while another vCPU sits idle.
>
> Another motivation for gaining visibility into real guest CPU activity
> is to enable the hypervisor to vend metrics about it for external
> consumption.

Such as?

> In this RFC we introduce the concept of guest halted time to address
> these concerns. Guest halted time (gtime_halted) accounts for cycles
> spent in guest mode while the CPU is halted. gtime_halted relies on
> sampling the MPERF MSR (x86) around VM-Enter/VM-Exit to compute the
> number of unhalted cycles; halted cycles are then derived from the TSC
> difference minus the MPERF difference.

IMO, there are better ways to solve this than having KVM sample MPERF on
every entry and exit. The kernel already samples APERF/MPERF on every
tick and provides that information via /proc/cpuinfo, just use that (see
the userspace sketch at the end of this reply). If your userspace is
unable to use /proc/cpuinfo or similar, that needs to be explained.

And if you're running vCPUs on tickless CPUs, and you're doing HLT/MWAIT
passthrough, *and* you want to schedule other tasks on those CPUs, then
IMO you're abusing all of those things and it's not KVM's problem to
solve, especially now that sched_ext is a thing.

> gtime_halted is exposed in /proc/<pid>/stat as a new entry, which
> enables users to monitor real guest activity.
>
> gtime_halted is also plumbed into the scheduler infrastructure to
> discount halted cycles from fair load accounting. This enlightens the
> load balancer about real guest activity for better task placement.
>
> This initial RFC has a few limitations and open questions:
> * only the x86 infrastructure is supported, as it relies on
>   architecture-dependent registers. Future development will extend
>   this to ARM.
> * we assume that MPERF accumulates at the same rate as TSC. While I am
>   not certain whether this assumption is ever violated, the spec
>   doesn't seem to offer this guarantee [1], so we may want to
>   calibrate MPERF.
> * the sched enlightenment logic relies on periodic gtime_halted
>   updates. As such, it is incompatible with nohz_full, because this
>   could result in long periods of no update followed by a massive
>   halted time update, which doesn't play well with the existing PELT
>   integration. It is possible to address this limitation with
>   generalized, more complex accounting.
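
To make sure I'm reading the cover letter correctly: my understanding is
that the proposed accounting boils down to something like the below.
Note, this is my own sketch, not the actual patch; it assumes MPERF
accumulates at the TSC rate (one of the open questions above), and
gtime_halted is the new per-vCPU field proposed by the RFC.

	/*
	 * Sketch of the proposed accounting (untested).  MPERF only
	 * counts while the CPU is active (C0), so any TSC cycles not
	 * matched by MPERF cycles were spent halted.
	 */
	static void account_guest_halted(struct kvm_vcpu *vcpu)
	{
		u64 tsc_start, tsc_end, mperf_start, mperf_end;

		rdmsrl(MSR_IA32_MPERF, mperf_start);
		tsc_start = rdtsc();

		/* ... VM-Enter, guest runs (possibly halted), VM-Exit ... */

		rdmsrl(MSR_IA32_MPERF, mperf_end);
		tsc_end = rdtsc();

		/* gtime_halted is the field this RFC proposes to add. */
		vcpu->gtime_halted += (tsc_end - tsc_start) -
				      (mperf_end - mperf_start);
	}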
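
And a minimal userspace sketch of the alternative suggested above, i.e.
read the "cpu MHz" field from /proc/cpuinfo, which on x86 is derived
from the kernel's periodic APERF/MPERF sampling. Untested, and it
assumes the current procfs field layout.

	#include <stdio.h>

	int main(void)
	{
		char line[256];
		int cpu = -1;
		double mhz;
		FILE *f = fopen("/proc/cpuinfo", "r");

		if (!f)
			return 1;

		while (fgets(line, sizeof(line), f)) {
			if (sscanf(line, "processor : %d", &cpu) == 1)
				continue;
			/* "cpu MHz" reflects the tick-based APERF/MPERF samples. */
			if (sscanf(line, "cpu MHz : %lf", &mhz) == 1)
				printf("cpu%d: %.0f MHz\n", cpu, mhz);
		}
		fclose(f);
		return 0;
	}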