On Fri, Jan 17, 2025, Suleiman Souhlal wrote:
> On Thu, Jan 16, 2025 at 6:49 AM Sean Christopherson <seanjc@xxxxxxxxxx> wrote:
> >
> > On Tue, Jan 07, 2025, Suleiman Souhlal wrote:
> > > It returns the cumulative nanoseconds that the host has been suspended.
> > > It is intended to be used for reporting host suspend time to the guest.
> >
> > ...
> >
> > > #ifdef CONFIG_HAVE_KVM_PM_NOTIFIER
> > > +static int kvm_pm_notifier(struct kvm *kvm, unsigned long state)
> > > +{
> > > +	switch (state) {
> > > +	case PM_HIBERNATION_PREPARE:
> > > +	case PM_SUSPEND_PREPARE:
> > > +		last_suspend = ktime_get_boottime_ns();
> > > +	case PM_POST_HIBERNATION:
> > > +	case PM_POST_SUSPEND:
> > > +		total_suspend_ns += ktime_get_boottime_ns() - last_suspend;
> >
> > After spending too much time poking around kvmclock and sched_clock code,
> > I'm pretty sure that accounting *all* suspend time to steal_time is wildly
> > inaccurate for most clocksources that will be used by KVM x86 guests.
> >
> > KVM already adjusts TSC, and by extension kvmclock, to account for the TSC
> > going backwards due to suspend+resume.  I haven't dug super deep, but I
> > assume/hope the majority of suspend time is handled by massaging guest TSC.
> >
> > There's still a notable gap, as KVM's TSC adjustments likely won't account
> > for the lag between CPUs coming online and vCPUs being restarted, but I
> > don't know that having KVM account the suspend duration is the right way
> > to solve that issue.  (It is my understanding that steal time has no
> > impact on clock sources.)
>
> On our machines, the problem isn't that the TSC is going backwards.  As
> you say, kvmclock takes care of that.
>
> The problem these patches are trying to solve is that the time keeps
> advancing for the VM while the host is suspended.

Right, the issue is that because KVM adjusts guest TSC when the host TSC goes
backwards, the accounting will be all kinds of messed up.

 1. Initiate suspend at host TSC X, guest TSC X+Y.

 2. Save X into last_host_tsc via kvm_arch_vcpu_put():

	vcpu->arch.last_host_tsc = rdtsc();

 3. Resume after N hours, host TSC reset to 0 and starts counting.

 4. kvm_arch_enable_virtualization_cpu() runs at new host time Z.

 5. KVM detects the backwards TSC (Z < X) and adjusts the guest TSC offset so
    that guest TSC stays at/near X+Y, i.e. guest TSC becomes "Z + Y + (X - Z)":

	u64 delta_cyc = max_tsc - local_tsc;
	list_for_each_entry(kvm, &vm_list, vm_list) {
		kvm->arch.backwards_tsc_observed = true;
		kvm_for_each_vcpu(i, vcpu, kvm) {
			vcpu->arch.tsc_offset_adjustment += delta_cyc;
			vcpu->arch.last_host_tsc = local_tsc;
			kvm_make_request(KVM_REQ_MASTERCLOCK_UPDATE, vcpu);
		}
	}

Thus, if the guest is using the TSC for sched_clock, guest time does NOT keep
advancing.  kvmclock on the other hand counts from *host* boot, and so guest
time keeps advancing if the guest is using kvmclock.

#ifdef CONFIG_X86_64
static s64 get_kvmclock_base_ns(void)
{
	/* Count up from boot time, but with the frequency of the raw clock. */
	return ktime_to_ns(ktime_add(ktime_get_raw(),
				     pvclock_gtod_data.offs_boot));
}
#else
static s64 get_kvmclock_base_ns(void)
{
	/* Master clock not used, so we can just use CLOCK_BOOTTIME. */
	return ktime_get_boottime_ns();
}
#endif

In short, AFAICT the issues you are observing are mostly a problem with
kvmclock.  Or maybe it's the other way around, and effectively freezing guest
TSC is super problematic and fundamentally flawed.  Regardless of which one is
"broken", unconditionally accounting suspend time to steal_time will do the
wrong thing when sched_clock=tsc.

To further muddy the waters, current Linux-as-a-guest on modern hardware will
likely use clocksource=tsc, but sched_clock=kvmclock.  In that scenario, guest
time doesn't advance, but guest scheduler time does.  Ugh.  That particular
wart can be avoided by having the guest use TSC for sched_clock[*], e.g. so
that at least the behavior of time is consistent.
Hmm, if freezing guest time across suspend is indeed problematic, one thought
would be to put the onus on the VMM/user to not advertise a "nonstop TSC" if
the host may be suspending.  Then Linux-as-a-guest would prefer kvmclock over
TSC for both clocksource and sched_clock.

[*] https://lore.kernel.org/all/Z4gqlbumOFPF_rxd@xxxxxxxxxx