On Tue, Sep 22, 2015 at 06:33:46PM +0200, Radim Krčmář wrote: > PVCLOCK_COUNTS_FROM_ZERO broke ABI and (at least) three things with it. > All problems stem from repeated writes to MSR_KVM_SYSTEM_TIME(_NEW). > The reverted patch treated the MSR write as a one-shot initializer: > any write from VCPU 0 would reset system_time. > > And this is what broke for Linux guests: > * Onlining/hotplugging of VCPU 0 > > VCPU has to assign an address to KVM clock before use, which is done > with MSR_KVM_SYSTEM_TIME_NEW. Linux has an idea that time should not > jump backward, so any `sleep` won't return before |system_time at the > point of offline| elapses since the online. Be sure to run ntp. > > * S3 and S4 resume > > If you don't have PVCLOCK_TSC_STABLE_BIT in Linux, resume will freeze > for |system_time at the point of suspend|, because pvclock ensures > monoticity and kvmclock did not think about it. > > If you have stable clock, execution will resume immediately, but > restoring KVM clock writes to the MSR and dmesg starts to count from > zero. It's better than the onlining, but not what we want either. > > * Boot of SLES 10 guest > > SLES 10 has a custom implementation of kvm clock that calls > MSR_KVM_SYSTEM_TIME before every read to enhance precision ... > Two things are happening at the same time: > 1) The guest periodically receives an interrupt that is handled by > main_timer_handler(): > a) get time using the kvm clock: > 1) write the address to MSR_KVM_SYSTEM_TIME > 2) read tsc and pvclock (tsc_offset, system_time) > 3) time = tsc - tsc_offset + system_time > b) compute time since the last main_timer_handler() > c) bump jiffies if enough time has elapsed > 2) the guest wants to calibrate loops per jiffy [1]: > a) read tsc > b) loop till jiffies increase > c) compute lpj > > Because (1a1) always resets the system_time to 0, we read the same > value over and over so the condition for (1c) is never true and > jiffies remain constant. A hang happens in (2b) as it is the first > place that depends on jiffies. > > > We could make hypervisor workaround for this, but that is just asking > for more trouble. Luckily, reverting does not break to guests that > learned about PVCLOCK_COUNTS_FROM_ZERO, in new ways. > Only 4.2+ guests with NOHZ_FULL wanted PVCLOCK_COUNTS_FROM_ZERO, which > is a good trade-off for not regressing. > > This reverts commit b7e60c5aedd2b63f16ef06fde4f81ca032211bc5. > And adds a note to the definition of PVCLOCK_COUNTS_FROM_ZERO. > > Cc: stable@xxxxxxxxxxxxxxx > Signed-off-by: Radim Krčmář <rkrcmar@xxxxxxxxxx> > --- > v1: Extended commit message based on a discussion with Marcelo > > arch/x86/include/asm/pvclock-abi.h | 1 + > arch/x86/kvm/x86.c | 4 ---- > 2 files changed, 1 insertion(+), 4 deletions(-) > > diff --git a/arch/x86/include/asm/pvclock-abi.h b/arch/x86/include/asm/pvclock-abi.h > index 655e07a48f6c..67f08230103a 100644 > --- a/arch/x86/include/asm/pvclock-abi.h > +++ b/arch/x86/include/asm/pvclock-abi.h > @@ -41,6 +41,7 @@ struct pvclock_wall_clock { > > #define PVCLOCK_TSC_STABLE_BIT (1 << 0) > #define PVCLOCK_GUEST_STOPPED (1 << 1) > +/* PVCLOCK_COUNTS_FROM_ZERO broke ABI and can't be used anymore. */ > #define PVCLOCK_COUNTS_FROM_ZERO (1 << 2) > #endif /* __ASSEMBLY__ */ > #endif /* _ASM_X86_PVCLOCK_ABI_H */ > diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c > index 4bca39f0fdb3..71731994d897 100644 > --- a/arch/x86/kvm/x86.c > +++ b/arch/x86/kvm/x86.c > @@ -1711,8 +1711,6 @@ static int kvm_guest_time_update(struct kvm_vcpu *v) > vcpu->pvclock_set_guest_stopped_request = false; > } > > - pvclock_flags |= PVCLOCK_COUNTS_FROM_ZERO; > - > /* If the host uses TSC clocksource, then it is stable */ > if (use_master_clock) > pvclock_flags |= PVCLOCK_TSC_STABLE_BIT; > @@ -2010,8 +2008,6 @@ int kvm_set_msr_common(struct kvm_vcpu *vcpu, struct msr_data *msr_info) > &vcpu->requests); > > ka->boot_vcpu_runs_old_kvmclock = tmp; > - > - ka->kvmclock_offset = -get_kernel_ns(); > } > > vcpu->arch.time = data; > -- > 2.5.3 NACK, please use original patchset. -- To unsubscribe from this list: send the line "unsubscribe stable" in the body of a message to majordomo@xxxxxxxxxxxxxxx More majordomo info at http://vger.kernel.org/majordomo-info.html