Lenny Szubowicz <lszubowi@xxxxxxxxxx> writes: > Turn off host updates to the registered kvmclock memory > locations when transitioning to a hibernated kernel in > resume_target_kernel(). > > This is accomplished for secondary vcpus by disabling host > clock updates for that vcpu when it is put offline. For the > primary vcpu, it's accomplished by using the existing call back > from save_processor_state() to kvm_save_sched_clock_state(). > > The registered kvmclock memory locations may differ between > the currently running kernel and the hibernated kernel, which > is being restored and resumed. Kernel memory corruption is thus > possible if the host clock updates are allowed to run while the > hibernated kernel is relocated to its original physical memory > locations. > > This is similar to the problem solved for kexec by > commit 1e977aa12dd4 ("x86: KVM guest: disable clock before rebooting.") > > Commit 95a3d4454bb1 ("x86/kvmclock: Switch kvmclock data to a > PER_CPU variable") innocently increased the exposure for this > problem by dynamically allocating the physical pages that are > used for host clock updates when the vcpu count exceeds 64. > This increases the likelihood that the registered kvmclock > locations will differ for vcpus above 64. > > Reported-by: Xiaoyi Chen <cxiaoyi@xxxxxxxxxx> > Tested-by: Mohamed Aboubakr <mabouba@xxxxxxxxxx> > Signed-off-by: Lenny Szubowicz <lszubowi@xxxxxxxxxx> > --- > arch/x86/kernel/kvmclock.c | 34 ++++++++++++++++++++++++++++++++-- > 1 file changed, 32 insertions(+), 2 deletions(-) > > diff --git a/arch/x86/kernel/kvmclock.c b/arch/x86/kernel/kvmclock.c > index aa593743acf6..291ffca41afb 100644 > --- a/arch/x86/kernel/kvmclock.c > +++ b/arch/x86/kernel/kvmclock.c > @@ -187,8 +187,18 @@ static void kvm_register_clock(char *txt) > pr_info("kvm-clock: cpu %d, msr %llx, %s", smp_processor_id(), pa, txt); > } > > +/* > + * Turn off host clock updates to the registered memory location when the > + * cpu clock context is saved via save_processor_state(). Enables correct > + * handling of the primary cpu clock when transitioning to a hibernated > + * kernel in resume_target_kernel(), where the old and new registered > + * memory locations may differ. > + */ > static void kvm_save_sched_clock_state(void) > { > + native_write_msr(msr_kvm_system_time, 0, 0); > + kvm_disable_steal_time(); > + pr_info("kvm-clock: cpu %d, clock stopped", smp_processor_id()); > } Nitpick: should we rename kvm_save_sched_clock_state() to something more generic, like kvm_disable_host_clock_updates() to indicate, that what it does is not only sched clock related? > > static void kvm_restore_sched_clock_state(void) > @@ -311,9 +321,23 @@ static int kvmclock_setup_percpu(unsigned int cpu) > return p ? 0 : -ENOMEM; > } > > +/* > + * Turn off host clock updates to the registered memory location when a > + * cpu is placed offline. Enables correct handling of secondary cpu clocks > + * when transitioning to a hibernated kernel in resume_target_kernel(), > + * where the old and new registered memory locations may differ. > + */ > +static int kvmclock_cpu_offline(unsigned int cpu) > +{ > + native_write_msr(msr_kvm_system_time, 0, 0); > + pr_info("kvm-clock: cpu %d, clock stopped", cpu); I'd say this pr_info() is superfluous: on a system with hundereds of vCPUs users will get flooded with 'clock stopped' messages which don't actually mean much: in case native_write_msr() fails the error gets reported in dmesg anyway. I'd suggest we drop this and pr_info() in kvm_save_sched_clock_state(). > + return 0; Why don't we disable steal time accounting here? MSR_KVM_STEAL_TIME is also per-cpu. Can we merge kvm_save_sched_clock_state() with kvmclock_cpu_offline() maybe? > +} > + > void __init kvmclock_init(void) > { > u8 flags; > + int cpuhp_prepare; > > if (!kvm_para_available() || !kvmclock) > return; > @@ -325,8 +349,14 @@ void __init kvmclock_init(void) > return; > } > > - if (cpuhp_setup_state(CPUHP_BP_PREPARE_DYN, "kvmclock:setup_percpu", > - kvmclock_setup_percpu, NULL) < 0) { > + cpuhp_prepare = cpuhp_setup_state(CPUHP_BP_PREPARE_DYN, > + "kvmclock:setup_percpu", > + kvmclock_setup_percpu, NULL); > + if (cpuhp_prepare < 0) > + return; > + if (cpuhp_setup_state(CPUHP_AP_ONLINE_DYN, "kvmclock:cpu_offline", > + NULL, kvmclock_cpu_offline) < 0) { > + cpuhp_remove_state(cpuhp_prepare); In case we fail to set up kvmclock_cpu_offline() callback we get broken hybernation but without kvmclock_setup_percpu() call we get a broken guest (as kvmclock() may be the only reliable clocksource) so I'm not exactly sure we're active in a best way with cpuhp_remove_state() here. I don't feel strong though, I think it can stay but in that case I'd add a pr_warn() at least. > return; > } -- Vitaly