"Michael Kelley (LINUX)" <mikelley@xxxxxxxxxxxxx> writes:

> From: Vitaly Kuznetsov <vkuznets@xxxxxxxxxx> Sent: Thursday, November 3, 2022 12:06 PM
>>
>> Commit e5d9b714fe40 ("x86/hyperv: fix root partition faults when writing
>> to VP assist page MSR") moved 'wrmsrl(HV_X64_MSR_VP_ASSIST_PAGE)' under
>> the 'if (*hvp)' condition. This works for the root partition as
>> hv_cpu_die() does memunmap() and sets 'hv_vp_assist_page[cpu]' to NULL
>> but breaks non-root partitions as hv_cpu_die() doesn't free
>> 'hv_vp_assist_page[cpu]' for them.
>
> Did you consider having hv_cpu_die() free the VP assist page and
> set hv_vp_assist_page[cpu] to NULL in the non-root case?

Oh yes, I did, and I even wrote a patch (attached) but it failed in
testing :-( My testing was to run a CPU onlining/offlining loop and try
using KVM on the same CPU at the same time. I was able to see issues
when KVM could not reach the Enlightened VMCS because the VP assist
page was NULL.

> That would make the root and non-root cases more consistent, and it
> would make hv_cpu_init() and hv_cpu_die() more symmetrical. The
> hv_cpu_die() path frees up pretty much all the other per-CPU
> resources. I don't know why it keeps the VP assist page for re-use if
> the CPU comes back online later.
>
> You added the original code for allocating the vp_assist_page in
> commit a46d15cc1a, so maybe you remember if there are any
> gotchas. :-)

The root cause of the problem I observed seems to be the order of CPU
hotplug stages. Hyper-V uses the last CPUHP_AP_ONLINE_DYN stage while
KVM has its own dedicated CPUHP_AP_KVM_STARTING one, so we end up
freeing the VP assist page before turning KVM off on the CPU. It may be
sufficient to introduce a new stage for Hyper-V and put it before
KVM's.
It's in my backlog to explore this path but it may take me some time to
get back to it :-(

> Michael
>
>> This causes the VP assist page to remain unset after a CPU
>> offline/online cycle:
>>
>> $ rdmsr -p 24 0x40000073
>> 10212f001
>> $ echo 0 > /sys/devices/system/cpu/cpu24/online
>> $ echo 1 > /sys/devices/system/cpu/cpu24/online
>> $ rdmsr -p 24 0x40000073
>> 0
>>
>> Fix the issue by always writing to HV_X64_MSR_VP_ASSIST_PAGE in
>> hv_cpu_init(). Note: checking 'if (!*hvp)' for the root partition is
>> pointless as hv_cpu_die() always sets 'hv_vp_assist_page[cpu]' to
>> NULL (and it's also NULL initially).
>>
>> Note: the fact that 'hv_vp_assist_page[cpu]' is reset to NULL may
>> present a (potential) issue for KVM. While Hyper-V uses the
>> CPUHP_AP_ONLINE_DYN stage in CPU hotplug, KVM uses
>> CPUHP_AP_KVM_STARTING which comes earlier in the CPU teardown
>> sequence. It is theoretically possible that the Enlightened VMCS is
>> still in use. It is unclear if the issue is real and if using KVM
>> with a Hyper-V root partition is even possible.
>>
>> While at it, drop the unneeded smp_processor_id() call from
>> hv_cpu_init().
>>
>> Fixes: e5d9b714fe40 ("x86/hyperv: fix root partition faults when writing to VP assist page MSR")
>> Signed-off-by: Vitaly Kuznetsov <vkuznets@xxxxxxxxxx>
>> ---
>>  arch/x86/hyperv/hv_init.c | 54 +++++++++++++++++++--------------------
>>  1 file changed, 26 insertions(+), 28 deletions(-)
>>
>> diff --git a/arch/x86/hyperv/hv_init.c b/arch/x86/hyperv/hv_init.c
>> index f49bc3ec76e6..a269049a43ce 100644
>> --- a/arch/x86/hyperv/hv_init.c
>> +++ b/arch/x86/hyperv/hv_init.c
>> @@ -77,7 +77,7 @@ static int hyperv_init_ghcb(void)
>>  static int hv_cpu_init(unsigned int cpu)
>>  {
>>  	union hv_vp_assist_msr_contents msr = { 0 };
>> -	struct hv_vp_assist_page **hvp = &hv_vp_assist_page[smp_processor_id()];
>> +	struct hv_vp_assist_page **hvp = &hv_vp_assist_page[cpu];
>>  	int ret;
>>
>>  	ret = hv_common_cpu_init(cpu);
>> @@ -87,34 +87,32 @@ static int hv_cpu_init(unsigned int cpu)
>>  	if (!hv_vp_assist_page)
>>  		return 0;
>>
>> -	if (!*hvp) {
>> -		if (hv_root_partition) {
>> -			/*
>> -			 * For root partition we get the hypervisor provided VP assist
>> -			 * page, instead of allocating a new page.
>> -			 */
>> -			rdmsrl(HV_X64_MSR_VP_ASSIST_PAGE, msr.as_uint64);
>> -			*hvp = memremap(msr.pfn <<
>> -					HV_X64_MSR_VP_ASSIST_PAGE_ADDRESS_SHIFT,
>> -					PAGE_SIZE, MEMREMAP_WB);
>> -		} else {
>> -			/*
>> -			 * The VP assist page is an "overlay" page (see Hyper-V TLFS's
>> -			 * Section 5.2.1 "GPA Overlay Pages"). Here it must be zeroed
>> -			 * out to make sure we always write the EOI MSR in
>> -			 * hv_apic_eoi_write() *after* the EOI optimization is disabled
>> -			 * in hv_cpu_die(), otherwise a CPU may not be stopped in the
>> -			 * case of CPU offlining and the VM will hang.
>> -			 */
>> +	if (hv_root_partition) {
>> +		/*
>> +		 * For root partition we get the hypervisor provided VP assist
>> +		 * page, instead of allocating a new page.
>> +		 */
>> +		rdmsrl(HV_X64_MSR_VP_ASSIST_PAGE, msr.as_uint64);
>> +		*hvp = memremap(msr.pfn << HV_X64_MSR_VP_ASSIST_PAGE_ADDRESS_SHIFT,
>> +				PAGE_SIZE, MEMREMAP_WB);
>> +	} else {
>> +		/*
>> +		 * The VP assist page is an "overlay" page (see Hyper-V TLFS's
>> +		 * Section 5.2.1 "GPA Overlay Pages"). Here it must be zeroed
>> +		 * out to make sure we always write the EOI MSR in
>> +		 * hv_apic_eoi_write() *after* the EOI optimization is disabled
>> +		 * in hv_cpu_die(), otherwise a CPU may not be stopped in the
>> +		 * case of CPU offlining and the VM will hang.
>> +		 */
>> +		if (!*hvp)
>>  			*hvp = __vmalloc(PAGE_SIZE, GFP_KERNEL | __GFP_ZERO);
>> -			if (*hvp)
>> -				msr.pfn = vmalloc_to_pfn(*hvp);
>> -		}
>> -		WARN_ON(!(*hvp));
>> -		if (*hvp) {
>> -			msr.enable = 1;
>> -			wrmsrl(HV_X64_MSR_VP_ASSIST_PAGE, msr.as_uint64);
>> -		}
>> +		if (*hvp)
>> +			msr.pfn = vmalloc_to_pfn(*hvp);
>> +
>> +	}
>> +	if (!WARN_ON(!(*hvp))) {
>> +		msr.enable = 1;
>> +		wrmsrl(HV_X64_MSR_VP_ASSIST_PAGE, msr.as_uint64);
>>  	}
>>
>>  	return hyperv_init_ghcb();
>> --
>> 2.38.1
>

-- 
Vitaly
>From 9b9e30e3fc182f6a43efca0f93b5851392074a3a Mon Sep 17 00:00:00 2001
From: Vitaly Kuznetsov <vkuznets@xxxxxxxxxx>
Date: Thu, 3 Nov 2022 16:27:41 +0100
Subject: [PATCH] x86/hyperv: Free VP assist page from hv_cpu_die()
Content-Type: text/plain

Normally, 'hv_vp_assist_page[cpu]' points to the CPU's VP assist page
mapping. In case of a Hyper-V root partition, this is a 'memremap()' of
the PFN given by the hypervisor. In case of a non-root partition, it's
vmalloc()ed. When a CPU goes offline, hv_cpu_die() disables the VP
assist page by writing HV_X64_MSR_VP_ASSIST_PAGE and, in case of a root
partition, does memunmap(). For non-root partitions, the vmalloc()ed
page remains allocated and thus hv_cpu_init() has to check whether a
new allocation is needed. This is unnecessarily complicated.

Let's always free the page from hv_cpu_die() and allocate it back from
hv_cpu_init(). All VP assist page users have to be prepared for
'hv_vp_assist_page[cpu]' becoming NULL anyway, as that's what already
happens for the root partition.

The VP assist page has two users: KVM and APIC PV EOI. When a CPU goes
offline, there cannot be a running guest and thus KVM's use case should
be safe. As correctly noted in commit e320ab3cec7dd ("x86/hyper-v: Zero
out the VP ASSIST PAGE on allocation"), it is possible to see
interrupts after hv_cpu_die() and before the CPU is fully dead.
hv_apic_eoi_write() is, however, also prepared to see NULL in
'hv_vp_assist_page[smp_processor_id()]'. Moreover, checking a page
which is already unmapped from the hypervisor is incorrect in the first
place.

While at it, adjust VP assist page disabling a bit: always write to
HV_X64_MSR_VP_ASSIST_PAGE first and unmap/free the corresponding page
after, to make sure the hypervisor doesn't write to the already freed
memory in the interim.
Signed-off-by: Vitaly Kuznetsov <vkuznets@xxxxxxxxxx>
---
 arch/x86/hyperv/hv_init.c | 12 ++++++++----
 1 file changed, 8 insertions(+), 4 deletions(-)

diff --git a/arch/x86/hyperv/hv_init.c b/arch/x86/hyperv/hv_init.c
index a0165df3c4d8..74be6f145fc4 100644
--- a/arch/x86/hyperv/hv_init.c
+++ b/arch/x86/hyperv/hv_init.c
@@ -104,8 +104,7 @@ static int hv_cpu_init(unsigned int cpu)
 		 * in hv_cpu_die(), otherwise a CPU may not be stopped in the
 		 * case of CPU offlining and the VM will hang.
 		 */
-		if (!*hvp)
-			*hvp = __vmalloc(PAGE_SIZE, GFP_KERNEL | __GFP_ZERO);
+		*hvp = __vmalloc(PAGE_SIZE, GFP_KERNEL | __GFP_ZERO);
 		if (*hvp)
 			msr.pfn = vmalloc_to_pfn(*hvp);
 
@@ -233,12 +232,17 @@ static int hv_cpu_die(unsigned int cpu)
 			 * page here and nullify it, so that in future we have
 			 * correct page address mapped in hv_cpu_init.
 			 */
-			memunmap(hv_vp_assist_page[cpu]);
-			hv_vp_assist_page[cpu] = NULL;
 			rdmsrl(HV_X64_MSR_VP_ASSIST_PAGE, msr.as_uint64);
 			msr.enable = 0;
 		}
 		wrmsrl(HV_X64_MSR_VP_ASSIST_PAGE, msr.as_uint64);
+
+		if (hv_root_partition)
+			memunmap(hv_vp_assist_page[cpu]);
+		else
+			vfree(hv_vp_assist_page[cpu]);
+
+		hv_vp_assist_page[cpu] = NULL;
 	}
 
 	if (hv_reenlightenment_cb == NULL)
-- 
2.38.1