Re: [PATCH v2] x86/sev: Fix host kdump support for SNP

"Kalra, Ashish" <ashish.kalra@xxxxxxx> · Tue, 3 Sep 2024 17:58:44 -0500

Hello Sean,

On 9/3/2024 2:52 PM, Sean Christopherson wrote:
> On Tue, Sep 03, 2024, Ashish Kalra wrote:
>> From: Ashish Kalra <ashish.kalra@xxxxxxx>
>> [    1.671804] AMD-Vi: Using global IVHD EFR:0x841f77e022094ace, EFR2:0x0
>> [    1.679835] AMD-Vi: Translation is already enabled - trying to copy translation structures
>> [    1.689363] AMD-Vi: Copied DEV table from previous kernel.
>> [    1.864369] AMD-Vi: Completion-Wait loop timed out
>> [    2.038289] AMD-Vi: Completion-Wait loop timed out
>> [    2.212215] AMD-Vi: Completion-Wait loop timed out
>> [    2.386141] AMD-Vi: Completion-Wait loop timed out
>> [    2.560068] AMD-Vi: Completion-Wait loop timed out
>> [    2.733997] AMD-Vi: Completion-Wait loop timed out
>> [    2.907927] AMD-Vi: Completion-Wait loop timed out
>> [    3.081855] AMD-Vi: Completion-Wait loop timed out
>> [    3.225500] AMD-Vi: Completion-Wait loop timed out
>> [    3.231083] ..TIMER: vector=0x30 apic1=0 pin1=2 apic2=-1 pin2=-1
>> d out
>> [    3.579592] AMD-Vi: Completion-Wait loop timed out
>> [    3.753164] AMD-Vi: Completion-Wait loop timed out
>> [    3.815762] Kernel panic - not syncing: timer doesn't work through Interrupt-remapped IO-APIC
>> [    3.825347] CPU: 0 UID: 0 PID: 0 Comm: swapper/0 Not tainted 6.11.0-rc3-next-20240813-snp-host-f2a41ff576cc-dirty #61
>> [    3.837188] Hardware name: AMD Corporation ETHANOL_X/ETHANOL_X, BIOS RXM100AB 10/17/2022
>> [    3.846215] Call Trace:
>> [    3.848939]  <TASK>
>> [    3.851277]  dump_stack_lvl+0x2b/0x90
>> [    3.855354]  dump_stack+0x14/0x20
>> [    3.859050]  panic+0x3b9/0x400
>> [    3.862454]  panic_if_irq_remap+0x21/0x30
>> [    3.866925]  setup_IO_APIC+0x8aa/0xa50
>> [    3.871106]  ? __pfx_amd_iommu_enable_faulting+0x10/0x10
>> [    3.877032]  ? __cpuhp_setup_state+0x5e/0xd0
>> [    3.881793]  apic_intr_mode_init+0x6a/0xf0
>> [    3.886360]  x86_late_time_init+0x28/0x40
>> [    3.890832]  start_kernel+0x6a8/0xb50
>> [    3.894914]  x86_64_start_reservations+0x1c/0x30
>> [    3.900064]  x86_64_start_kernel+0xbf/0x110
>> [    3.904729]  ? setup_ghcb+0x12/0x130
>> [    3.908716]  common_startup_64+0x13e/0x141
>> [    3.913283]  </TASK>
>> [    3.915715] in panic
>> [    3.918149] in panic_other_cpus_shutdown
>> [    3.922523] ---[ end Kernel panic - not syncing: timer doesn't work through Interrupt-remapped IO-APIC ]---
>>
>> This happens as SNP_SHUTDOWN_EX fails
> Exactly what happens?  I.e. why does the PSP (sorry, ASP) need to be full shutdown
> in order to get a kdump?  The changelogs for the original SNP panic/kdump support
> are frustratingly unhelpful.  They all describe what the patch does, but say
> nothing about _why_.

If SNP_SHUTDOWN_EX is not issued, or more accurately if SNP_SHUTDOWN_EX is not issued with IOMMU_SNP_SHUTDOWN set to 1, i.e, to disable SNP enforcement by IOMMU, then IOMMUs are initialized in an IRQ remapping configuration by early_enable_iommus() during crashkernel boot, which causes the above crash in check_timer().

Also, SNP_SHUTDOWN_EX is required to tear down the RMP CPU/IOMMU TMRs.

Additionally, if IOMMU SNP_SHUTDOWN is not done and all pages associated with the IOMMU are not transitioned to Reclaim state (and then subsequently by HV to hypervisor state) there is a really high probability of a fatal RMP page fault due to collision with a new page allocation during kexec boot due to collision with one of these IOMMU pages.

So we really need SNP_SHUTDOWN_EX with IOMMU_SNP_SHUTDOWN=1 for kexec/crashkernel boot.

>
>> when SNP VMs are active as the firmware checks every encryption-capable ASID
>> to verify that it is not in use by a guest and a DF_FLUSH is not required. If
>> a DF_FLUSH is required, the firmware returns DFFLUSH_REQUIRED.
>>
>> To fix this, added support to do SNP_DECOMMISSION of all active SNP VMs
>> in the panic notifier before doing SNP_SHUTDOWN_EX, but then
>> SNP_DECOMMISSION tags all CPUs on which guest has been activated to do
>> a WBINVD. This causes SNP_DF_FLUSH command failure with the following
>> flow: SNP_DECOMMISSION -> SNP_SHUTDOWN_EX -> SNP_DF_FLUSH ->
>> failure with WBINVD_REQUIRED.
>>
>> When panic notifier is invoked all other CPUs have already been
>> shutdown, so it is not possible to do a wbinvd_on_all_cpus() after
>> SNP_DECOMMISSION has been executed. This eventually causes SNP_SHUTDOWN_EX
>> to fail after SNP_DECOMMISSION.
>>
>> Adding fix to do SNP_DECOMMISSION and subsequent WBINVD on all CPUs
>> during NMI shutdown of CPUs as part of disabling virtualization on
>> all CPUs via cpu_emergency_disable_virtualization ->
>> svm_emergency_disable().
>>
>> SNP_DECOMMISSION unbinds the ASID from SNP context and marks the ASID
>> as unusable and then transitions the SNP guest context page to a
>> firmware page and SNP_SHUTDOWN_EX transitions all pages associated
>> with the IOMMU to reclaim state which the hypervisor then transitions
>> to hypervisor state, all these page state changes are in the RMP
>> table, so there is no loss of guest data as such and the complete
>> host memory is captured with the crashkernel boot. There are no
>> processes which are being killed and host/guest memory is not
>> being altered or modified in any way.
>>
>> This fixes and enables crashkernel/kdump on SNP host.
> ...
>
>> diff --git a/arch/x86/kvm/svm/sev.c b/arch/x86/kvm/svm/sev.c
>> index 714c517dd4b7..30f286a3afb0 100644
>> --- a/arch/x86/kvm/svm/sev.c
>> +++ b/arch/x86/kvm/svm/sev.c
>> @@ -16,6 +16,7 @@
>>  #include <linux/psp-sev.h>
>>  #include <linux/pagemap.h>
>>  #include <linux/swap.h>
>> +#include <linux/delay.h>
>>  #include <linux/misc_cgroup.h>
>>  #include <linux/processor.h>
>>  #include <linux/trace_events.h>
>> @@ -89,6 +90,8 @@ static unsigned int nr_asids;
>>  static unsigned long *sev_asid_bitmap;
>>  static unsigned long *sev_reclaim_asid_bitmap;
>>  
>> +static DEFINE_SPINLOCK(snp_decommission_lock);
>> +static void **snp_asid_to_gctx_pages_map;
>>  static int snp_decommission_context(struct kvm *kvm);
>>  
>>  struct enc_region {
>> @@ -2248,6 +2251,9 @@ static int snp_launch_start(struct kvm *kvm, struct kvm_sev_cmd *argp)
>>  		goto e_free_context;
>>  	}
>>  
>> +	if (snp_asid_to_gctx_pages_map)
>> +		snp_asid_to_gctx_pages_map[sev_get_asid(kvm)] = sev->snp_context;
>> +
>>  	return 0;
>>  
>>  e_free_context:
>> @@ -2884,9 +2890,126 @@ static int snp_decommission_context(struct kvm *kvm)
>>  	snp_free_firmware_page(sev->snp_context);
>>  	sev->snp_context = NULL;
>>  
>> +	if (snp_asid_to_gctx_pages_map)
>> +		snp_asid_to_gctx_pages_map[sev_get_asid(kvm)] = NULL;
>> +
>>  	return 0;
>>  }
>>  
>> +static void __snp_decommission_all(void)
>> +{
>> +	struct sev_data_snp_addr data = {};
>> +	int ret, asid;
>> +
>> +	if (!snp_asid_to_gctx_pages_map)
>> +		return;
>> +
>> +	for (asid = 1; asid < min_sev_asid; asid++) {
>> +		if (snp_asid_to_gctx_pages_map[asid]) {
>> +			data.address = __sme_pa(snp_asid_to_gctx_pages_map[asid]);
> NULL pointer deref if this races with snp_decommission_context() from task
> context.

Yes this race needs to be handled.

I am looking at if i can do SNP_DECOMMISSION on all (SNP) ASIDs, and continue with the loop if SNP_DECOMMISSION fails with INVALID_GUEST error as the guest has already been decommissioned via snp_decommission_context().

>> +			ret = sev_do_cmd(SEV_CMD_SNP_DECOMMISSION, &data, NULL);
>> +			if (!ret) {
> And what happens if SEV_CMD_SNP_DECOMMISSION fails?

As mentioned above, will add additional handling for cases where INVALID_GUEST kind of errors are returned.

But, if errors like UPDATE_FAILED are returned, this will eventually cause SNP_SHUTDOWN failure and then lead to crashkernel boot failure as discussed above.

>
>> +				snp_free_firmware_page(snp_asid_to_gctx_pages_map[asid]);
>> +				snp_asid_to_gctx_pages_map[asid] = NULL;
>> +			}
>> +		}
>> +	}
>> +}
>> +
>> +/*
>> + * NOTE: called in NMI context from svm_emergency_disable().
>> + */
>> +void sev_emergency_disable(void)
>> +{
>> +	static atomic_t waiting_for_cpus_synchronized;
>> +	static bool synchronize_cpus_initiated;
>> +	static bool snp_decommission_handled;
>> +	static atomic_t cpus_synchronized;
>> +
>> +	if (!cc_platform_has(CC_ATTR_HOST_SEV_SNP))
>> +		return;
>> +
>> +	/*
>> +	 * SNP_SHUTDOWN_EX fails when SNP VMs are active as the firmware checks
> Define "active".
>
>> +	 * every encryption-capable ASID to verify that it is not in use by a
>> +	 * guest and a DF_FLUSH is not required. If a DF_FLUSH is required,
>> +	 * the firmware returns DFFLUSH_REQUIRED. To address this, SNP_DECOMMISSION
>> +	 * is required to shutdown all active SNP VMs, but SNP_DECOMMISSION tags all
>> +	 * CPUs that guest was activated on to do a WBINVD. When panic notifier
>> +	 * is invoked all other CPUs have already been shutdown, so it is not
>> +	 * possible to do a wbinvd_on_all_cpus() after SNP_DECOMMISSION has been
>> +	 * executed. This eventually causes SNP_SHUTDOWN_EX to fail after
>> +	 * SNP_DECOMMISSION. To fix this, do SNP_DECOMMISSION and subsequent WBINVD
>> +	 * on all CPUs during NMI shutdown of CPUs as part of disabling
>> +	 * virtualization on all CPUs via cpu_emergency_disable_virtualization().
>> +	 */
>> +
>> +	spin_lock(&snp_decommission_lock);
>> +
>> +	/*
>> +	 * exit early for call from native_machine_crash_shutdown()
>> +	 * as SNP_DECOMMISSION has already been done as part of
>> +	 * NMI shutdown of the CPUs.
>> +	 */
>> +	if (snp_decommission_handled) {
>> +		spin_unlock(&snp_decommission_lock);
>> +		return;
>> +	}
>> +
>> +	/*
>> +	 * Synchronize all CPUs handling NMI before issuing
>> +	 * SNP_DECOMMISSION.
>> +	 */
>> +	if (!synchronize_cpus_initiated) {
>> +		/*
>> +		 * one CPU handling panic, the other CPU is initiator for
>> +		 * CPU synchronization.
>> +		 */
>> +		atomic_set(&waiting_for_cpus_synchronized, num_online_cpus() - 2);
> And what happens when num_online_cpus() == 1?
Yes, will add fix for that.
>> +		synchronize_cpus_initiated = true;
>> +		/*
>> +		 * Ensure CPU synchronization parameters are setup before dropping
>> +		 * the lock to let other CPUs continue to reach synchronization.
>> +		 */
>> +		wmb();
>> +
>> +		spin_unlock(&snp_decommission_lock);
>> +
>> +		/*
>> +		 * This will not cause system to hang forever as the CPU
>> +		 * handling panic waits for maximum one second for
>> +		 * other CPUs to stop in nmi_shootdown_cpus().
>> +		 */
>> +		while (atomic_read(&waiting_for_cpus_synchronized) > 0)
>> +		       mdelay(1);
>> +
>> +		/* Reacquire the lock once CPUs are synchronized */
>> +		spin_lock(&snp_decommission_lock);
>> +
>> +		atomic_set(&cpus_synchronized, 1);
>> +	} else {
>> +		atomic_dec(&waiting_for_cpus_synchronized);
>> +		/*
>> +		 * drop the lock to let other CPUs contiune to reach
>> +		 * synchronization.
>> +		 */
>> +		spin_unlock(&snp_decommission_lock);
>> +
>> +		while (atomic_read(&cpus_synchronized) == 0)
>> +		       mdelay(1);
>> +
>> +		/* Try to re-acquire lock after CPUs are synchronized */
>> +		spin_lock(&snp_decommission_lock);
>> +	}
> Yeah, no.  This is horrific.  If the panic path doesn't provide the necessary
> infrastructure to ensure the necessary ordering between the initiating CPU and
> responding CPUs, then rework the panic path.

The issue here is that panic path will ensure that all (other) CPUs have been shutdown via NMI by checking that they have executed the NMI shutdown callback.

But the above synchronization is specifically required for SNP case, as we don't want to execute the SNP_DECOMMISSION command (to destroy SNP guest context) while one or more CPUs are still in the NMI VMEXIT path and still in the process of saving the vCPU state (and still modifying SNP guest context?) during this VMEXIT path. Therefore, we ensure that all the CPUs have saved the vCPU state and entered NMI context before issuing SNP_DECOMMISSION. The point is that this is a specific SNP requirement (and that's why this specific handling in sev_emergency_disable()) and i don't know how we will be able to enforce it in the generic panic path ?

Thanks, Ashish