On Tue, Sep 03, 2024, Ashish Kalra wrote:
> From: Ashish Kalra <ashish.kalra@xxxxxxx>
>
> [ 1.671804] AMD-Vi: Using global IVHD EFR:0x841f77e022094ace, EFR2:0x0
> [ 1.679835] AMD-Vi: Translation is already enabled - trying to copy translation structures
> [ 1.689363] AMD-Vi: Copied DEV table from previous kernel.
> [ 1.864369] AMD-Vi: Completion-Wait loop timed out
> [ 2.038289] AMD-Vi: Completion-Wait loop timed out
> [ 2.212215] AMD-Vi: Completion-Wait loop timed out
> [ 2.386141] AMD-Vi: Completion-Wait loop timed out
> [ 2.560068] AMD-Vi: Completion-Wait loop timed out
> [ 2.733997] AMD-Vi: Completion-Wait loop timed out
> [ 2.907927] AMD-Vi: Completion-Wait loop timed out
> [ 3.081855] AMD-Vi: Completion-Wait loop timed out
> [ 3.225500] AMD-Vi: Completion-Wait loop timed out
> [ 3.231083] ..TIMER: vector=0x30 apic1=0 pin1=2 apic2=-1 pin2=-1
> d out
> [ 3.579592] AMD-Vi: Completion-Wait loop timed out
> [ 3.753164] AMD-Vi: Completion-Wait loop timed out
> [ 3.815762] Kernel panic - not syncing: timer doesn't work through Interrupt-remapped IO-APIC
> [ 3.825347] CPU: 0 UID: 0 PID: 0 Comm: swapper/0 Not tainted 6.11.0-rc3-next-20240813-snp-host-f2a41ff576cc-dirty #61
> [ 3.837188] Hardware name: AMD Corporation ETHANOL_X/ETHANOL_X, BIOS RXM100AB 10/17/2022
> [ 3.846215] Call Trace:
> [ 3.848939]  <TASK>
> [ 3.851277]  dump_stack_lvl+0x2b/0x90
> [ 3.855354]  dump_stack+0x14/0x20
> [ 3.859050]  panic+0x3b9/0x400
> [ 3.862454]  panic_if_irq_remap+0x21/0x30
> [ 3.866925]  setup_IO_APIC+0x8aa/0xa50
> [ 3.871106]  ? __pfx_amd_iommu_enable_faulting+0x10/0x10
> [ 3.877032]  ? __cpuhp_setup_state+0x5e/0xd0
> [ 3.881793]  apic_intr_mode_init+0x6a/0xf0
> [ 3.886360]  x86_late_time_init+0x28/0x40
> [ 3.890832]  start_kernel+0x6a8/0xb50
> [ 3.894914]  x86_64_start_reservations+0x1c/0x30
> [ 3.900064]  x86_64_start_kernel+0xbf/0x110
> [ 3.904729]  ? setup_ghcb+0x12/0x130
> [ 3.908716]  common_startup_64+0x13e/0x141
> [ 3.913283]  </TASK>
> [ 3.915715] in panic
> [ 3.918149] in panic_other_cpus_shutdown
> [ 3.922523] ---[ end Kernel panic - not syncing: timer doesn't work through Interrupt-remapped IO-APIC ]---
>
> This happens as SNP_SHUTDOWN_EX fails

Exactly what happens?  I.e. why does the PSP (sorry, ASP) need to be fully
shut down in order to get a kdump?  The changelogs for the original SNP
panic/kdump support are frustratingly unhelpful.  They all describe what
the patch does, but say nothing about _why_.

> when SNP VMs are active, as the firmware checks every encryption-capable
> ASID to verify that it is not in use by a guest and that a DF_FLUSH is
> not required. If a DF_FLUSH is required, the firmware returns
> DFFLUSH_REQUIRED.
>
> To fix this, add support to do SNP_DECOMMISSION of all active SNP VMs
> in the panic notifier before doing SNP_SHUTDOWN_EX. However,
> SNP_DECOMMISSION tags all CPUs on which the guest has been activated to
> do a WBINVD. This causes SNP_DF_FLUSH command failure with the following
> flow: SNP_DECOMMISSION -> SNP_SHUTDOWN_EX -> SNP_DF_FLUSH ->
> failure with WBINVD_REQUIRED.
>
> When the panic notifier is invoked, all other CPUs have already been
> shut down, so it is not possible to do a wbinvd_on_all_cpus() after
> SNP_DECOMMISSION has been executed. This eventually causes
> SNP_SHUTDOWN_EX to fail after SNP_DECOMMISSION.
>
> Fix this by doing SNP_DECOMMISSION and the subsequent WBINVD on all
> CPUs during NMI shutdown of CPUs as part of disabling virtualization
> on all CPUs via cpu_emergency_disable_virtualization() ->
> svm_emergency_disable().
>
> SNP_DECOMMISSION unbinds the ASID from the SNP context, marks the ASID
> as unusable, and then transitions the SNP guest context page to a
> firmware page. SNP_SHUTDOWN_EX transitions all pages associated with
> the IOMMU to the reclaim state, which the hypervisor then transitions
> to the hypervisor state. All of these page state changes are in the
> RMP table, so there is no loss of guest data as such, and the complete
> host memory is captured with the crashkernel boot. No processes are
> being killed, and host/guest memory is not being altered or modified
> in any way.
>
> This fixes and enables crashkernel/kdump on SNP host.

...

> diff --git a/arch/x86/kvm/svm/sev.c b/arch/x86/kvm/svm/sev.c
> index 714c517dd4b7..30f286a3afb0 100644
> --- a/arch/x86/kvm/svm/sev.c
> +++ b/arch/x86/kvm/svm/sev.c
> @@ -16,6 +16,7 @@
>  #include <linux/psp-sev.h>
>  #include <linux/pagemap.h>
>  #include <linux/swap.h>
> +#include <linux/delay.h>
>  #include <linux/misc_cgroup.h>
>  #include <linux/processor.h>
>  #include <linux/trace_events.h>
> @@ -89,6 +90,8 @@ static unsigned int nr_asids;
>  static unsigned long *sev_asid_bitmap;
>  static unsigned long *sev_reclaim_asid_bitmap;
>
> +static DEFINE_SPINLOCK(snp_decommission_lock);
> +static void **snp_asid_to_gctx_pages_map;
>  static int snp_decommission_context(struct kvm *kvm);
>
>  struct enc_region {
> @@ -2248,6 +2251,9 @@ static int snp_launch_start(struct kvm *kvm, struct kvm_sev_cmd *argp)
>  		goto e_free_context;
>  	}
>
> +	if (snp_asid_to_gctx_pages_map)
> +		snp_asid_to_gctx_pages_map[sev_get_asid(kvm)] = sev->snp_context;
> +
>  	return 0;
>
>  e_free_context:
> @@ -2884,9 +2890,126 @@ static int snp_decommission_context(struct kvm *kvm)
>  	snp_free_firmware_page(sev->snp_context);
>  	sev->snp_context = NULL;
>
> +	if (snp_asid_to_gctx_pages_map)
> +		snp_asid_to_gctx_pages_map[sev_get_asid(kvm)] = NULL;
> +
>  	return 0;
>  }
>
> +static void __snp_decommission_all(void)
> +{
> +	struct sev_data_snp_addr data = {};
> +	int ret, asid;
> +
> +	if (!snp_asid_to_gctx_pages_map)
> +		return;
> +
> +	for (asid = 1; asid < min_sev_asid; asid++) {
> +		if (snp_asid_to_gctx_pages_map[asid]) {
> +			data.address = __sme_pa(snp_asid_to_gctx_pages_map[asid]);

NULL pointer deref if this races with snp_decommission_context() from
task context.

> +			ret = sev_do_cmd(SEV_CMD_SNP_DECOMMISSION, &data, NULL);
> +			if (!ret) {

And what happens if SEV_CMD_SNP_DECOMMISSION fails?

> +				snp_free_firmware_page(snp_asid_to_gctx_pages_map[asid]);
> +				snp_asid_to_gctx_pages_map[asid] = NULL;
> +			}
> +		}
> +	}
> +}
> +
> +/*
> + * NOTE: called in NMI context from svm_emergency_disable().
> + */
> +void sev_emergency_disable(void)
> +{
> +	static atomic_t waiting_for_cpus_synchronized;
> +	static bool synchronize_cpus_initiated;
> +	static bool snp_decommission_handled;
> +	static atomic_t cpus_synchronized;
> +
> +	if (!cc_platform_has(CC_ATTR_HOST_SEV_SNP))
> +		return;
> +
> +	/*
> +	 * SNP_SHUTDOWN_EX fails when SNP VMs are active as the firmware checks

Define "active".

> +	 * every encryption-capable ASID to verify that it is not in use by a
> +	 * guest and a DF_FLUSH is not required. If a DF_FLUSH is required,
> +	 * the firmware returns DFFLUSH_REQUIRED. To address this, SNP_DECOMMISSION
> +	 * is required to shutdown all active SNP VMs, but SNP_DECOMMISSION tags all
> +	 * CPUs that guest was activated on to do a WBINVD. When panic notifier
> +	 * is invoked all other CPUs have already been shutdown, so it is not
> +	 * possible to do a wbinvd_on_all_cpus() after SNP_DECOMMISSION has been
> +	 * executed. This eventually causes SNP_SHUTDOWN_EX to fail after
> +	 * SNP_DECOMMISSION. To fix this, do SNP_DECOMMISSION and subsequent WBINVD
> +	 * on all CPUs during NMI shutdown of CPUs as part of disabling
> +	 * virtualization on all CPUs via cpu_emergency_disable_virtualization().
> +	 */
> +
> +	spin_lock(&snp_decommission_lock);
> +
> +	/*
> +	 * Exit early for the call from native_machine_crash_shutdown(),
> +	 * as SNP_DECOMMISSION has already been done as part of the
> +	 * NMI shutdown of the CPUs.
> +	 */
> +	if (snp_decommission_handled) {
> +		spin_unlock(&snp_decommission_lock);
> +		return;
> +	}
> +
> +	/*
> +	 * Synchronize all CPUs handling NMI before issuing
> +	 * SNP_DECOMMISSION.
> +	 */
> +	if (!synchronize_cpus_initiated) {
> +		/*
> +		 * One CPU is handling panic, the other CPU is the
> +		 * initiator for CPU synchronization.
> +		 */
> +		atomic_set(&waiting_for_cpus_synchronized, num_online_cpus() - 2);

And what happens when num_online_cpus() == 1?

> +		synchronize_cpus_initiated = true;
> +		/*
> +		 * Ensure CPU synchronization parameters are set up before
> +		 * dropping the lock to let other CPUs continue to reach
> +		 * synchronization.
> +		 */
> +		wmb();
> +
> +		spin_unlock(&snp_decommission_lock);
> +
> +		/*
> +		 * This will not cause the system to hang forever, as the
> +		 * CPU handling panic waits a maximum of one second for the
> +		 * other CPUs to stop in nmi_shootdown_cpus().
> +		 */
> +		while (atomic_read(&waiting_for_cpus_synchronized) > 0)
> +			mdelay(1);
> +
> +		/* Reacquire the lock once CPUs are synchronized. */
> +		spin_lock(&snp_decommission_lock);
> +
> +		atomic_set(&cpus_synchronized, 1);
> +	} else {
> +		atomic_dec(&waiting_for_cpus_synchronized);
> +		/*
> +		 * Drop the lock to let other CPUs continue to reach
> +		 * synchronization.
> +		 */
> +		spin_unlock(&snp_decommission_lock);
> +
> +		while (atomic_read(&cpus_synchronized) == 0)
> +			mdelay(1);
> +
> +		/* Try to re-acquire the lock after CPUs are synchronized. */
> +		spin_lock(&snp_decommission_lock);
> +	}

Yeah, no.  This is horrific.  If the panic path doesn't provide the
necessary infrastructure to ensure the necessary ordering between the
initiating CPU and responding CPUs, then rework the panic path.