Re: [PATCH v2] x86/sev: Fix host kdump support for SNP

"Kalra, Ashish" <ashish.kalra@xxxxxxx> · Wed, 4 Sep 2024 16:31:03 -0500

Hello Mike,

On 9/4/2024 2:54 PM, Michael Roth wrote:
> On Wed, Sep 04, 2024 at 12:37:17PM -0500, Kalra, Ashish wrote:
>> Hello Paolo,
>>
>> On 9/4/2024 5:29 AM, Paolo Bonzini wrote:
>>> On 9/4/24 00:58, Kalra, Ashish wrote:
>>>> The issue here is that panic path will ensure that all (other) CPUs
>>>> have been shutdown via NMI by checking that they have executed the
>>>> NMI shutdown callback.
>>>>
>>>> But the above synchronization is specifically required for SNP case,
>>>> as we don't want to execute the SNP_DECOMMISSION command (to destroy
>>>> SNP guest context) while one or more CPUs are still in the NMI VMEXIT
>>>> path and still in the process of saving the vCPU state (and still
>>>> modifying SNP guest context?) during this VMEXIT path. Therefore, we
>>>> ensure that all the CPUs have saved the vCPU state and entered NMI
>>>> context before issuing SNP_DECOMMISSION. The point is that this is a
>>>> specific SNP requirement (and that's why this specific handling in
>>>> sev_emergency_disable()) and i don't know how we will be able to
>>>> enforce it in the generic panic path ?
>>> I think a simple way to do this is to _first_ kick out other
>>> CPUs through NMI, and then the one that is executing
>>> emergency_reboot_disable_virtualization().  This also makes
>>> emergency_reboot_disable_virtualization() and
>>> native_machine_crash_shutdown() more similar, in that
>>> the latter already stops other CPUs before disabling
>>> virtualization on the one that orchestrates the shutdown.
>>>
>>> Something like (incomplete, it has to also add the bool argument
>>> to cpu_emergency_virt_callback and the actual callbacks):
>>>
>>> diff --git a/arch/x86/kernel/crash.c b/arch/x86/kernel/crash.c
>>> index 340af8155658..3df25fbe969d 100644
>>> --- a/arch/x86/kernel/crash.c
>>> +++ b/arch/x86/kernel/crash.c
>>> @@ -111,7 +111,7 @@ void native_machine_crash_shutdown(struct pt_regs *regs)
>>>  
>>>      crash_smp_send_stop();
>>>  
>>> -    cpu_emergency_disable_virtualization();
>>> +    cpu_emergency_disable_virtualization(true);
>>>  
>>>      /*
>>>       * Disable Intel PT to stop its logging
>>> diff --git a/arch/x86/kernel/reboot.c b/arch/x86/kernel/reboot.c
>>> index 0e0a4cf6b5eb..7a86ec786987 100644
>>> --- a/arch/x86/kernel/reboot.c
>>> +++ b/arch/x86/kernel/reboot.c
>>> @@ -558,7 +558,7 @@ EXPORT_SYMBOL_GPL(cpu_emergency_unregister_virt_callback);
>>>   * reboot.  VMX blocks INIT if the CPU is post-VMXON, and SVM blocks INIT if
>>>   * GIF=0, i.e. if the crash occurred between CLGI and STGI.
>>>   */
>>> -void cpu_emergency_disable_virtualization(void)
>>> +void cpu_emergency_disable_virtualization(bool last)
>>>  {
>>>      cpu_emergency_virt_cb *callback;
>>>  
>>> @@ -572,7 +572,7 @@ void cpu_emergency_disable_virtualization(void)
>>>      rcu_read_lock();
>>>      callback = rcu_dereference(cpu_emergency_virt_callback);
>>>      if (callback)
>>> -        callback();
>>> +        callback(last);
>>>      rcu_read_unlock();
>>>  }
>>>  
>>> @@ -591,11 +591,11 @@ static void emergency_reboot_disable_virtualization(void)
>>>       * other CPUs may have virtualization enabled.
>>>       */
>>>      if (rcu_access_pointer(cpu_emergency_virt_callback)) {
>>> -        /* Safely force _this_ CPU out of VMX/SVM operation. */
>>> -        cpu_emergency_disable_virtualization();
>>> -
>>>          /* Disable VMX/SVM and halt on other CPUs. */
>>>          nmi_shootdown_cpus_on_restart();
>>> +
>>> +        /* Safely force _this_ CPU out of VMX/SVM operation. */
>>> +        cpu_emergency_disable_virtualization(true);
>>>      }
>>>  }
>>>  #else
>>> @@ -877,7 +877,7 @@ static int crash_nmi_callback(unsigned int val, struct pt_regs *regs)
>>>       * Prepare the CPU for reboot _after_ invoking the callback so that the
>>>       * callback can safely use virtualization instructions, e.g. VMCLEAR.
>>>       */
>>> -    cpu_emergency_disable_virtualization();
>>> +    cpu_emergency_disable_virtualization(false);
>>>  
>>>      atomic_dec(&waiting_for_crash_ipi);
>>>  
>>> diff --git a/arch/x86/kernel/smp.c b/arch/x86/kernel/smp.c
>>> index 18266cc3d98c..9a863348d1a7 100644
>>> --- a/arch/x86/kernel/smp.c
>>> +++ b/arch/x86/kernel/smp.c
>>> @@ -124,7 +124,7 @@ static int smp_stop_nmi_callback(unsigned int val, struct pt_regs *regs)
>>>      if (raw_smp_processor_id() == atomic_read(&stopping_cpu))
>>>          return NMI_HANDLED;
>>>  
>>> -    cpu_emergency_disable_virtualization();
>>> +    cpu_emergency_disable_virtualization(false);
>>>      stop_this_cpu(NULL);
>>>  
>>>      return NMI_HANDLED;
>>> @@ -136,7 +136,7 @@ static int smp_stop_nmi_callback(unsigned int val, struct pt_regs *regs)
>>>  DEFINE_IDTENTRY_SYSVEC(sysvec_reboot)
>>>  {
>>>      apic_eoi();
>>> -    cpu_emergency_disable_virtualization();
>>> +    cpu_emergency_disable_virtualization(false);
>>>      stop_this_cpu(NULL);
>>>  }
>>>  
>>>
>>> And then a second patch adds sev_emergency_disable() and only
>>> executes it if last == true.
>>>
>> This implementation will not work as we need to do wbinvd on all other CPUs after SNP_DECOMMISSION has been issued.
>>
>> When the last CPU executes sev_emergency_disable() and issues SNP_DECOMMISSION, by that time all other CPUs have entered the NMI halt loop and then they won't be able to do a wbinvd and hence SNP_SHUTDOWN will eventually fail.
>>
>> One way this can work is if all other CPUs can still execute sev_emergency_disable() and in case of last == false, do a spin busy till the last cpu kicks them out of the spin loop and then they do a wbinvd after exiting the spin busy, but then cpu_emergency_disable_virtualization() is/has to be called before atomic_dec(&waiting_for_crash_ipi) in crash_nmi_callback(), so this spin busy in other CPUs will force the last CPU to wait forever (or till it times out waiting for all other CPUs to execute the NMI callback) which makes all of this looks quite fragile.
> Hi Ashish,
>
> Your patch (and Paolo's suggested rework) came up in the PUCK call this
> morning and I mentioned this point. I was asked to raise some of the
> points here so we can continue the discussion on-list:
>
> Have we confirmed that WBINVD actually has to happen after the
> SNP_DECOMISSION call? Or do we just need to ensure that the WBINVD was
> issued after the last VMEXIT, and that the guest will never VMENTER
> again after the WBINVD?
>
> Because if WBINVD can be done prior to SNP_DECOMISSION, then Paolo was
> suggesting we could just have:
>
>   kvm_cpu_svm_disable() {
>     ...
>     WBINVD
>   }
>
>   cpu_emergency_disable_virtualization() {
>     kvm_cpu_svm_disable()
>   }

Something similar is already happening in crash_nmi_callback() :

static int crash_nmi_callback(unsigned int val, struct pt_regs *regs)
{
        ...
        if (shootdown_callback)
                shootdown_callback(cpu, regs);

        cpu_emergency_disable_virtualization();

        ...

shootdown_callback() -> kdump_nmi_callback() -> kdump_sev_callback() - > WBINVD

and that does not help and still causes SNP_SHUTDOWN_EX to fail after SNP_DECOMMISSION with DFFLUSH_REQUIRED error and then subsequently issuing DF_FLUSH command after SNP_SHUTDOWN failing with WBINVD_REQUIRED error.

>
>   smp_stop_nmi_callback() {
>     if (raw_smp_processor_id() == atomic_read(&stopping_cpu)) {
>       return NMI_HANDLED;
>     }
>     cpu_emergency_disable_virtualization()
>     return NMI_HANDLED
>   }
>
>
> The panic'ing CPU can then:
>
>   1) WBINVD itself (e.g. via cpu_emergency_disable_virtualization())
>   2) issue SNP_DECOMMISSION for all ASIDs
>
> That would avoid much of the additional synchronization handling since it
> fits more cleanly into existing panic flow. But it's contingent on
> whether WBINVD can happen before SNP_DECOMMISION or not so need to
> confirm that.

No, WBINVD needs to happen after SNP_DECOMMISSION has been issued, followed by a DF_FLUSH for SNP_SHUTDOWN_EX to succeed.

from the SNP FW ABI specs for SNP_DECOMMISSION command:

The firmware marks the ASID of the guest as not runnable. Then the firmware records that each CPU core on each of the CCXs that the guest was activated on requires a WBINVD followed by a single DF_FLUSH command to ensure that all unencrypted data in the caches are invalidated before reusing the ASID.

and from the SNP FW ABI specs for SNP_DF_FLUSH command:

8.13     SNP_DF_FLUSH command:

8.13.2     Actions

For each core marked for cache invalidation, the firmware checks that the core has executed a WBINVD instruction. If not, the firmware returns WBINVD_REQUIRED. The commands that mark cores for cache invalidation include SNP_DECOMMISSION ...

I understand this Actions section as saying WBINVD and DF_FLUSH would have to be after the decommission.

> Sean and Paolo also raised some other points (feel free to add if I
> missed anything):
>
>   - Since there's a lot of high-level design aspects that need to be
>     accounted for, it would be good to have the patch have some sort of
>     pseudocode/graph/etc so it's easier to reason about different
>     approaches. It would also be good to include this in the commit
>     message (generally it's encouraged to describe "why" rather than "how",
>     but this is an exceptional case where both are useful to reader).
>
>   - Sean inquired about making the target kdump kernel more agnostic to
>     whether or not SNP_SHUTDOWN was done properly, since that might
>     allow for capturing state even for edge cases where we can't go
>     through the normal cleanup path. I mentioned we'd tried this to some
>     degree but hit issues with the IOMMU, and when working around that
>     there was another issue but I don't quite recall the specifics.
>     Can you post a quick recap of what the issues are with that approach
>     so we can determine whether or not this is still an option?

Yes, i believe without SNP_SHUTDOWN, early_enable_iommus() configure the IOMMUs into an IRQ remapping configuration causing the crash in io_apic.c::check_timer().

It looks like in this case, we enable IRQ remapping configuration *earlier* than when it needs to be enabled and which causes the panic as indicated:

EMERGENCY [    1.376701] Kernel panic - not syncing: timer doesn't work through Interrupt-remapped IO-APIC

Next, we tried with amd_iommu=off, with that we don't get the irq remapping panic during crashkernel boot, but boot still hangs before starting kdump tools.

So eventually we discovered that irqremapping is required for x2apic and with amd_iommu=off we don't enable irqremapping at all.

And without irqremapping enabled, NVMe is not being detected and RootFS not getting mounted (during crashkernel boot):

...

[   67.902284] nvme nvme0: I/O tag 8 (0008) QID 0 timeout, disable controller
[   67.915403] nvme nvme1: I/O tag 12 (000c) QID 0 timeout, disable controller
[   67.928880] nvme nvme0: Identify Controller failed (-4)
[   67.940465] nvme nvme1: Identify Controller failed (-4)
[   67.951758] nvme 0000:41:00.0: probe with driver nvme failed with error -5
[   67.964865] nvme 0000:01:00.0: probe with driver nvme failed with error -5

...

Gave up waiting for root file system device.

...

Thanks, Ashish