> On Thu, 26 Sep 2024 18:22:39 -0700
> Eric Mackay <eric.mackay@xxxxxxxxxx> wrote:
>
> > On 9/24/24 5:40 AM, Igor Mammedov wrote:
> > >> On Fri, 19 Apr 2024 12:17:01 -0400
> > >> boris.ostrovsky@xxxxxxxxxx wrote:
> > >>
> > >>> On 4/17/24 9:58 AM, boris.ostrovsky@xxxxxxxxxx wrote:
> > >>>>
> > >>>> I noticed that I was using a few months old qemu bits and now I am
> > >>>> having trouble reproducing this on latest bits. Let me see if I can get
> > >>>> this to fail with latest first and then try to trace why the processor
> > >>>> is in this unexpected state.
> > >>>
> > >>> Looks like 012b170173bc "system/qdev-monitor: move drain_call_rcu call
> > >>> under if (!dev) in qmp_device_add()" is what makes the test stop failing.
> > >>>
> > >>> I need to understand whether the lack of failures is a side effect of
> > >>> timing changes that simply make hotplug failures less likely, or whether
> > >>> this is an actual (but seemingly unintentional) fix.
> > >>
> > >> Agreed, we should find the culprit of the problem.
> > >
> > > I haven't been able to spend much time on this unfortunately; Eric is
> > > now starting to look at this again.
> > >
> > > One of my theories was that ich9_apm_ctrl_changed() is sending SMIs to
> > > the vcpus serially, while on HW my understanding is that this is done as
> > > a broadcast, so I thought this could cause a race. I had a quick test
> > > with pausing and resuming all vcpus around the loop, but that didn't
> > > help.
> > >
> > >> PS:
> > >> Also, if you are using an AMD host, there was a regression in OVMF
> > >> where a vCPU that OSPM was already online-ing was yanked out from under
> > >> OSPM's feet by OVMF (which, depending on timing, could manifest as a
> > >> lost SIPI).
> > >>
> > >> The edk2 commit that should fix it is:
> > >> https://github.com/tianocore/edk2/commit/1c19ccd5103b
> > >>
> > >> Switching to an Intel host should rule that out at least.
> > >> (Or use the fixed edk2-ovmf-20240524-5.el10.noarch package from CentOS,
> > >> if you are forced to use an AMD host.)
> >
> > I haven't been able to reproduce the issue on an Intel host thus far,
> > but it may not be an apples-to-apples comparison because my AMD hosts
> > have a much higher core count.
> >
> > > I just tried with the latest bits that include this commit and was
> > > still able to reproduce the problem.
> > >
> > > -boris
> >
> > The initial hotplug of each CPU appears to complete from the
> > perspective of OVMF and OSPM. SMBASE relocation succeeds, and the new
> > CPU reports back from the pen. It seems to be the later INIT-SIPI-SIPI
> > sequence sent from the guest that doesn't complete.
> >
> > My working theory has been that some CPU/AP is lagging behind the others
> > while the BSP is waiting for all the APs to go into SMM, and the BSP just
> > gives up and moves on. Presumably the INIT-SIPI-SIPI is sent while that
> > CPU does finally go into SMM and the other CPUs are in normal mode.
> >
> > I've been able to observe that the SMI handler for the problematic CPU
> > will sometimes start running when no BSP is elected. This means we have
> > a window of time where that CPU will ignore SIPI, and at least one CPU
> > (the BSP) is in normal mode and capable of sending INIT-SIPI-SIPI from
> > the guest.
>
> I've re-read the whole thread and noticed Boris saying:
>
> On Tue, Apr 16, 2024 at 10:57 PM <boris.ostrovsky@xxxxxxxxxx> wrote:
> >
> > On 4/16/24 4:53 PM, Paolo Bonzini wrote:
> ...
> > >
> > > What is the reproducer for this?
> > Hotplugging/unplugging cpus in a loop, especially if you oversubscribe
> > the guest, will get you there in 10-15 minutes.
> ...
>
> So there was unplug involved as well, which has been broken since forever.
>
> The recent patch
> https://patchew.org/QEMU/20230427211013.2994127-1-alxndr@xxxxxx/20230427211013.2994127-2-alxndr@xxxxxx/
> has exposed an issue (an unexpected plug/unplug flow) with the root cause
> in OVMF. The firmware was letting uninvolved APs run wild in normal mode.
> As a result, the AP that was calling _EJ0 and holding the ACPI lock went
> on with _EJ0 and released the ACPI lock while the BSP and the CPU being
> removed were still in the SMM world. Any other plug/unplug op could then
> grab the ACPI lock and trigger another SMI, which breaks the hotplug
> flow's expectations (i.e. exclusive access to the hotplug registers
> during a plug/unplug op).
> Perhaps that's what you are observing.
>
> Please check whether the following helps:
> https://github.com/kraxel/edk2/commit/738c09f6b5ab87be48d754e62deb72b767415158

I haven't actually seen the guest crash during unplug, though certainly
there have been unplug failures. I haven't been keeping track of the
unplug failures as closely, but a test I ran over the weekend with this
patch added seemed to show fewer unplug failures. I'm still getting
hotplug failures that cause a guest crash though, so that mystery
remains.

> So yes, SIPI can be lost (which should be expected, as others noted),
> but that normally shouldn't be an issue since wakeup_secondary_cpu_via_init()
> does resend the SIPI.
> However, if wakeup_secondary_cpu is set to another handler that doesn't
> resend SIPI, it might be an issue.

We're using wakeup_secondary_cpu_via_init(). acpi_wakeup_cpu() and
wakeup_cpu_via_vmgexit(), for example, are a bit opaque to me, so I'm not
sure whether those code paths include a SIPI resend.
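
For reference, this is roughly the flow I understand
wakeup_secondary_cpu_via_init() to perform. It is a heavily abbreviated
paraphrase from my reading of arch/x86/kernel/smpboot.c, not the actual
kernel code; the function name and delay values below are placeholders.
The point is only that the STARTUP IPI goes out twice, so a single lost
SIPI normally gets another chance:

#include <asm/apic.h>     /* apic_icr_write(), APIC_DM_INIT, APIC_DM_STARTUP */
#include <linux/delay.h>  /* udelay() */

/* Illustrative paraphrase, not the real wakeup_secondary_cpu_via_init(). */
static void init_sipi_sipi_sketch(u32 apicid, unsigned long start_eip)
{
    /* INIT IPI: assert, wait a bit, then de-assert. */
    apic_icr_write(APIC_INT_LEVELTRIG | APIC_INT_ASSERT | APIC_DM_INIT, apicid);
    udelay(10000);                      /* placeholder delay */
    apic_icr_write(APIC_INT_LEVELTRIG | APIC_DM_INIT, apicid);

    /* STARTUP IPI, sent twice: if the first SIPI arrives while the target
     * is still ignoring SIPI, the resend gives it a second chance. */
    for (int j = 0; j < 2; j++) {
        apic_icr_write(APIC_DM_STARTUP | (start_eip >> 12), apicid);
        udelay(300);
    }
}

If one of the other wakeup_secondary_cpu handlers skips that second
STARTUP IPI, a SIPI lost in the window described above would not be
recovered, which is exactly the failure mode we seem to be chasing.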
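
And going back to Boris's earlier broadcast-vs-serial theory about
ich9_apm_ctrl_changed(): the delivery pattern in question looks roughly
like the sketch below. This is illustrative only, not the actual
hw/isa/lpc_ich9.c code (the function name is made up, the header paths
are approximate, and CPU_INTERRUPT_SMI comes from the x86 target
headers). The pause_all_vcpus()/resume_all_vcpus() bracket is the quick
experiment that was mentioned, which didn't help:

#include "qemu/osdep.h"
#include "hw/core/cpu.h"      /* CPUState, CPU_FOREACH(), cpu_interrupt() */
#include "sysemu/cpus.h"      /* pause_all_vcpus(), resume_all_vcpus() */

/* Illustrative sketch of the serial SMI delivery being discussed. */
static void apm_smi_serial_sketch(void)
{
    CPUState *cs;

    pause_all_vcpus();        /* experiment: stop every vCPU first... */
    CPU_FOREACH(cs) {
        /* The SMI is requested on one vCPU at a time rather than being
         * broadcast to all of them at once, as real hardware would do. */
        cpu_interrupt(cs, CPU_INTERRUPT_SMI);
    }
    resume_all_vcpus();       /* ...then let them all see the pending SMI */
}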