Re: [PATCH] KVM/x86: Do not clear SIPI while in SMM

boris.ostrovsky@xxxxxxxxxx · Tue, 24 Sep 2024 17:59:39 -0400

On 9/24/24 5:40 AM, Igor Mammedov wrote:
On Fri, 19 Apr 2024 12:17:01 -0400
boris.ostrovsky@xxxxxxxxxx wrote:

On 4/17/24 9:58 AM, boris.ostrovsky@xxxxxxxxxx wrote:

I noticed that I was using a few months old qemu bits and now I am
having trouble reproducing this on latest bits. Let me see if I can get
this to fail with latest first and then try to trace why the processor
is in this unexpected state.

Looks like 012b170173bc "system/qdev-monitor: move drain_call_rcu call
under if (!dev) in qmp_device_add()" is what makes the test stop failing.

I need to understand whether the lack of failures is a side effect of timing
changes that simply make hotplug failures less likely or whether this is an
actual (but seemingly unintentional) fix.

Agreed, we should find the culprit of the problem.

I haven't been able to spend much time on this, unfortunately. Eric is
now starting to look at this again.

One of my theories was that ich9_apm_ctrl_changed() sends SMIs to
vcpus serially, while on HW my understanding is that this is done as a
broadcast, so I thought this could cause a race. I ran a quick test that
paused and resumed all vcpus around the loop, but that didn't help.

PS:
also, if you are using an AMD host, there was a regression in OVMF
where a vCPU that OSPM was already onlining was yanked
from under OSPM's feet by OVMF (which, depending on timing, could
manifest as a lost SIPI).

edk2 commit that should fix it is:
     https://github.com/tianocore/edk2/commit/1c19ccd5103b

Switching to Intel host should rule that out at least.
(or use the fixed edk2-ovmf-20240524-5.el10.noarch package from CentOS,
if you are forced to use an AMD host)

I just tried with the latest bits, which include this commit, and was
still able to reproduce the problem.

-boris