On Mon, 2021-11-29 at 20:18 +0100, Paolo Bonzini wrote:
> On 11/29/21 19:55, Sean Christopherson wrote:
> > > Still it does seem to be a race that happens when IS_RUNNING=true but
> > > vcpu->mode == OUTSIDE_GUEST_MODE. This patch makes the race easier to
> > > trigger because it moves IS_RUNNING=false later.
> >
> > Oh! Any chance the bug only repros with preemption enabled? That would explain
> > why I don't see problems, I'm pretty sure I've only run AVIC with a PREEMPT=n.
>
> Me too.
>
> > svm_vcpu_{un}blocking() are called with preemption enabled, and avic_set_running()
> > passes in vcpu->cpu. If the vCPU is preempted and scheduled in on a different CPU,
> > avic_vcpu_load() will overwrite the vCPU's entry with the wrong CPU info.
>
> That would make a lot of sense. avic_vcpu_load() can handle
> svm->avic_is_running = false, but avic_set_running still needs its body
> wrapped by preempt_disable/preempt_enable.
>
> Fedora's kernel is CONFIG_PREEMPT_VOLUNTARY, but I know Maxim uses his
> own build so it would not surprise me if he used CONFIG_PREEMPT=y.
>
> Paolo
>

I will write all the details tomorrow, but I strongly suspect CPU
erratum #1235:
https://developer.amd.com/wp-content/resources/56323-PUB_0.78.pdf

Basically, what I see is this:

1. vCPU2 disables is_running in the avic physical id cache.
2. vCPU2 checks that IRR is empty, and it is.
3. vCPU2 does schedule(), and it keeps on sleeping forever.

If I kick it via a signal (like just doing the 'info registers' qemu hmp
command, or just stop/cont on the same hmp interface), the vCPU wakes up
and notices that IRR suddenly is not empty, and the VM comes back to life
(and then hangs after a while again with the same problem...).

As far as I can see in the traces, the bit in IRR came from another vCPU
which didn't respect the is_running bit and didn't get an
AVIC_INCOMPLETE_IPI VMexit.
I can't 100% prove it yet, but everything in the trace shows this.
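To make the suspected sequence easier to follow, here is a toy,
single-threaded C model of it. This is only a sketch, not KVM code, and it
does not prove anything about the hardware: every name in it (toy_vcpu,
toy_block, toy_send_ipi, toy_lost_wakeup) is hypothetical, and the "stale"
case simply assumes that the sender acts on an old copy of is_running.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Toy model only; none of these names exist in KVM. */
struct toy_vcpu {
	bool is_running;  /* models is_running in the physical id entry */
	bool irr_pending; /* models a pending bit in the target's IRR */
	bool sleeping;    /* models the vCPU thread blocked in schedule() */
};

/* Steps 1-3 above. Returns true if the vCPU went to sleep. */
bool toy_block(struct toy_vcpu *v)
{
	assert(v != NULL);
	v->is_running = false;      /* 1. clear is_running */
	if (v->irr_pending)         /* 2. check that IRR is empty */
		return false;
	v->sleeping = true;         /* 3. schedule() */
	return true;
}

/*
 * Sender side of an IPI. seen_is_running is the value the sender acts on,
 * which (by assumption) may be stale.
 */
void toy_send_ipi(struct toy_vcpu *target, bool seen_is_running)
{
	assert(target != NULL);
	target->irr_pending = true;
	if (!seen_is_running) {
		/* models AVIC_INCOMPLETE_IPI: the host wakes the target */
		target->sleeping = false;
	}
	/* otherwise only a doorbell is rung; a sleeping vCPU never sees it */
}

/* Returns true if the target ends up asleep with a pending interrupt. */
bool toy_lost_wakeup(bool sender_sees_stale_value)
{
	struct toy_vcpu v = { .is_running = true };
	bool stale = v.is_running;  /* sender's snapshot taken before step 1 */

	toy_block(&v);
	toy_send_ipi(&v, sender_sees_stale_value ? stale : v.is_running);
	return v.sleeping && v.irr_pending;
}
```

In this model toy_lost_wakeup(false) leaves the target awake, while
toy_lost_wakeup(true) leaves it asleep with IRR set, which is the hang
described above; a later kick (a signal, in the real case) is what would
let it notice the pending bit.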

About the rest of the environment: currently I reproduce this in a VM
which has no PCI passed-through devices at all, just AVIC.
(I wasn't able to reproduce it before just because I forgot to enable
AVIC in this configuration.)

So I also agree that Sean's patch is not to blame here; it just made the
window between clearing is_running and getting to sleep shorter, and made
it less likely that other vCPUs will pick up the is_running change.
(I suspect that they pick it up on the next vmrun, and otherwise the value
is somehow wrongly cached in them.)

A very performance-killing workaround of kicking all vCPUs when one of
them enters vcpu_block does seem to work for me, but it throws all the
timing off, so I can't prove it.

That is all. I will write more detailed info, including some traces I
have.

I do use Windows 10 with the so-called LatencyMon in it, which shows
overall how much latency hardware interrupts have. It used to be useful
for me to ensure that my VMs are suitable for RT-like latency (even before
I joined Red Hat, I tuned my VMs as much as I could to make my Rift CV1 VR
headset work well, which needs RT-like latencies. These days VR works fine
in my VMs anyway, but I still kept this tool to keep an eye on it).

I really need to write a kvm unit test to stress test IPIs, especially
this case. I will do this very soon.

Wei Huang, any info on this would be very helpful.

Maybe putting the avic physical table in UC memory would help?
Maybe ringing the doorbells of all other vCPUs will help them notice the
change?

Best regards,
	Maxim Levitsky