On 05/08/20 02:07, Paul E. McKenney wrote:
>
> We are seeing occasional odd hangs, but only in cases where guest OSes
> are being migrated. Migrating more often makes the hangs happen more
> frequently.
>
> Added debug showed that the hung CPU is stuck trying to send an IPI (e.g.,
> smp_call_function_single()). The hung CPU thinks that it has sent the
> IPI, but the destination CPU has interrupts enabled (-not- disabled,
> enabled, as in ready, willing, and able to take interrupts). In fact,
> the destination CPU usually is going about its business as if nothing
> was wrong, which makes me suspect that the IPI got lost somewhere along
> the way.
>
> I bumbled a bit through the qemu and KVM source, and didn't find anything
> synchronizing IPIs and migrations, though given that I know pretty much
> nothing about either qemu or KVM, this doesn't count for much.

The code migrating the interrupt controller is in
kvm_x86_ops.sync_pir_to_irr (which calls vmx_sync_pir_to_irr) and
kvm_apic_get_state. kvm_apic_get_state is called after CPUs are stopped.

It's possible that we're missing a kvm_x86_ops.sync_pir_to_irr call
somewhere. It would be surprising, but it would explain the symptoms very
well.

Paolo
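
To make the suspected failure mode concrete, here is a toy userspace model. It is only a sketch and is not KVM code: the toy_* names, the 32-bit masks, and the vector number are all made up for illustration. It assumes an IPI is first recorded in a posted-interrupt bitmap (PIR) and becomes part of the saved APIC state only once it has been merged into the IRR.

```c
#include <stdbool.h>
#include <stdint.h>

/* Toy model only; hypothetical names, not the KVM data structures. */
struct toy_vcpu {
    uint32_t pir; /* posted interrupts, not yet merged into irr */
    uint32_t irr; /* pending interrupts, the part that gets saved */
};

/* Sender side: record the IPI as posted on the destination vCPU. */
static void toy_post_ipi(struct toy_vcpu *v, unsigned int vec)
{
    v->pir |= UINT32_C(1) << vec;
}

/* Stand-in for a sync step: merge posted bits into irr and clear pir. */
static void toy_sync_pir_to_irr(struct toy_vcpu *v)
{
    v->irr |= v->pir;
    v->pir = 0;
}

/* Stand-in for saving state on the source; only irr is captured. */
static uint32_t toy_get_state(struct toy_vcpu *v, bool sync_first)
{
    if (sync_first)
        toy_sync_pir_to_irr(v);
    return v->irr;
}

/* Post an IPI, "migrate", and report whether it is still pending. */
static bool toy_ipi_survives(bool sync_first)
{
    const unsigned int vec = 5; /* arbitrary vector for the example */
    struct toy_vcpu src = { 0, 0 };
    struct toy_vcpu dst = { 0, 0 };

    toy_post_ipi(&src, vec);
    dst.irr = toy_get_state(&src, sync_first); /* restore on destination */
    return (dst.irr & (UINT32_C(1) << vec)) != 0;
}
```

In this model, toy_ipi_survives(true) finds the IPI still pending on the destination, while toy_ipi_survives(false) loses it: the sender believes the IPI was sent and the receiver never sees it, which is the shape of the reported hang.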