On (21/05/22 00:56), Sergey Senozhatsky wrote:
> Soft watchdog timer function checks if a virtual machine
> was suspended and hence what looks like a lockup in fact
> is a false positive.
>
> This is what kvm_check_and_clear_guest_paused() does: it
> tests guest PVCLOCK_GUEST_STOPPED (which is set by the host)
> and if it's set then we need to touch all watchdogs and bail
> out.
>
> Watchdog timer function runs from IRQ, so PVCLOCK_GUEST_STOPPED
> check works fine.
>
> There is, however, one more watchdog that runs from IRQ, so
> watchdog timer fn races with it, and that watchdog is not aware
> of PVCLOCK_GUEST_STOPPED - RCU stall detector.
>
> apic_timer_interrupt()
>  smp_apic_timer_interrupt()
>   hrtimer_interrupt()
>    __hrtimer_run_queues()
>     tick_sched_timer()
>      tick_sched_handle()
>       update_process_times()
>        rcu_sched_clock_irq()
>
> This triggers RCU stalls on our devices during VM resume.
>
> If tick_sched_handle()->rcu_sched_clock_irq() runs on a VCPU
> before watchdog_timer_fn()->kvm_check_and_clear_guest_paused()
> then there is nothing on this VCPU that touches watchdogs and
> RCU reads stale gp stall timestamp and new jiffies value, which
> makes it think that RCU has stalled.
>
> Make RCU stall watchdog aware of PVCLOCK_GUEST_STOPPED and
> don't report RCU stalls when we resume the VM.

Hello Paul,

I've noticed that this patch set didn't make it to Linus's tree.
Was it intentional?