On Tue, Dec 07, 2021, Chris Murphy wrote:
> cc: qemu-devel
>
> Hi,
>
> I'm trying to help progress a very troublesome and so far elusive bug
> we're seeing in Fedora infrastructure. When running dozens of qemu-kvm
> VMs simultaneously, they eventually become unresponsive, as do any new
> processes we start while trying to extract information from the host
> about what's gone wrong.

Have you tried bisecting? IIUC, the issues showed up between v5.11 and
v5.12.12, so bisecting should be relatively straightforward (a sketch of
the bisect commands is at the end of this mail).

> Systems (Fedora openQA worker hosts) on kernel 5.12.12+ wind up in a
> state where forking does not work correctly, breaking most things:
> https://bugzilla.redhat.com/show_bug.cgi?id=2009585
>
> In subsequent testing, we used newer kernels with lockdep and other
> debug options enabled, and managed to capture a hung task with a bunch
> of locks listed, including ones held by kvm and qemu processes. But I
> can't parse it.
>
> 5.15-rc7
> https://bugzilla-attachments.redhat.com/attachment.cgi?id=1840941
> 5.15+
> https://bugzilla-attachments.redhat.com/attachment.cgi?id=1840939
>
> If anyone can take a glance at those kernel messages, and/or give
> hints on how we can extract more information for debugging, it'd be
> appreciated. Maybe all of that is normal and the actual problem isn't
> in any of these traces.

All the instances of

  (&vcpu->mutex){+.+.}-{3:3}, at: kvm_vcpu_ioctl+0x77/0x720 [kvm]

are uninteresting and expected; that's just each vCPU task taking its
associated vcpu->mutex, likely for KVM_RUN (see the excerpt below).

At a glance, the XFS stuff looks far more interesting/suspect.
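
To illustrate why those lockdep entries are expected: every vCPU ioctl
takes vcpu->mutex near the top of kvm_vcpu_ioctl(). A trimmed sketch of
that path, loosely based on virt/kvm/kvm_main.c (details vary by kernel
version, so treat this as illustrative rather than exact):

  static long kvm_vcpu_ioctl(struct file *filp,
                             unsigned int ioctl, unsigned long arg)
  {
          struct kvm_vcpu *vcpu = filp->private_data;
          long r;

          /*
           * Every vCPU ioctl, including KVM_RUN, serializes on the
           * per-vCPU mutex, so lockdep lists vcpu->mutex as held for
           * each vCPU task sitting in an ioctl, even when nothing
           * is wrong.
           */
          if (mutex_lock_killable(&vcpu->mutex))
                  return -EINTR;

          /* ... dispatch on ioctl (KVM_RUN, KVM_GET_REGS, etc.),
           * setting r to the result ... */

          mutex_unlock(&vcpu->mutex);
          return r;
  }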
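
As for the bisect, something along these lines should work from a
checkout of the stable tree, assuming the hang reproduces reliably
enough to call each kernel good or bad (commands are a sketch, adjust
to taste):

  $ git bisect start
  $ git bisect bad v5.12.12     # first kernel known to hang
  $ git bisect good v5.11       # last kernel known to be good
  # Build and boot the kernel git checks out, run the openQA/VM
  # workload, then record the result:
  $ git bisect good             # or "git bisect bad", as appropriate
  # Repeat until git reports the first bad commit, then clean up:
  $ git bisect reset

Each round halves the remaining range, so even the thousands of commits
between those tags should converge in on the order of 15 test kernels.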