On Tue, Dec 07, 2021, Chris Murphy wrote:
> cc: qemu-devel
>
> Hi,
>
> I'm trying to help progress a very troublesome and so far elusive bug
> we're seeing in Fedora infrastructure. When running dozens of qemu-kvm
> VMs simultaneously, they eventually become unresponsive, as do any new
> processes we start while trying to extract information from the host
> about what's gone wrong.

Have you tried bisecting? IIUC, the issues showed up between v5.11 and
v5.12.12, so bisecting should be relatively straightforward (a sketch of
the bisect commands is at the end of this mail).

> Systems (Fedora openQA worker hosts) on kernel 5.12.12+ wind up in a
> state where forking does not work correctly, breaking most things:
> https://bugzilla.redhat.com/show_bug.cgi?id=2009585
>
> In subsequent testing, we used newer kernels with lockdep and other
> debug options enabled, and managed to capture a hung task with a bunch
> of locks listed, including ones held by kvm and qemu processes. But I
> can't parse it.
>
> 5.15-rc7
> https://bugzilla-attachments.redhat.com/attachment.cgi?id=1840941
> 5.15+
> https://bugzilla-attachments.redhat.com/attachment.cgi?id=1840939
>
> If anyone can take a glance at those kernel messages, and/or give
> hints on how we can extract more information for debugging, it'd be
> appreciated. Maybe all of that is normal and the actual problem isn't
> in any of these traces.

All the instances of

  (&vcpu->mutex){+.+.}-{3:3}, at: kvm_vcpu_ioctl+0x77/0x720 [kvm]

are uninteresting and expected; that's just each vCPU task taking its
associated vcpu->mutex, likely for KVM_RUN (see the excerpt below).

At a glance, the XFS stuff looks far more interesting/suspect.
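
To illustrate why those lockdep entries are expected: every vCPU ioctl
takes vcpu->mutex near the top of kvm_vcpu_ioctl(). A trimmed sketch of
that path, loosely based on virt/kvm/kvm_main.c (details vary by kernel
version, so treat this as illustrative rather than exact):

  static long kvm_vcpu_ioctl(struct file *filp,
                             unsigned int ioctl, unsigned long arg)
  {
          struct kvm_vcpu *vcpu = filp->private_data;
          long r;

          /*
           * Every vCPU ioctl, including KVM_RUN, serializes on the
           * per-vCPU mutex, so lockdep lists vcpu->mutex as held for
           * each vCPU task sitting in an ioctl, even when nothing
           * is wrong.
           */
          if (mutex_lock_killable(&vcpu->mutex))
                  return -EINTR;

          /* ... dispatch on ioctl (KVM_RUN, KVM_GET_REGS, etc.),
           * setting r to the result ... */

          mutex_unlock(&vcpu->mutex);
          return r;
  }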
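
As for the bisect, something along these lines should work from a
checkout of the stable tree, assuming the hang reproduces reliably
enough to call each kernel good or bad (commands are a sketch, adjust
to taste):

  $ git bisect start
  $ git bisect bad v5.12.12     # first kernel known to hang
  $ git bisect good v5.11       # last kernel known to be good
  # Build and boot the kernel git checks out, run the openQA/VM
  # workload, then record the result:
  $ git bisect good             # or "git bisect bad", as appropriate
  # Repeat until git reports the first bad commit, then clean up:
  $ git bisect reset

Each round halves the remaining range, so even the thousands of commits
between those tags should converge in on the order of 15 test kernels.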