VMs getting into stuck states since kernel ~5.13

Chris Murphy <lists@xxxxxxxxxxxxxxxxx> · Wed, 8 Dec 2021 13:54:02 -0500

Hi,

I'm trying to help make progress on a kernel regression hitting Fedora
infrastructure, where dozens of VMs run concurrently to execute QA
testing. The problem doesn't happen immediately, but eventually all the
VMs get stuck, and then any new process also gets stuck. That makes
extracting information from the system difficult, so there's not a lot
to go on, but this is what I've got so far.

Systems (Fedora openQA worker hosts) on kernel 5.12.12+ wind up in a
state where forking does not work correctly, breaking most things
https://bugzilla.redhat.com/show_bug.cgi?id=2009585

Some items of interest from that bug:

First, there is this megaraid_sas trace. The hang hasn't happened at this
point, though, so it may not be related at all, or it might be an instigator.
https://bugzilla.redhat.com/show_bug.cgi?id=2009585#c31

Once there is a hang, we have these traces, obtained by reducing the
time the kernel waits before reporting blocked tasks. I'm told by
kvm/qemu folks that many of the messages show pretty ordinary/expected
locks. But the XFS portions might give a clue about what's going on?

5.15-rc7
https://bugzilla-attachments.redhat.com/attachment.cgi?id=1840941
5.15+
https://bugzilla-attachments.redhat.com/attachment.cgi?id=1840939
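
For anyone sifting through those attachments, a rough Python sketch of
how the blocked-task reports could be tallied per task name, to see
which tasks show up most often. It assumes the usual hung-task wording
("INFO: task NAME:PID blocked for more than N seconds"). The sample log
lines, task names and PIDs below are invented for illustration and are
not taken from the attachments, and tally_blocked_tasks is a
hypothetical helper name.

```python
import re
from collections import Counter

# Assumed wording of the kernel's hung-task report line, for example:
#   INFO: task qemu-system-x86:4321 blocked for more than 30 seconds.
# The task name may itself contain ':' (e.g. kworker/u16:3), so the name
# is matched lazily up to the last ":<pid>" before " blocked".
HUNG_RE = re.compile(
    r"INFO: task (?P<comm>.+?):(?P<pid>\d+) "
    r"blocked for more than (?P<secs>\d+) seconds"
)

def tally_blocked_tasks(log_text):
    """Count hung-task reports per task name in a kernel log dump."""
    counts = Counter()
    for line in log_text.splitlines():
        match = HUNG_RE.search(line)
        if match:
            counts[match.group("comm")] += 1
    return counts

# Made-up sample input, only to show the expected shape of the log.
SAMPLE_LOG = """\
[ 1234.5] INFO: task qemu-system-x86:4321 blocked for more than 30 seconds.
[ 1234.6]       Not tainted 5.15.0 #1
[ 1234.7] INFO: task xfsaild/dm-3:987 blocked for more than 30 seconds.
[ 1300.1] INFO: task qemu-system-x86:4400 blocked for more than 30 seconds.
[ 1300.2] INFO: task kworker/u16:3:1234 blocked for more than 30 seconds.
"""

if __name__ == "__main__":
    for comm, n in tally_blocked_tasks(SAMPLE_LOG).most_common():
        print(f"{n:3d}  {comm}")
```

A count alone says nothing about which task is the root cause, but it
can help separate the expected qemu waits from the XFS and block layer
entries worth reading closely.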

So I can imagine the VMs are stuck because XFS is stuck, and XFS is
stuck because something in the block layer or megaraid driver is
stuck, but I don't know that for certain.

Thanks,

-- 
Chris Murphy