Re: VMs getting into stuck states since kernel ~5.13

On Wed, Dec 08, 2021 at 01:54:02PM -0500, Chris Murphy wrote:
> Hi,
> 
> I'm trying to help make progress on a kernel regression hitting Fedora
> infrastructure in which dozens of VMs run concurrently to execute QA
> testing. The problem doesn't happen immediately, but all the VMs get
> stuck and then any new process also gets stuck, so extracting
> information from the system has been difficult and there's not a lot
> to go on, but this is what I've got so far.
> 
> Systems (Fedora openQA worker hosts) on kernel 5.12.12+ wind up in a
> state where forking does not work correctly, breaking most things
> https://bugzilla.redhat.com/show_bug.cgi?id=2009585
> 
> In that bug some items of interest ...
> 
> This megaraid_sas trace. The hang hasn't happened at this point
> though, so it may not be related at all or it might be an instigator.
> https://bugzilla.redhat.com/show_bug.cgi?id=2009585#c31

That's indicative of a bio handling bug somewhere in the storage
stack, likely the MD RAID layer...

> Once there is a hang, we have these traces from reducing the time for
> the kernel to report blocked tasks. I'm told by the kvm/qemu folks that
> most of the messages are pretty ordinary/expected locks. But the XFS
> portions might give a clue what's going on?
> 
> 5.15-rc7
> https://bugzilla-attachments.redhat.com/attachment.cgi?id=1840941

So you have processes waiting on both journal IO completion
(xlog_wait_on_iclog()) and data IO completion
(wait_on_page_writeback()).

> 5.15+
> https://bugzilla-attachments.redhat.com/attachment.cgi?id=1840939

And the same here, except it is folio_wait_writeback() in this one.

They are all waiting for the storage to complete IOs.

> So I can imagine the VMs are stuck because XFS is stuck. And XFS is
> stuck because something in the block layer or megaraid driver is
> stuck, but I don't know that for certain.

Looking at the traces, I'd say IO is really slow, but not stuck.
`iostat -dxm 5` output for a few minutes will tell you if IO is
actually making progress or not.
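
If running iostat on a wedged host is awkward, the same "slow or stuck"
question can be roughly answered by comparing two snapshots of
/proc/diskstats. The following is only a sketch, not a replacement for the
iostat output requested here: the function names and the three labels
("progressing", "stalled", "idle") are made up for illustration, and it
assumes the standard diskstats layout in which the reads-completed,
writes-completed and in-flight counters are the 4th, 8th and 12th
whitespace-separated fields.

```python
def parse_diskstats(text):
    """Map device name -> (completed IOs, IOs currently in flight).

    Assumes the usual /proc/diskstats layout:
    major minor name reads_completed ... writes_completed ... in_flight ...
    """
    stats = {}
    for line in text.splitlines():
        fields = line.split()
        if len(fields) < 14:
            continue  # skip blank or unexpected lines
        name = fields[2]
        completed = int(fields[3]) + int(fields[7])  # reads + writes completed
        in_flight = int(fields[11])                  # IOs issued but not done
        stats[name] = (completed, in_flight)
    return stats


def classify(before, after):
    """Label each device seen in both snapshots (labels are hypothetical)."""
    result = {}
    for name, (done_after, in_flight) in after.items():
        if name not in before:
            continue
        done_before, _ = before[name]
        if done_after > done_before:
            result[name] = "progressing"  # completions advanced: slow, not stuck
        elif in_flight > 0:
            result[name] = "stalled"      # IO outstanding, nothing completed
        else:
            result[name] = "idle"         # nothing outstanding, nothing done
    return result


def read_diskstats(path="/proc/diskstats"):
    """Take one snapshot from the running kernel."""
    with open(path) as f:
        return parse_diskstats(f.read())
```

To use it, call read_diskstats() twice a few seconds apart and pass the two
results to classify(). A device that stays "stalled" across several
intervals would suggest IO is genuinely stuck rather than slow; one that
keeps showing "progressing" matches the reading of the traces above.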

Can you please provide the hardware configuration for these machines
and iostat output before we go any further here?

https://xfs.org/index.php/XFS_FAQ#Q:_What_information_should_I_include_when_reporting_a_problem.3F

Cheers,

Dave.
-- 
Dave Chinner
david@xxxxxxxxxxxxx


