Re: Inflight Corruption of XFS filesystem on CentOS 7.7 VMs

Patrick Rynhart <patrick@xxxxxxxxxxxxx> · Sat, 16 Nov 2019 18:51:41 +1300

On Sat, 16 Nov 2019 at 18:29, Eric Sandeen <sandeen@xxxxxxxxxxx> wrote:
>
> On 11/15/19 9:33 PM, Patrick Rynhart wrote:
> > Hi all,
> >
> > A small number of our CentOS VMs (about 4 out of a fleet of 200) are
> > experiencing ongoing, regular XFS corruption - and I'm not sure how to
> > troubleshoot the problem.  They are all CentOS 7.7 VMs and are using
> > VMware Paravirtual SCSI.  The version of xfsprogs being used is
> > 4.5.0-20.el7.x86_64, and the kernel is 3.10.0-1062.1.2.el7.x86_64.
> > The VMware version is ESXi 6.5.0, build 14320405.
> >
> > When the fault happens - the VMs will go into single user mode with
> > the following text displayed on the console:
> >
> > sd 0:0:0:0: [sda] Assuming drive cache: write through
> > XFS (dm-0): Internal error XFS_WANT_CORRUPTED_GOTO at line 1664 of file fs/xfs/libxfs/xfs_alloc.c. Caller xfs_free_extent+0xaa/0x140 [xfs]
> > XFS (dm-0): Internal error xfs_trans_cancel at line 984 of file fs/xfs/xfs_trans.c. Caller xfs_efi_recover+0x17d/0x1a0 [xfs]
> > XFS (dm-0): Corruption of in-memory data detected. Shutting down filesystem
> > XFS (dm-0): Please umount the filesystem and rectify the problem(s)
> > XFS (dm-0): Failed to recover intents
>
> Seems like this is not the whole relevant log; "Failed to recover intents"
> indicates it was in log replay but we don't see that starting.  Did you
> cut out other interesting bits?

Thank you for the reply.  When the problem happens the system ends up
in the EL7 dracut emergency shell.  Here's a picture of what the
console looks like right now (I haven't rebooted yet):

https://pasteboard.co/IGUpPiN.png

How can I capture debug information about the (attempted?) log replay
for analysis?

> -Eric