Re: Inflight Corruption of XFS filesystem on CentOS 7.7 VMs

On Sat, Nov 16, 2019 at 06:51:41PM +1300, Patrick Rynhart wrote:
> On Sat, 16 Nov 2019 at 18:29, Eric Sandeen <sandeen@xxxxxxxxxxx> wrote:
> >
> > On 11/15/19 9:33 PM, Patrick Rynhart wrote:
> > > Hi all,
> > >
> > > A small number of our CentOS VMs (about 4 out of a fleet of 200) are
> > > experiencing ongoing, regular XFS corruption - and I'm not sure how to
> > > troubleshoot the problem.  They are all CentOS 7.7 VMs and are using
> > > VMware Paravirtual SCSI.  The version of xfsprogs being used is
> > > 4.5.0-20.el7.x86_64, and the kernel is 3.10.0-1062.1.2.el7.x86_64.
> > > The VMware version is ESXi 6.5.0, build 14320405.
> > >
> > > When the fault happens - the VMs will go into single user mode with
> > > the following text displayed on the console:
> > >
> > > sd 0:0:0:0: [sda] Assuming drive cache: write through
> > > XFS (dm-0): Internal error XFS_WANT_CORRUPTED_GOTO at line 1664 of file fs/xfs/libxfs/xfs_alloc.c. Caller xfs_free_extent+0xaa/0x140 [xfs]
> > > XFS (dm-0): Internal error xfs_trans_cancel at line 984 of file fs/xfs/xfs_trans.c. Caller xfs_efi_recover+0x17d/0x1a0 [xfs]
> > > XFS (dm-0): Corruption of in-memory data detected. Shutting down filesystem
> > > XFS (dm-0): Please umount the filesystem and rectify the problem(s)
> > > XFS (dm-0): Failed to recover intents
> >
> > Seems like this is not the whole relevant log; "Failed to recover intents"
> > indicates it was in log replay but we don't see that starting.  Did you
> > cut out other interesting bits?
> 
> Thank you for the reply.  When the problem happens the system ends up
> in the EL7 dracut emergency shell.  Here's a picture of what the
> console looks like right now (I haven't rebooted yet):
> 
> https://pasteboard.co/IGUpPiN.png
> 
> How can I get some debug information about the (attempted?) log replay
> for analysis?
> 

At this point I'm not sure there's a ton to gain from recovery analysis.
The filesystem shows free space corruption: during log recovery, it
attempts to free an extent that is already marked free. The corruption
was introduced at some point in the past; recovery is just the first
place where we detect it and fail. What we really want to find out is
how the corruption is introduced in the first place. That may not be
trivial, but it might be possible with instrumentation or custom debug
code if you can reproduce this reliably enough and are willing to go
that route. When you say 4 out of 200 VMs show this problem, is it
consistently the same set of VMs or is that just the rate of failure of
random guests out of the 200?
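
As an aside, if you want to check whether any of the other guests are
carrying the same latent free space corruption, a read-only xfs_repair
pass will report it without modifying anything on disk. It has to run
against an unmounted filesystem (e.g. from a rescue shell or against a
snapshot of the disk), and the device path below is just an example of
what dm-0 might map to on your systems:

  # read-only check: reports inconsistencies but makes no changes
  # (the filesystem must not be mounted)
  xfs_repair -n /dev/mapper/centos-root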

Logistical questions aside, I think the first technical question to
answer is: why are you in recovery in the first place? We'd want to know
because that could help distinguish a logging/recovery problem from a
runtime bug introducing the corruption. Recovery should only be required
after a crash or unclean shutdown. Do you know what kind of event caused
the unclean shutdown? Did you see a runtime crash and filesystem
shutdown with a similar corruption report as shown here, or was it an
unrelated event? Please post system log output if you happen to have a
record of an instance of the former. If there is such a corruption
report, an xfs_metadump of the filesystem might also be useful to look
at before you run xfs_repair.
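
For example, something along these lines should capture what we'd want
to look at (the device path is just a placeholder for whatever dm-0 maps
to on your systems):

  # kernel messages from the boot where the shutdown was reported
  # (requires a persistent journal; otherwise grab dmesg or the console log)
  journalctl -k -b -1 > xfs-corruption-kernel.log

  # metadata-only image of the filesystem, taken before xfs_repair runs;
  # the filesystem must be unmounted. -g prints progress, -o keeps real
  # names and block numbers instead of obfuscating them (drop -o if that
  # is a privacy concern).
  xfs_metadump -g -o /dev/mapper/centos-root centos-root.metadump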

Brian

> > -Eric
> 




