On Tue, May 02, 2023 at 02:14:34PM -0500, Mike Pastore wrote:
> Hi folks,
>
> I was playing around with some blockchain projects yesterday and had
> some curious crashes while syncing blockchain databases on XFS
> filesystems under kernel 6.3.
>
> * kernel 6.3.0 and 6.3.1 (ubuntu mainline)
> * w/ and w/o the discard mount flag
> * w/ and w/o -m crc=0
> * ironfish (nodejs) and ergo (jvm)
>
> The hardware is as follows:
>
> * Asus PRIME H670-PLUS D4
> * Intel Core i5-12400
> * 32GB DDR4-3200 Non-ECC UDIMM
>
> In all cases the filesystems were newly-created under kernel 6.3 on an
> LVM2 stripe and mounted with the noatime flag. Here is the output of
> the mkfs.xfs command (after reverting back to 6.2.14 - which I realize
> may not be the most helpful thing, but here it is anyway):
>
> $ sudo lvremove -f vgtethys/ironfish
> $ sudo lvcreate -n ironfish -L 10G -i2 vgtethys /dev/nvme[12]n1p3
>   Using default stripesize 64.00 KiB.
>   Logical volume "ironfish" created.
> $ sudo mkfs.xfs -m crc=0 -m uuid=b4725d43-a12d-42df-981a-346af2809fad
> -s size=4096 /dev/vgtethys/ironfish
> meta-data=/dev/vgtethys/ironfish isize=256    agcount=16, agsize=163824 blks
>          =                       sectsz=4096  attr=2, projid32bit=1
>          =                       crc=0        finobt=0, sparse=0, rmapbt=0
>          =                       reflink=0    bigtime=0 inobtcount=0
> data     =                       bsize=4096   blocks=2621184, imaxpct=25
>          =                       sunit=16     swidth=32 blks

Stripe aligned allocation is enabled. Does the problem go away when you
use mkfs.xfs -d noalign .... ?

> The applications crash with I/O errors. Here's what I see in dmesg:
>
> May 01 18:56:59 tethys kernel: XFS (dm-28): Internal error
> bno + len > gtbno at line 1908 of file fs/xfs/libxfs/xfs_alloc.c.
> Caller xfs_free_ag_extent+0x14e/0x950 [xfs]

	/*
	 * If this failure happens the request to free this
	 * space was invalid, it's (partly) already free.
	 * Very bad.
	 */
	if (XFS_IS_CORRUPT(mp, ltbno + ltlen > bno)) {
		error = -EFSCORRUPTED;
		goto error0;
	}

That failure implies the btree records are corrupt in memory, possibly
due to memory corruption from something outside the XFS code (e.g. use
after free).

> May 01 18:56:59 tethys kernel: CPU: 2 PID: 48657 Comm: node Tainted: P
> OE 6.3.1-060301-generic #202304302031

The kernel being run has been tainted by out of tree proprietary
drivers (a common source of memory corruption bugs in my experience).
Can you reproduce this problem with an untainted kernel?

....

> And here's what I see in dmesg after rebooting and attempting to mount
> the filesystem to replay the log:
>
> May 01 21:34:15 tethys kernel: XFS (dm-35): Metadata corruption
> detected at xfs_inode_buf_verify+0x168/0x190 [xfs], xfs_inode block
> 0x1405a0 xfs_inode_buf_verify
> May 01 21:34:15 tethys kernel: XFS (dm-35): Unmount and run xfs_repair
> May 01 21:34:15 tethys kernel: XFS (dm-35): First 128 bytes of
> corrupted metadata buffer:
> May 01 21:34:15 tethys kernel: 00000000: 5b 40 e2 3a ae 52 a0 7a 17 1d

That's not an inode buffer. It's not recognisable as XFS metadata at
all, which indicates some other problem.

Oh, this was from a test with "mkfs.xfs -m crc=0 ...", right?

Please don't use "-m crc=0" - that format is deprecated partly because
it has unfixable on-disk format recovery issues. One of those issues
manifests as an inode recovery failure because the underlying inode
buffer allocation/init does not get replayed correctly before we
attempt to replay inode changes into the buffer (that has not been
initialised)....

i.e. one of those unfixable issues manifests exactly like the recovery
failure being reported here.
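For the retest, something like the sketch below should give you an
unaligned, CRC enabled filesystem to run against, and confirm the
kernel is untainted before you start the workload. This is untested
here and the device path is just carried over from your lvcreate
example above, so adjust to your setup.

	# 0 means no taint flags are set
	$ cat /proc/sys/kernel/tainted

	# remake the test filesystem with CRCs enabled (the default)
	# and stripe alignment turned off
	$ sudo mkfs.xfs -f -d noalign /dev/vgtethys/ironfish
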
> Blockchain projects tend to generate pathological filesystem loads;
> the sustained random write activity and constant (re)allocations must
> be pushing on some soft spot here.

There was a significant allocator infrastructure rewrite in 6.3. If
running an untainted kernel on an unaligned, CRC enabled filesystem
makes the problems go away, then it rules out known issues with the
rewrite.

Alternatively, if it is reproducible in a short time, you may be able
to bisect the XFS changes that landed between 6.2 and 6.3 to find
which change triggers the problem.

> Reverting to kernel 6.2.14 and recreating the filesystems seems to
> have resolved the issue - so far, at least - but obviously this is
> less than ideal. If someone would be willing to provide a targeted
> list of desired artifacts I'd be happy to boot back into kernel 6.3.1
> to reproduce the issue and collect them. Alternatively I can try to
> eliminate some variables (like LVM2, potential hardware
> instabilities, etc.) and provide step-by-step directions for
> reproducing the issue on another machine.

If you can find a minimal reproducer, that would help a lot in
diagnosing the issue.

Cheers,

Dave.
-- 
Dave Chinner
david@xxxxxxxxxxxxx