Corruption of in-memory data (0x8) detected at xfs_defer_finish_noroll on kernel 6.3

Mike Pastore <mike@xxxxxxxxx> · Tue, 2 May 2023 14:14:34 -0500

Hi folks,

I was playing around with some blockchain projects yesterday and had
some curious crashes while syncing blockchain databases on XFS
filesystems under kernel 6.3.

  * kernel 6.3.0 and 6.3.1 (ubuntu mainline)
  * w/ and w/o the discard mount flag
  * w/ and w/o -m crc=0
  * ironfish (nodejs) and ergo (jvm)

The hardware is as follows:

  * Asus PRIME H670-PLUS D4
  * Intel Core i5-12400
  * 32GB DDR4-3200 Non-ECC UDIMM

In all cases the filesystems were newly-created under kernel 6.3 on an
LVM2 stripe and mounted with the noatime flag. Here is the output of
the mkfs.xfs command (captured after reverting to 6.2.14, which I realize
may not be the most helpful thing, but here it is anyway):

$ sudo lvremove -f vgtethys/ironfish
$ sudo lvcreate -n ironfish -L 10G -i2 vgtethys /dev/nvme[12]n1p3
  Using default stripesize 64.00 KiB.
  Logical volume "ironfish" created.
$ sudo mkfs.xfs -m crc=0 -m uuid=b4725d43-a12d-42df-981a-346af2809fad -s size=4096 /dev/vgtethys/ironfish
meta-data=/dev/vgtethys/ironfish isize=256    agcount=16, agsize=163824 blks
         =                       sectsz=4096  attr=2, projid32bit=1
         =                       crc=0        finobt=0, sparse=0, rmapbt=0
         =                       reflink=0    bigtime=0 inobtcount=0
data     =                       bsize=4096   blocks=2621184, imaxpct=25
         =                       sunit=16     swidth=32 blks
naming   =version 2              bsize=4096   ascii-ci=0, ftype=1
log      =internal log           bsize=4096   blocks=2560, version=2
         =                       sectsz=4096  sunit=1 blks, lazy-count=1
realtime =none                   extsz=4096   blocks=0, rtextents=0
Discarding blocks...Done.
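
As a quick sanity check, the geometry reported above looks consistent with
a 64 KiB LVM stripe across two devices. The sketch below only redoes the
arithmetic; the variable names are illustrative, and the values are copied
from the mkfs.xfs output.

```python
# Values copied from the mkfs.xfs output above; names are illustrative only.
BLOCK_SIZE = 4096        # data bsize, in bytes
SUNIT_BLKS = 16          # sunit, in filesystem blocks
SWIDTH_BLKS = 32         # swidth, in filesystem blocks
AGCOUNT = 16             # number of allocation groups
AGSIZE_BLKS = 163824     # blocks per allocation group
DATA_BLKS = 2621184      # total data blocks

stripe_unit_bytes = SUNIT_BLKS * BLOCK_SIZE   # should match the LVM stripesize
stripe_members = SWIDTH_BLKS // SUNIT_BLKS    # should match lvcreate -i2
ag_total_blks = AGCOUNT * AGSIZE_BLKS         # should match the data block count
size_gib = DATA_BLKS * BLOCK_SIZE / 2**30     # slightly under the 10G LV

print(f"stripe unit: {stripe_unit_bytes // 1024} KiB")
print(f"stripe members: {stripe_members}")
print(f"AG blocks match data blocks: {ag_total_blks == DATA_BLKS}")
print(f"filesystem size: {size_gib:.3f} GiB")
```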

The applications crash with I/O errors. Here's what I see in dmesg:

May 01 18:56:59 tethys kernel: XFS (dm-28): Internal error bno + len > gtbno at line 1908 of file fs/xfs/libxfs/xfs_alloc.c.  Caller xfs_free_ag_extent+0x14e/0x950 [xfs]
May 01 18:56:59 tethys kernel: CPU: 2 PID: 48657 Comm: node Tainted: P           OE      6.3.1-060301-generic #202304302031
May 01 18:56:59 tethys kernel: Hardware name: ASUS System Product Name/PRIME H670-PLUS D4, BIOS 2014 10/14/2022
May 01 18:56:59 tethys kernel: Call Trace:
May 01 18:56:59 tethys kernel:  <TASK>
May 01 18:56:59 tethys kernel:  dump_stack_lvl+0x48/0x70
May 01 18:56:59 tethys kernel:  dump_stack+0x10/0x20
May 01 18:56:59 tethys kernel:  xfs_corruption_error+0x9e/0xb0 [xfs]
May 01 18:56:59 tethys kernel:  ? xfs_free_ag_extent+0x14e/0x950 [xfs]
May 01 18:56:59 tethys kernel:  xfs_free_ag_extent+0x17c/0x950 [xfs]
May 01 18:56:59 tethys kernel:  ? xfs_free_ag_extent+0x14e/0x950 [xfs]
May 01 18:56:59 tethys kernel:  __xfs_free_extent+0xee/0x1e0 [xfs]
May 01 18:56:59 tethys kernel:  xfs_trans_free_extent+0xad/0x1a0 [xfs]
May 01 18:56:59 tethys kernel:  xfs_extent_free_finish_item+0x14/0x40 [xfs]
May 01 18:56:59 tethys kernel:  xfs_defer_finish_one+0xd9/0x280 [xfs]
May 01 18:56:59 tethys kernel:  xfs_defer_finish_noroll+0xab/0x280 [xfs]
May 01 18:56:59 tethys kernel:  xfs_defer_finish+0x16/0x80 [xfs]
May 01 18:56:59 tethys kernel:  xfs_itruncate_extents_flags+0xe3/0x270 [xfs]
May 01 18:56:59 tethys kernel:  xfs_free_eofblocks+0xe3/0x130 [xfs]
May 01 18:56:59 tethys kernel:  xfs_release+0x153/0x190 [xfs]
May 01 18:56:59 tethys kernel:  xfs_file_release+0x15/0x20 [xfs]
May 01 18:56:59 tethys kernel:  __fput+0x95/0x270
May 01 18:56:59 tethys kernel:  ____fput+0xe/0x20
May 01 18:56:59 tethys kernel:  task_work_run+0x5e/0xa0
May 01 18:56:59 tethys kernel:  exit_to_user_mode_loop+0x136/0x160
May 01 18:56:59 tethys kernel:  exit_to_user_mode_prepare+0xff/0x110
May 01 18:56:59 tethys kernel:  syscall_exit_to_user_mode+0x1b/0x50
May 01 18:56:59 tethys kernel:  do_syscall_64+0x67/0x90
May 01 18:56:59 tethys kernel:  ? syscall_exit_to_user_mode+0x44/0x50
May 01 18:56:59 tethys kernel:  ? do_syscall_64+0x67/0x90
May 01 18:56:59 tethys kernel:  entry_SYSCALL_64_after_hwframe+0x72/0xdc
May 01 18:56:59 tethys kernel: RIP: 0033:0x7f8fce72c6a7
May 01 18:56:59 tethys kernel: Code: 44 00 00 48 8b 15 e9 d7 0d 00 f7 d8 64 89 02 b8 ff ff ff ff eb bc 66 2e 0f 1f 84 00 00 00 00 00 0f 1f 00 b8 03 00 00 00 0f 05 <48> 3d 00 f0 ff ff 77 01 c3 48 >
May 01 18:56:59 tethys kernel: RSP: 002b:00007f8fb2a67a78 EFLAGS: 00000202 ORIG_RAX: 0000000000000003
May 01 18:56:59 tethys kernel: RAX: 0000000000000000 RBX: 00007f8f98019420 RCX: 00007f8fce72c6a7
May 01 18:56:59 tethys kernel: RDX: 00007f8fce806880 RSI: 00007f8f982a9b40 RDI: 000000000000004c
May 01 18:56:59 tethys kernel: RBP: 0000000000000000 R08: 0000000000000000 R09: 00007f8fc02c5520
May 01 18:56:59 tethys kernel: R10: 0000000000000064 R11: 0000000000000202 R12: 00007f8fce807480
May 01 18:56:59 tethys kernel: R13: 0000000000006be1 R14: 0000000000000019 R15: 00007f8f980a8b50
May 01 18:56:59 tethys kernel:  </TASK>
May 01 18:56:59 tethys kernel: XFS (dm-28): Corruption detected. Unmount and run xfs_repair
May 01 18:56:59 tethys kernel: XFS (dm-28): Corruption of in-memory data (0x8) detected at xfs_defer_finish_noroll+0x130/0x280 [xfs] (fs/xfs/libxfs/xfs_defer.c:573).  Shutting down filesystem.
May 01 18:56:59 tethys kernel: XFS (dm-28): Please unmount the filesystem and rectify the problem(s)
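
For anyone skimming the trace, my reading of "bno + len > gtbno" is that it
is an overlap check: the extent being freed (starting at bno, len blocks
long) runs past the start (gtbno) of the next free extent to its right, so
some blocks would end up freed twice. The snippet below is a simplified
illustration of that condition, not the kernel code; the function name and
the sample numbers are made up.

```python
def overlaps_right_neighbor(bno: int, length: int, gtbno: int) -> bool:
    """Return True if freeing [bno, bno + length) would overlap the
    already-free extent that starts at gtbno (hypothetical helper)."""
    return bno + length > gtbno


# Freed extent ends exactly where the right neighbor begins: contiguous, fine.
contiguous = overlaps_right_neighbor(bno=100, length=8, gtbno=108)

# Freed extent ends two blocks into the right neighbor: this is the
# condition reported as corruption in the log above.
overlapping = overlaps_right_neighbor(bno=100, length=10, gtbno=108)

print(f"contiguous case flagged: {contiguous}")
print(f"overlapping case flagged: {overlapping}")
```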

And here's what I see in dmesg after rebooting and attempting to mount
the filesystem to replay the log:

May 01 21:34:15 tethys kernel: XFS (dm-35): Metadata corruption detected at xfs_inode_buf_verify+0x168/0x190 [xfs], xfs_inode block 0x1405a0 xfs_inode_buf_verify
May 01 21:34:15 tethys kernel: XFS (dm-35): Unmount and run xfs_repair
May 01 21:34:15 tethys kernel: XFS (dm-35): First 128 bytes of corrupted metadata buffer:
May 01 21:34:15 tethys kernel: 00000000: 5b 40 e2 3a ae 52 a0 7a 17 1d 5a f6 f0 de 4c 62  [@.:.R.z..Z...Lb
May 01 21:34:15 tethys kernel: 00000010: d6 31 8b 51 ca 6e ad a2 7e f5 18 65 6e 8a 41 3f  .1.Q.n..~..en.A?
May 01 21:34:15 tethys kernel: 00000020: 68 b5 02 16 2c 84 5d 33 ac 46 fc c9 da 93 af 3f  h...,.]3.F.....?
May 01 21:34:15 tethys kernel: 00000030: a0 3e b7 9c b4 99 5a 45 8c 2f 13 ed bb 07 57 e1  .>....ZE./....W.
May 01 21:34:15 tethys kernel: 00000040: bc 96 aa d7 00 2a 81 65 e6 3b 86 9d b5 0a 63 bd  .....*.e.;....c.
May 01 21:34:15 tethys kernel: 00000050: 38 e5 63 1a 09 42 36 4c b8 e8 7c 92 73 01 04 da  8.c..B6L..|.s...
May 01 21:34:15 tethys kernel: 00000060: 27 df 43 92 b1 ad ba ec 7a 02 3f 8e 84 3a bb cc  '.C.....z.?..:..
May 01 21:34:15 tethys kernel: 00000070: 39 06 74 d1 8b 04 b7 f2 62 c1 c4 f0 3c 5c 54 4f  9.t.....b...<\TO
May 01 21:34:15 tethys kernel: XFS (dm-35): metadata I/O error in "xlog_recover_items_pass2+0x56/0xf0 [xfs]" at daddr 0x1405a0 len 32 error 117
May 01 21:34:15 tethys kernel: XFS (dm-35): log mount/recovery failed: error -117
May 01 21:34:15 tethys kernel: XFS (dm-35): log mount failed
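
To put the numbers in that last error into more familiar units, here is a
small sketch. It assumes that daddr and len are expressed in 512-byte basic
blocks and that error 117 is EUCLEAN ("Structure needs cleaning"), which is
what Linux reports for filesystem corruption; the errno lookup is only
meaningful when run on Linux.

```python
import errno
import os

SECTOR_BYTES = 512   # assumption: daddr/len are in 512-byte basic blocks

daddr = 0x1405A0     # "daddr 0x1405a0" from the log above
length = 32          # "len 32" from the log above
err = 117            # "error 117" from the log above

byte_offset = daddr * SECTOR_BYTES   # where the buffer starts on the device
buf_bytes = length * SECTOR_BYTES    # size of the buffer that failed to verify

print(f"buffer starts at byte offset {byte_offset}")
print(f"buffer size: {buf_bytes // 1024} KiB")
# errno names are platform-specific; on Linux 117 should map to EUCLEAN.
print(f"errno {err}: {errno.errorcode.get(err, 'unknown')} ({os.strerror(err)})")
```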

Blockchain projects tend to generate pathological filesystem loads;
the sustained random write activity and constant (re)allocations must
be pushing on some soft spot here. Reverting to kernel 6.2.14 and
recreating the filesystems seems to have resolved the issue (so far, at
least), but obviously this is less than ideal. If someone would be
willing to provide a targeted list of desired artifacts, I'd be happy
to boot back into kernel 6.3.1 to reproduce the issue and collect
them. Alternatively I can try to eliminate some variables (like LVM2,
potential hardware instabilities, etc.) and provide step-by-step
directions for reproducing the issue on another machine.

Thank you,

Mike