On Tue, Sep 19, 2017 at 8:38 AM, Darrick J. Wong <darrick.wong@xxxxxxxxxx> wrote:
> On Fri, Sep 08, 2017 at 11:34:58AM +0300, Amir Goldstein wrote:
>> On Thu, Sep 7, 2017 at 7:13 PM, Darrick J. Wong <darrick.wong@xxxxxxxxxx> wrote:
>> > On Thu, Sep 07, 2017 at 03:58:56PM +0300, Amir Goldstein wrote:
>> >> Hi guys,
>> >>
>> >> I am getting these errors often when running the crash tests
>> >> with cloned files (generic/502 in my xfstests patches).
>> >>
>> >> Hitting these errors requires first fixing 2 other issues
>> >> that shadow over this issue:
>> >> "xfs: fix incorrect log_flushed on fsync" (in master)
>> >> "xfs: fix leftover CoW extent after truncate"
>> >> available on my tree based on Darrick's simple fix:
>> >> https://github.com/amir73il/linux/commits/xfs-fsync
>> >>
>> >> I get the errors more often (1 out of 5) on a 100G fs on spinning disk.
>> >> On a 10G fs on SSD they are less frequent.
>> >> The log in this email was captured on patched stable 4.9.47 kernel,
>> >> but I am getting the same errors on patched upstream kernel.
>> >>
>> >> I wasn't able to create a deterministic reproducer, so attaching
>> >> the full log from a failed test along with an IO log that can be
>> >> replayed on your disk to examine the outcome.
>> >>
>> >> Following is the output of fsx process #5, which is the process
>> >> that wrote the problematic testfile5.mark0 to the log.
>> >> This process performs only read,zero,fsync before creating
>> >> the log mark.
>> >> The file testfile5 was cloned from an origin 256K file before
>> >> running fsx.
>> >> Later, I used the random seed 35484 in this log for all
>> >> processes and it seemed to increase the probability for failure.
>> >>
>> >> # /old/home/amir/src/xfstests-dev/ltp/fsx -N 100 -d -k -P
>> >> /mnt/test/fsxtests -i /dev/mapper/logwrites-test -S 0 -j 5
>> >> /mnt/scratch/testfile5
>> >> Seed set to 35484
>> >> file_size=262144
>> >> 5: 1 read 0x3f959 thru 0x3ffff (0x6a7 bytes)
>> >> 5: 2 zero from 0x3307e to 0x34f74, (0x1ef6 bytes)
>> >> 5: 3 fsync
>> >> 5: Dumped fsync buffer to testfile5.mark0
>> >>
>> >> In order to get to the crash state you need to get my
>> >> xfstests replay-log patches and replay the attached log
>> >> on a >= 100G scratch device:
>> >>
>> >> # ./src/log-writes/replay-log --log log.xfs.testfile5.mark0 --replay
>> >> $SCRATCH_DEV --end-mark testfile5.mark0
>> >> # mount $SCRATCH_DEV $SCRATCH_MNT
>> >> # umount $SCRATCH_MNT
>> >> # xfs_repair -n $SCRATCH_DEV
>> >> Phase 1 - find and verify superblock...
>> >> Phase 2 - using internal log
>> >>         - zero log...
>> >>         - scan filesystem freespace and inode maps...
>> >>         - found root inode chunk
>> >> Phase 3 - for each AG...
>> >>         - scan (but don't clear) agi unlinked lists...
>> >>         - process known inodes and perform inode discovery...
>> >>         - agno = 0
>> >>
>> >> fatal error -- illegal state 13 in block map 376
>> >>
>> >> Can anyone provide some insight?
>> >
>> > Looks like I missed a couple of extent states in process_bmbt_reclist_int.
>> >
>> > What happens if you add the following (only compile tested) patch to
>> > xfsprogs?
>>
>> This is what happens:
>>
>> Phase 1 - find and verify superblock...
>> Phase 2 - using internal log
>>         - zero log...
>>         - scan filesystem freespace and inode maps...
>>         - found root inode chunk
>> Phase 3 - for each AG...
>>         - scan (but don't clear) agi unlinked lists...
>>         - process known inodes and perform inode discovery...
>>         - agno = 0
>> data fork in regular inode 134 claims CoW block 376
>> correcting nextents for inode 134
>> bad data fork in inode 134
>> would have cleared inode 134
>>         - agno = 1
>>         - agno = 2
>>         - agno = 3
>>         - process newly discovered inodes...
>> Phase 4 - check for duplicate blocks...
>>         - setting up duplicate extent list...
>> unknown block state, ag 0, block 376
>> unknown block state, ag 1, block 16
>>         - check for inodes claiming duplicate blocks...
>>         - agno = 0
>>         - agno = 1
>>         - agno = 2
>> entry "testfile2" in shortform directory 128 references free inode 134
>>         - agno = 3
>> would have junked entry "testfile2" in directory inode 128
>> imap claims in-use inode 134 is free, would correct imap
>> Missing reverse-mapping record for (0/376) len 1 owner 134 off 19
>> No modify flag set, skipping phase 5
>> Phase 6 - check inode connectivity...
>>         - traversing filesystem ...
>>         - traversal finished ...
>>         - moving disconnected inodes to lost+found ...
>> Phase 7 - verify link counts...
>> No modify flag set, skipping filesystem flush and exiting.
>
> Hmm.
>
> # xfs_db bugs.img
> xfs_db> inode 134
> xfs_db> bmap
> data offset 0 startblock 24 (0/24) count 19 flag 0
> data offset 19 startblock 376 (0/376) count 1 flag 0
> data offset 25 startblock 8388624 (1/16) count 4 flag 1
> data offset 29 startblock 53 (0/53) count 35 flag 0
>
> Ok, so inode 134 thinks it owns block 376.
>
> xfs_db> agf 0
> xfs_db> addr refcntroot
> xfs_db> p
> ...
> recs[1-22] = [startblock,blockcount,refcount,cowflag]
> 1:[24,1,11,0]
> ...
> 22:[87,1,6,0]
> xfs_db>
>
> Ok, so the refcountbt doesn't know about block 376, so it's neither
> shared nor a cow staging extent.
>
> xfs_db> agf 0
> xfs_db> addr rmaproot
> xfs_db> p
> ...
> recs[1-38] = [startblock,blockcount,owner,offset,extentflag,attrfork,bmbtblock]
> 1:[0,1,-3,0,0,0,0]
> ...
> 38:[376,1,-9,0,0,0,0]
> xfs_db>
>
> So it seems that the rmapbt is behind the times, since record 38 should
> be [376,1,134,19,0,0,0]. Either that's because there's something wrong
> with the regular deferred ops mechanism, or log recovery of deferred ops.
> Let's have a look at the log...
>
> # xfs_logprint bugs.img | grep x178 -B3
> Oper (426): tid: 5c75a3f len: 48 clientid: TRANS flags: none
> CUI: #regs: 1 num_extents: 2 id: 0xffff8f0e842f74b8
> (s: 0x2b, l: 1, f: 0x2)
> (s: 0x178, l: 1, f: 0x4)
>
> ...ok, so we XFS_REFCOUNT_FREE_COW block 376, which removes it from
> the refcountbt...
>
> ----------------------------------------------------------------------------
> Oper (427): tid: 5c75a3f len: 48 clientid: TRANS flags: none
> BUI: #regs: 1 num_extents: 1 id: 0xffff8f0e81f39000
> (s: 0x178, l: 1, own: 134, off: 19, f: 0x1)
>
> ...now we map block 376 into the data fork of inode 134 at offset 19...
>
> --
> ----------------------------------------------------------------------------
> Oper (447): tid: 5c75a3f len: 48 clientid: TRANS flags: none
> RUI: #regs: 1 num_extents: 1 id: 0xffff8f0e48254000
> (s: 0x178, l: 1, own: -9, off: 0, f: 0x8)
>
> ...now we XFS_RMAP_EXTENT_FREE the CoW extent @ block 376 from the rmap,
> which ought to remove the record...
>
> --
> ----------------------------------------------------------------------------
> Oper (453): tid: e0ecea6a len: 48 clientid: TRANS flags: none
> RUI: #regs: 1 num_extents: 1 id: 0xffff8f0e48254530
> (s: 0x178, l: 1, own: 134, off: 19, f: 0x2)
>
> ...and XFS_RMAP_EXTENT_MAP_SHARED a new rmapping for block 376, which
> should add the [376,1,134,19,0,0,0] that we'd expect to be there...
>
> --
> ----------------------------------------------------------------------------
> Oper (13): tid: 17c5790c len: 48 clientid: TRANS flags: none
> RUI: #regs: 1 num_extents: 1 id: 0xffff925955341f20
> (s: 0x178, l: 1, own: 134, off: 19, f: 0x2)
>
> ...strange, another rmap add? Did the RUIs fail to complete?
>
> # xfs_logprint bugs.img | egrep '(0xffff8f0e48254530|0xffff925955341f20)'
> RUI: #regs: 1 num_extents: 1 id: 0xffff8f0e48254530
> RUD: #regs: 1 id: 0xffff8f0e48254530
> RUI: #regs: 1 num_extents: 1 id: 0xffff925955341f20
> RUD: #regs: 1 id: 0xffff925955341f20
>
> ...the first RUI is logged as operation 453 in tid e0ecea6a, and it has
> an RUD at 460. However, I don't see a COMMIT, so I gather that log
> recovery tried to replay the RUI as operation 13 in tid 17c5790c.
>
> ----------------------------------------------------------------------------
> Oper (13): tid: 17c5790c len: 48 clientid: TRANS flags: none
> RUI: #regs: 1 num_extents: 1 id: 0xffff925955341f20
> (s: 0x178, l: 1, own: 134, off: 19, f: 0x2)
> ----------------------------------------------------------------------------
> Oper (14): tid: 17c5790c len: 16 clientid: TRANS flags: none
> RUD: #regs: 1 id: 0xffff925955341f20
> ----------------------------------------------------------------------------
> Oper (15): tid: 17c5790c len: 16 clientid: TRANS flags: none
> RUD: #regs: 1 id: 0xffff8f0e48254000
> ----------------------------------------------------------------------------
> Oper (16): tid: 17c5790c len: 24 clientid: TRANS flags: none
> BUF: #regs: 3 start blkno: 40 (0x28) len: 8 bmap size: 1 flags: 0x2000
> Oper (17): tid: 17c5790c len: 128 clientid: TRANS flags: none
> BUF DATA
> 0 33424d52 31000000 ffffffff ffffffff 0 28000000 1000000 38000000
> 8 3674196c 564cfddf 7798e496 591b6410 0 42249385 0 1000000
> 10 ffffffff fdffffff 0 0 1000000 2000000 ffffffff fbffffff
> 18 0 0 3000000 2000000 ffffffff faffffff 0 0
>
> Oper (18): tid: 17c5790c len: 128 clientid: TRANS flags: none
> BUF DATA
> 0 0 0 4d010000 9000000 0 8a000000 0 2a000000
> 8 56010000 d000000 ffffffff f7ffffff 0 0 65010000 6b000000
> 10 ffffffff f7ffffff 0 0 78010000 1000000 0 86000000
> 18 0 13000000 0 0 0 0 0 0
>
> ----------------------------------------------------------------------------
> Oper (19): tid: 17c5790c len: 0 clientid: TRANS flags: COMMIT
>
> Ok, so those last two lines in operation 16 show that we've indeed added the
> new record to the rmapbt... <shrug> <confused> <tired>??
>
> Are we sure that the log recovery actions actually get checkpointed
> correctly?

I'm not sure of anything.
dm-log-writes may have bugs, but so far every suspicion has bounced back
to a file system bug.
I'm pretty sure, though, that dm-log-writes does not treat log recovery
actions any differently from other IO; it is simply not smart enough for that.

To better understand what happened at the block level, I added the range
filter options --start-sector and --end-sector to the replay-log tool.
If you have narrowed the problem down to a physical block whose content is
in dispute (which was the case with the data corruption bug), you can run
replay-log --vv on that physical range to study the case.
replay-log is now in upstream xfstests, and so is test generic/457, which
reproduces this error.
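For illustration, an invocation could look something like this (the sector
numbers are made up; with a 4K block fs, AG 0 block 376 would sit around
sector 3008, and I'm assuming the new range options and the --vv verbosity
just combine with the usual --log/--replay/--end-mark flags):

# ./src/log-writes/replay-log --log log.xfs.testfile5.mark0 --replay $SCRATCH_DEV \
	--end-mark testfile5.mark0 --start-sector 3008 --end-sector 3016 --vv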

> And just to check my own sanity, we generate a snapshot and
> then recover it and run xfs_repair, right?
>

Let me rephrase that for clarity of terms:
- We record all IO, starting from before mkfs
- We write a snapshot mark into the recording at certain points (post fsync in this test)
Then
- We replay the recorded IO onto the device up to a mark (i.e. restore from snapshot)
- We mount to recover the xfs log
- We verify the content of the fsynced file
- We umount and check the fs
(the P.S. below sketches this sequence in command form)

I sent you the metadump from this point in time.
I can also send you a metadump from before the mount, so you can examine
that state as well.
Or you can reproduce it yourself on a >= 100G scratch device:

# ./src/log-writes/replay-log --log log.xfs.testfile5.mark0 --replay $SCRATCH_DEV --end-mark testfile5.mark0

Let me know if you need anything else.

Amir.
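P.S. In command form, the sequence above is roughly the sketch below.
This is only a sketch: the dmsetup lines are from memory of the dm-log-writes
target interface, $LOGWRITES_DEV is a placeholder for the device that holds
the recording, and the cmp step is just illustrative (the real test checks
the replayed file against the fsync buffer that fsx dumps with -P).

(record all IO on a log-writes target created before mkfs,
 write a mark after fsync)
# dmsetup create logwrites-test --table \
	"0 $(blockdev --getsz $SCRATCH_DEV) log-writes $SCRATCH_DEV $LOGWRITES_DEV"
# mkfs.xfs -m reflink=1 /dev/mapper/logwrites-test
# mount /dev/mapper/logwrites-test $SCRATCH_MNT
# <run fsx with -i /dev/mapper/logwrites-test -j 5; it fsyncs and then
#  writes the testfile5.mark0 mark into the recording>
# umount $SCRATCH_MNT
# dmsetup remove logwrites-test

(replay up to the mark = restore from snapshot, mount to recover the xfs log,
 verify the fsynced file, then umount and check the fs)
# ./src/log-writes/replay-log --log $LOGWRITES_DEV --replay $SCRATCH_DEV \
	--end-mark testfile5.mark0
# mount $SCRATCH_DEV $SCRATCH_MNT
# cmp $SCRATCH_MNT/testfile5 /mnt/test/fsxtests/testfile5.mark0
# umount $SCRATCH_MNT
# xfs_repair -n $SCRATCH_DEV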