Re: xfs clones crash issue - illegal state 13 in block map

"Darrick J. Wong" <darrick.wong@xxxxxxxxxx> · Tue, 10 Oct 2017 12:18:25 -0700

On Mon, Oct 09, 2017 at 08:48:34PM +0800, Hou Tao wrote:
> Hi,
> 
> On 2017/9/19 13:38, Darrick J. Wong wrote:
> > On Fri, Sep 08, 2017 at 11:34:58AM +0300, Amir Goldstein wrote:
> >> On Thu, Sep 7, 2017 at 7:13 PM, Darrick J. Wong <darrick.wong@xxxxxxxxxx> wrote:
> >>> On Thu, Sep 07, 2017 at 03:58:56PM +0300, Amir Goldstein wrote:
> >>>> Hi guys,
> >>>>
> >>>> I am getting these errors often when running the crash tests
> >>>> with cloned files (generic/502 in my xfstests patches).
> >>>>
> >>>> Hitting these errors requires first fixing 2 other issues
> >>>> that shadow over this issue:
> >>>> "xfs: fix incorrect log_flushed on fsync" (in master)
> >>>> "xfs: fix leftover CoW extent after truncate"
> >>>> available on my tree based on Darrick's simple fix:
> >>>> https://github.com/amir73il/linux/commits/xfs-fsync
> >>>>
> >>>> I get the errors more often (1 out of 5) on a 100G fs on spinning disk.
> >>>> On a 10G fs on SSD they are less frequent.
> >>>> The log in this email was captured on patched stable 4.9.47 kernel,
> >>>> but I am getting the same errors on patched upstream kernel.
> >>>>
> >>>> I wasn't able to create a deterministic reproducer, so attaching
> >>>> the full log from a failed test along with an IO log that can be
> >>>> replayed on your disk to examine the outcome.
> >>>>
> >>>> Following is the output of fsx process #5, which is the process
> >>>> that wrote the problematic testfile5.mark0 to the log.
> >>>> This process performs only read,zero,fsync before creating
> >>>> the log mark.
> >>>> The file testfile5 was cloned from an origin 256K file before
> >>>> running fsx.
> >>>> Later, I used the random seed 35484 in this log for all
> >>>> processes and it seemed to increase the probability for failure.
> >>>>
> >>>> # /old/home/amir/src/xfstests-dev/ltp/fsx -N 100 -d -k -P
> >>>> /mnt/test/fsxtests -i /dev/mapper/logwrites-test -S 0 -j 5
> >>>> /mnt/scratch/testfile5
> >>>> Seed set to 35484
> >>>> file_size=262144
> >>>> 5: 1 read 0x3f959 thru 0x3ffff (0x6a7 bytes)
> >>>> 5: 2 zero from 0x3307e to 0x34f74, (0x1ef6 bytes)
> >>>> 5: 3 fsync
> >>>> 5: Dumped fsync buffer to testfile5.mark0
> >>>>
> >>>> In order to get to the crash state you need to get my
> >>>> xfstests replay-log patches and replay the attached log
> >>>> on a >= 100G scratch device:
> >>>>
> >>>> # ./src/log-writes/replay-log --log log.xfs.testfile5.mark0 --replay
> >>>> $SCRATCH_DEV --end-mark testfile5.mark0
> >>>> # mount $SCRATCH_DEV $SCRATCH_MNT
> >>>> # umount $SCRATCH_MNT
> >>>> # xfs_repair -n $SCRATCH_DEV
> >>>> Phase 1 - find and verify superblock...
> >>>> Phase 2 - using internal log
> >>>>         - zero log...
> >>>>         - scan filesystem freespace and inode maps...
> >>>>         - found root inode chunk
> >>>> Phase 3 - for each AG...
> >>>>         - scan (but don't clear) agi unlinked lists...
> >>>>         - process known inodes and perform inode discovery...
> >>>>         - agno = 0
> >>>>
> >>>> fatal error -- illegal state 13 in block map 376
> >>>>
> >>>> Can anyone provide some insight?
> >>>
> >>> Looks like I missed a couple of extent states in process_bmbt_reclist_int.
> >>>
> >>> What happens if you add the following (only compile tested) patch to
> >>> xfsprogs?
> >>
> >> This is what happens:
> >>
> >> Phase 1 - find and verify superblock...
> >> Phase 2 - using internal log
> >>         - zero log...
> >>         - scan filesystem freespace and inode maps...
> >>         - found root inode chunk
> >> Phase 3 - for each AG...
> >>         - scan (but don't clear) agi unlinked lists...
> >>         - process known inodes and perform inode discovery...
> >>         - agno = 0
> >> data fork in regular inode 134 claims CoW block 376
> >> correcting nextents for inode 134
> >> bad data fork in inode 134
> >> would have cleared inode 134
> >>         - agno = 1
> >>         - agno = 2
> >>         - agno = 3
> >>         - process newly discovered inodes...
> >> Phase 4 - check for duplicate blocks...
> >>         - setting up duplicate extent list...
> >> unknown block state, ag 0, block 376
> >> unknown block state, ag 1, block 16
> >>         - check for inodes claiming duplicate blocks...
> >>         - agno = 0
> >>         - agno = 1
> >>         - agno = 2
> >> entry "testfile2" in shortform directory 128 references free inode 134
> >>         - agno = 3
> >> would have junked entry "testfile2" in directory inode 128
> >> imap claims in-use inode 134 is free, would correct imap
> >> Missing reverse-mapping record for (0/376) len 1 owner 134 off 19
> >> No modify flag set, skipping phase 5
> >> Phase 6 - check inode connectivity...
> >>         - traversing filesystem ...
> >>         - traversal finished ...
> >>         - moving disconnected inodes to lost+found ...
> >> Phase 7 - verify link counts...
> >> No modify flag set, skipping filesystem flush and exiting.
> > 
> > Hmm.
> > 
> > # xfs_db bugs.img
> > xfs_db> inode 134
> > xfs_db> bmap
> > data offset 0 startblock 24 (0/24) count 19 flag 0
> > data offset 19 startblock 376 (0/376) count 1 flag 0
> > data offset 25 startblock 8388624 (1/16) count 4 flag 1
> > data offset 29 startblock 53 (0/53) count 35 flag 0
> > 
> > Ok, so inode 134 thinks it owns block 376.
> > 
> > xfs_db> agf 0
> > xfs_db> addr refcntroot
> > xfs_db> p
> > ...
> > recs[1-22] = [startblock,blockcount,refcount,cowflag] 
> > 1:[24,1,11,0] 
> > ...
> > 22:[87,1,6,0]
> > xfs_db>
> > 
> > Ok, so the refcountbt doesn't know about block 376, so it's neither
> > shared nor a cow staging extent.
> > 
> > xfs_db> agf 0 
> > xfs_db> addr rmaproot
> > xfs_db> p
> > ...
> > recs[1-38] = [startblock,blockcount,owner,offset,extentflag,attrfork,bmbtblock] 
> > 1:[0,1,-3,0,0,0,0] 
> > ...
> > 38:[376,1,-9,0,0,0,0]
> > xfs_db> 
> > 
> > So it seems that the rmapbt is behind the times, since record 38 should
> > be [376,1,134,19,0,0,0].  Either that's because there's something wrong
> > with the regular deferred ops mechanism, or log recovery of deferred ops.
> I think the log recovery of deferred ops is incorrect, and we also need to
> add more checks when manipulating the b-tree of rmap.
> 
> I had reproduced the problem on my KVM VM, and i will explain the incorrectness of
> log recovery of deferred ops by using my own example. If addition information is
> needed, please just ask.
> 
> The phenomena is the same: a block extent had been assigned to a file, but in the
> rmap b-tree the block extent belongs to COW procedure.
> 
> We can get the following output from xfs_db after mounting and umounting the device:
> 
> ./src/log-writes/replay-log --replay /dev/vdd --end-mark testfile7.mark0 --log /dev/vde
> 
> xfs_logprint -o /dev/vdd > pre_mount.log
> mount /dev/vdd /tmp/vdd
> umount /tmp/vdd
> xfs_logprint -o /dev/vdd > post_mount.log
> 
> xfs_db /dev/vdd
> xfs_db> inode 137
> xfs_db> p u3.bmx[3]
> u3.bmx[3] = [startoff,startblock,blockcount,extentflag]
> 3:[58,341,2,0]
> 
> xfs_db> agf 0
> xfs_db> addr rmaproot
> xfs_db> p recs[44]
> recs[44] = [startblock,blockcount,owner,offset,extentflag,attrfork,bmbtblock]
> 44:[341,2,-9,0,0,0,0]
> 
> The last log record in pre_mount.log (generated before mount) has the following ops
> which are related with inode 137:
> 
> # from pre_mount.log
> # initiate by xfs_reflink_end_cow()
> # decrease ref: [0x52, +2]
> # free cow: [0x155, +2]
> ----------------------------------------------------------------------------
> Oper (160): tid: c2350d60  len: 48  clientid: TRANS  flags: none
> CUI:  #regs: 1  num_extents: 2  id: 0xffff88011b37c680
> (s: 0x52, l: 2, f: 0x2)
> (s: 0x155, l: 2, f: 0x4)
> ----------------------------------------------------------------------------
> 
> # map shared: [0x155, +2] to inode 137
> ----------------------------------------------------------------------------
> Oper (161): tid: c2350d60  len: 48  clientid: TRANS  flags: none
> BUI:  #regs: 1  num_extents: 1  id: 0xffff880117f0c300
> (s: 0x155, l: 2, own: 137, off: 58, f: 0x1)
> ----------------------------------------------------------------------------
> 
> # for CUI Oper (160)
> ----------------------------------------------------------------------------
> Oper (194): tid: c2350d60  len: 16  clientid: TRANS  flags: none
> CUD:  #regs: 1                   id: 0xffff88011b37c680
> ----------------------------------------------------------------------------
> 
> # free rmap: [0x155, +2]
> ----------------------------------------------------------------------------
> Oper (197): tid: c2350d60  len: 48  clientid: TRANS  flags: none
> RUI:  #regs: 1  num_extents: 1  id: 0xffff880124a3ea00
> (s: 0x155, l: 2, own: -9, off: 0, f: 0x8)
> ----------------------------------------------------------------------------
> Oper (198): tid: c2350d60  len: 0  clientid: TRANS  flags: COMMIT
> ============================================================================
> 
> According to the implementation of xfs_defer_finish(), these ops should be completed
> in the following sequence:
> 
> Have been done:
> (1) CUI: Oper (160)
> (2) BUI: Oper (161)
> (3) CUD: Oper (194), for CUI Oper (160)
> (4) RUI A: Oper (197), free rmap [0x155, 2, -9]
> 
> Should be done:
> (5) BUD: for BUI Oper (161)
> (6) RUI B: add rmap [0x155, 2, 137]
> (7) RUD: for RUI A
> (8) RUD: for RUI B
> 
> Actually be done by xlog_recover_process_intents()
> (5) BUD: for BUI Oper (161)
> (6) RUI B: add rmap [0x155, 2, 137]
> (7) RUD: for RUI B
> (8) RUD: for RUI A
> 
> So the rmap entry [0x155, 2, -9] for COW should be freed firstly,
> then a new rmap entry [0x155, 2, 137] will be added. However, as we can see
> from the log record in post_mount.log (generated after umount) and the trace
> print, the new rmap entry [0x155, 2, 137] are added firstly, then the rmap
> entry [0x155, 2, -9] are freed.

Hmm, yes, the BUI replay causes RUI B to be created and attached to the
transaction it created to replay the BUI... and then we make the fatal
mistake to finish the BUI's accumulated defer_ops before replaying RUI
A.  If we were /not/ in log recovery we'd append the new intent items to
the end of the defer_ops that's coordinating all the intent items, but
in log recovery each intent starts its own transaction and manages its
own defer_ops, so everything runs out of order.

A way to solve this is if xlog_recover_process_intents instead creates
its own defer_ops and passes that into xlog_recover_process_*ui to
collect all the /new/ intent items.  Once we're done dealing with the
intent items that were already in the log, we can create another
transaction and xfs_defer_finish all the new defer_ops.

Hard questions: We'd also have to collect block usage requirements for
the transaction and it's not clear if it's problematic to be collecting
all these follow-on intents (which may have come from totally different
defer_ops pre-crash) in a single defer_ops.  I'll have to go have a
think about this one.

> # from post_mount.log
> ----------------------------------------------------------------------------
> Oper (2): tid: 3c5adcc1  len: 16  clientid: TRANS  flags: none
> BUD:  #regs: 1                   id: 0xffff880117f0c300
> ----------------------------------------------------------------------------
> 
> # add rmap [0x155, 2, 137]
> ----------------------------------------------------------------------------
> Oper (6): tid: 3c5adcc1  len: 48  clientid: TRANS  flags: none
> RUI:  #regs: 1  num_extents: 1  id: 0xffff880125bdc2a0
> (s: 0x155, l: 2, own: 137, off: 58, f: 0x2)
> ----------------------------------------------------------------------------
> Oper (7): tid: 3c5adcc1  len: 16  clientid: TRANS  flags: none
> RUD:  #regs: 1                   id: 0xffff880125bdc2a0
> ----------------------------------------------------------------------------
> 
> # remove rmap [0x155, 2, -9]
> ----------------------------------------------------------------------------
> Oper (20): tid: 3c5adcc1  len: 16  clientid: TRANS  flags: none
> RUD:  #regs: 1                   id: 0xffff880124a3ea00
> ----------------------------------------------------------------------------
> 
> # from trace print
> 
> # add rmap [0x155, 2, 137], find [0x12d, 46, -9], and insert [0x155, 2, 137]
> xfs_rmap_defer: dev 253:48 op 1 agno 0 agbno 341 owner 137 data offset 58 len 2 state 0
> xfs_rmap_deferred: dev 253:48 op 1 agno 0 agbno 341 owner 137 data offset 58 len 2 state 0
> xfs_rmap_map: dev 253:48 agno 0 agbno 341 len 2 owner 137 offset 58 flags 0x0
> xfs_rmap_find_left_neighbor_query: dev 253:48 agno 0 agbno 341 len 0 owner 137 offset 58 flags 0x0
> xfs_rmap_find_left_neighbor_candidate: dev 253:48 agno 0 agbno 301 len 46 owner -9 offset 0 flags 0x0
> xfs_rmap_insert: dev 253:48 agno 0 agbno 341 len 2 owner 137 offset 58 flags 0x0
> xfs_rmap_map_done: dev 253:48 agno 0 agbno 341 len 2 owner 137 offset 58 flags 0x0
> 
> # del rmap [0x155, 2, -9], find [0x155, 2, 137], and remove [0x155, 2, 137]
> xfs_rmap_deferred: dev 253:48 op 7 agno 0 agbno 341 owner -9 data offset 0 len 2 state 0
> xfs_rmap_unmap: dev 253:48 agno 0 agbno 341 len 2 owner -9 offset 0 flags 0x0
> xfs_rmap_lookup_le_range_result: dev 253:48 agno 0 agbno 341 len 2 owner 137 offset 58 flags 0x0

So we do a non-overlapped lookup for [0x155, 2, -9] and we get [0x155,
2, 137] because the non-overlapped lookups stop when they get to the
first record for the block number, and -9 comes after 137.  We /ought/
at least to bail out of _rmap_unmap except for this garbled check:

/* Make sure the owner matches what we expect to find in the tree. */
XFS_WANT_CORRUPTED_GOTO(mp, owner == ltrec.rm_owner ||
			    XFS_RMAP_NON_INODE_OWNER(owner), out_error);

I think that really ought to be "owner == ltrec.rm_owner || owner ==
XFS_RMAP_OWN_UNKNOWN" instead of what's there.

Ofc that will just make it so that recovery blows up and the fs is
unmountable, which doesn't get us very far if we don't fix the problem
of the misordered intent item replays.

--D

> xfs_rmap_delete: dev 253:48 agno 0 agbno 341 len 2 owner 137 offset 58 flags 0x0
> xfs_rmap_unmap_done: dev 253:48 agno 0 agbno 341 len 2 owner -9 offset 0 flags 0x0
> 
> # del rmap [0x12d, 40, -9]
> xfs_rmap_unmap: dev 253:48 agno 0 agbno 301 len 40 owner -9 offset 0 flags 0x0
> # del rmap [0x157, 4, -9]
> xfs_rmap_unmap: dev 253:48 agno 0 agbno 343 len 4 owner -9 offset 0 flags 0x0
> xfs_rmap_update: dev 253:48 agno 0 agbno 341 len 2 owner -9 offset 0 flags 0x0
> 
> Regards,
> 
> Tao
> 
> > Let's have a look at the log...
> > 
> > # xfs_logprint bugs.img | grep x178 -B3
> > Oper (426): tid: 5c75a3f  len: 48  clientid: TRANS  flags: none
> > CUI:  #regs: 1  num_extents: 2  id: 0xffff8f0e842f74b8
> > (s: 0x2b, l: 1, f: 0x2) 
> > (s: 0x178, l: 1, f: 0x4) 
> > 
> > ...ok, so we XFS_REFCOUNT_FREE_COW block 376, which removes it from from
> > the refcountbt...
> > 
> > ----------------------------------------------------------------------------
> > Oper (427): tid: 5c75a3f  len: 48  clientid: TRANS  flags: none
> > BUI:  #regs: 1  num_extents: 1  id: 0xffff8f0e81f39000
> > (s: 0x178, l: 1, own: 134, off: 19, f: 0x1) 
> > 
> > ...now we map block 376 into the data fork of inode 134 at offset 19...
> > 
> > --
> > ----------------------------------------------------------------------------
> > Oper (447): tid: 5c75a3f  len: 48  clientid: TRANS  flags: none
> > RUI:  #regs: 1  num_extents: 1  id: 0xffff8f0e48254000
> > (s: 0x178, l: 1, own: -9, off: 0, f: 0x8) 
> > 
> > ...now we XFS_RMAP_EXTENT_FREE the CoW extent @ block 376 from the rmap,
> > which ought to remove the record...
> > 
> > --
> > ----------------------------------------------------------------------------
> > Oper (453): tid: e0ecea6a  len: 48  clientid: TRANS  flags: none
> > RUI:  #regs: 1  num_extents: 1  id: 0xffff8f0e48254530
> > (s: 0x178, l: 1, own: 134, off: 19, f: 0x2) 
> > 
> > ...and XFS_RMAP_EXTENT_MAP_SHARED a new rmapping for block 376, which
> > should add the [376,1,134,19,0,0,0] that we'd expect to be there...
> > 
> > --
> > ----------------------------------------------------------------------------
> > Oper (13): tid: 17c5790c  len: 48  clientid: TRANS  flags: none
> > RUI:  #regs: 1  num_extents: 1  id: 0xffff925955341f20
> > (s: 0x178, l: 1, own: 134, off: 19, f: 0x2) 
> > 
> > ...strange, another rmap add?  Did the RUIs fail to complete?
> > 
> > # xfs_logprint bugs.img | egrep '(0xffff8f0e48254530|0xffff925955341f20)'
> > RUI:  #regs: 1  num_extents: 1  id: 0xffff8f0e48254530
> > RUD:  #regs: 1                   id: 0xffff8f0e48254530
> > RUI:  #regs: 1  num_extents: 1  id: 0xffff925955341f20
> > RUD:  #regs: 1                   id: 0xffff925955341f20
> > 
> > ...the first RUI is logged as operation 453 in tid e0ecea6a, and it has
> > an RUD at 460.  However, I don't see a COMMIT, so I gather that log
> > recovery tried to replay the RUI as operation 13 in tid 17c5790c.
> > 
> > ----------------------------------------------------------------------------
> > Oper (13): tid: 17c5790c  len: 48  clientid: TRANS  flags: none
> > RUI:  #regs: 1  num_extents: 1  id: 0xffff925955341f20
> > (s: 0x178, l: 1, own: 134, off: 19, f: 0x2) 
> > ----------------------------------------------------------------------------
> > Oper (14): tid: 17c5790c  len: 16  clientid: TRANS  flags: none
> > RUD:  #regs: 1                   id: 0xffff925955341f20
> > ----------------------------------------------------------------------------
> > Oper (15): tid: 17c5790c  len: 16  clientid: TRANS  flags: none
> > RUD:  #regs: 1                   id: 0xffff8f0e48254000
> > ----------------------------------------------------------------------------
> > Oper (16): tid: 17c5790c  len: 24  clientid: TRANS  flags: none
> > BUF:  #regs: 3   start blkno: 40 (0x28)  len: 8  bmap size: 1  flags: 0x2000
> > Oper (17): tid: 17c5790c  len: 128  clientid: TRANS  flags: none
> > BUF DATA
> >  0 33424d52 31000000 ffffffff ffffffff        0 28000000  1000000 38000000 
> >  8 3674196c 564cfddf 7798e496 591b6410        0 42249385        0  1000000 
> > 10 ffffffff fdffffff        0        0  1000000  2000000 ffffffff fbffffff 
> > 18        0        0  3000000  2000000 ffffffff faffffff        0        0 
> > 
> > Oper (18): tid: 17c5790c  len: 128  clientid: TRANS  flags: none
> > BUF DATA
> >  0        0        0 4d010000  9000000        0 8a000000        0 2a000000 
> >  8 56010000  d000000 ffffffff f7ffffff        0        0 65010000 6b000000 
> > 10 ffffffff f7ffffff        0        0 78010000  1000000        0 86000000 
> > 18        0 13000000        0        0        0        0        0        0 
> > 
> > ----------------------------------------------------------------------------
> > Oper (19): tid: 17c5790c  len: 0  clientid: TRANS  flags: COMMIT 
> > 
> > Ok, so those last two lines in operation 16 show that we've indeed added the
> > new record to the rmapbt... <shrug> <confused> <tired>??
> > 
> > Are we sure that the log recovery actions actually get checkpointed
> > correctly?  And just to check my own sanity, we generate a snapshot and
> > then recover it and run xfs_repair, right?
> > 
> > --D
> > 
> >>
> >>>
> >>> (Normally I'd say send a metadump too for us mere mortals to work with,
> >>> though I'm about to plunge into weddingland so I likely won't be able to
> >>> do much until the 18th.)
> >>>
> >>
> >> Attached (used xfs_metadump -ao)
> >> Soon we will all be gods with powers to replay history ;)
> >>
> >>> ((Eric: If this doesn't turn out to be a totally garbage patch, feel
> >>> free to add it to xfsprogs.))
> >>>
> >>> --D
> >>>
> > 
> > 
> > --
> > To unsubscribe from this list: send the line "unsubscribe linux-xfs" in
> > the body of a message to majordomo@xxxxxxxxxxxxxxx
> > More majordomo info at  http://vger.kernel.org/majordomo-info.html
> > 
> > .
> > 
> 
> --
> To unsubscribe from this list: send the line "unsubscribe linux-xfs" in
> the body of a message to majordomo@xxxxxxxxxxxxxxx
> More majordomo info at  http://vger.kernel.org/majordomo-info.html
--
To unsubscribe from this list: send the line "unsubscribe linux-xfs" in
the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at  http://vger.kernel.org/majordomo-info.html