Re: xfs clones crash issue - illegal state 13 in block map

"Darrick J. Wong" <darrick.wong@xxxxxxxxxx> · Wed, 22 Nov 2017 10:25:48 -0800

On Tue, Oct 10, 2017 at 12:18:25PM -0700, Darrick J. Wong wrote:
> On Mon, Oct 09, 2017 at 08:48:34PM +0800, Hou Tao wrote:
> > Hi,
> > 
> > On 2017/9/19 13:38, Darrick J. Wong wrote:
> > > On Fri, Sep 08, 2017 at 11:34:58AM +0300, Amir Goldstein wrote:
> > >> On Thu, Sep 7, 2017 at 7:13 PM, Darrick J. Wong <darrick.wong@xxxxxxxxxx> wrote:
> > >>> On Thu, Sep 07, 2017 at 03:58:56PM +0300, Amir Goldstein wrote:
> > >>>> Hi guys,
> > >>>>
> > >>>> I am getting these errors often when running the crash tests
> > >>>> with cloned files (generic/502 in my xfstests patches).
> > >>>>
> > >>>> Hitting these errors requires first fixing 2 other issues
> > >>>> that shadow over this issue:
> > >>>> "xfs: fix incorrect log_flushed on fsync" (in master)
> > >>>> "xfs: fix leftover CoW extent after truncate"
> > >>>> available on my tree based on Darrick's simple fix:
> > >>>> https://github.com/amir73il/linux/commits/xfs-fsync
> > >>>>
> > >>>> I get the errors more often (1 out of 5) on a 100G fs on spinning disk.
> > >>>> On a 10G fs on SSD they are less frequent.
> > >>>> The log in this email was captured on patched stable 4.9.47 kernel,
> > >>>> but I am getting the same errors on patched upstream kernel.
> > >>>>
> > >>>> I wasn't able to create a deterministic reproducer, so attaching
> > >>>> the full log from a failed test along with an IO log that can be
> > >>>> replayed on your disk to examine the outcome.
> > >>>>
> > >>>> Following is the output of fsx process #5, which is the process
> > >>>> that wrote the problematic testfile5.mark0 to the log.
> > >>>> This process performs only read,zero,fsync before creating
> > >>>> the log mark.
> > >>>> The file testfile5 was cloned from an origin 256K file before
> > >>>> running fsx.
> > >>>> Later, I used the random seed 35484 in this log for all
> > >>>> processes and it seemed to increase the probability for failure.
> > >>>>
> > >>>> # /old/home/amir/src/xfstests-dev/ltp/fsx -N 100 -d -k -P
> > >>>> /mnt/test/fsxtests -i /dev/mapper/logwrites-test -S 0 -j 5
> > >>>> /mnt/scratch/testfile5
> > >>>> Seed set to 35484
> > >>>> file_size=262144
> > >>>> 5: 1 read 0x3f959 thru 0x3ffff (0x6a7 bytes)
> > >>>> 5: 2 zero from 0x3307e to 0x34f74, (0x1ef6 bytes)
> > >>>> 5: 3 fsync
> > >>>> 5: Dumped fsync buffer to testfile5.mark0
> > >>>>
> > >>>> In order to get to the crash state you need to get my
> > >>>> xfstests replay-log patches and replay the attached log
> > >>>> on a >= 100G scratch device:
> > >>>>
> > >>>> # ./src/log-writes/replay-log --log log.xfs.testfile5.mark0 --replay
> > >>>> $SCRATCH_DEV --end-mark testfile5.mark0
> > >>>> # mount $SCRATCH_DEV $SCRATCH_MNT
> > >>>> # umount $SCRATCH_MNT
> > >>>> # xfs_repair -n $SCRATCH_DEV
> > >>>> Phase 1 - find and verify superblock...
> > >>>> Phase 2 - using internal log
> > >>>>         - zero log...
> > >>>>         - scan filesystem freespace and inode maps...
> > >>>>         - found root inode chunk
> > >>>> Phase 3 - for each AG...
> > >>>>         - scan (but don't clear) agi unlinked lists...
> > >>>>         - process known inodes and perform inode discovery...
> > >>>>         - agno = 0
> > >>>>
> > >>>> fatal error -- illegal state 13 in block map 376
> > >>>>
> > >>>> Can anyone provide some insight?
> > >>>
> > >>> Looks like I missed a couple of extent states in process_bmbt_reclist_int.
> > >>>
> > >>> What happens if you add the following (only compile tested) patch to
> > >>> xfsprogs?
> > >>
> > >> This is what happens:
> > >>
> > >> Phase 1 - find and verify superblock...
> > >> Phase 2 - using internal log
> > >>         - zero log...
> > >>         - scan filesystem freespace and inode maps...
> > >>         - found root inode chunk
> > >> Phase 3 - for each AG...
> > >>         - scan (but don't clear) agi unlinked lists...
> > >>         - process known inodes and perform inode discovery...
> > >>         - agno = 0
> > >> data fork in regular inode 134 claims CoW block 376
> > >> correcting nextents for inode 134
> > >> bad data fork in inode 134
> > >> would have cleared inode 134
> > >>         - agno = 1
> > >>         - agno = 2
> > >>         - agno = 3
> > >>         - process newly discovered inodes...
> > >> Phase 4 - check for duplicate blocks...
> > >>         - setting up duplicate extent list...
> > >> unknown block state, ag 0, block 376
> > >> unknown block state, ag 1, block 16
> > >>         - check for inodes claiming duplicate blocks...
> > >>         - agno = 0
> > >>         - agno = 1
> > >>         - agno = 2
> > >> entry "testfile2" in shortform directory 128 references free inode 134
> > >>         - agno = 3
> > >> would have junked entry "testfile2" in directory inode 128
> > >> imap claims in-use inode 134 is free, would correct imap
> > >> Missing reverse-mapping record for (0/376) len 1 owner 134 off 19
> > >> No modify flag set, skipping phase 5
> > >> Phase 6 - check inode connectivity...
> > >>         - traversing filesystem ...
> > >>         - traversal finished ...
> > >>         - moving disconnected inodes to lost+found ...
> > >> Phase 7 - verify link counts...
> > >> No modify flag set, skipping filesystem flush and exiting.
> > > 
> > > Hmm.
> > > 
> > > # xfs_db bugs.img
> > > xfs_db> inode 134
> > > xfs_db> bmap
> > > data offset 0 startblock 24 (0/24) count 19 flag 0
> > > data offset 19 startblock 376 (0/376) count 1 flag 0
> > > data offset 25 startblock 8388624 (1/16) count 4 flag 1
> > > data offset 29 startblock 53 (0/53) count 35 flag 0
> > > 
> > > Ok, so inode 134 thinks it owns block 376.
> > > 
> > > xfs_db> agf 0
> > > xfs_db> addr refcntroot
> > > xfs_db> p
> > > ...
> > > recs[1-22] = [startblock,blockcount,refcount,cowflag] 
> > > 1:[24,1,11,0] 
> > > ...
> > > 22:[87,1,6,0]
> > > xfs_db>
> > > 
> > > Ok, so the refcountbt doesn't know about block 376, so it's neither
> > > shared nor a cow staging extent.
> > > 
> > > xfs_db> agf 0 
> > > xfs_db> addr rmaproot
> > > xfs_db> p
> > > ...
> > > recs[1-38] = [startblock,blockcount,owner,offset,extentflag,attrfork,bmbtblock] 
> > > 1:[0,1,-3,0,0,0,0] 
> > > ...
> > > 38:[376,1,-9,0,0,0,0]
> > > xfs_db> 
> > > 
> > > So it seems that the rmapbt is behind the times, since record 38 should
> > > be [376,1,134,19,0,0,0].  Either that's because there's something wrong
> > > with the regular deferred ops mechanism, or log recovery of deferred ops.
> > I think the log recovery of deferred ops is incorrect, and we also need to
> > add more checks when manipulating the b-tree of rmap.
> > 
> > I had reproduced the problem on my KVM VM, and i will explain the incorrectness of
> > log recovery of deferred ops by using my own example. If addition information is
> > needed, please just ask.
> > 
> > The phenomena is the same: a block extent had been assigned to a file, but in the
> > rmap b-tree the block extent belongs to COW procedure.
> > 
> > We can get the following output from xfs_db after mounting and umounting the device:
> > 
> > ./src/log-writes/replay-log --replay /dev/vdd --end-mark testfile7.mark0 --log /dev/vde
> > 
> > xfs_logprint -o /dev/vdd > pre_mount.log
> > mount /dev/vdd /tmp/vdd
> > umount /tmp/vdd
> > xfs_logprint -o /dev/vdd > post_mount.log
> > 
> > xfs_db /dev/vdd
> > xfs_db> inode 137
> > xfs_db> p u3.bmx[3]
> > u3.bmx[3] = [startoff,startblock,blockcount,extentflag]
> > 3:[58,341,2,0]
> > 
> > xfs_db> agf 0
> > xfs_db> addr rmaproot
> > xfs_db> p recs[44]
> > recs[44] = [startblock,blockcount,owner,offset,extentflag,attrfork,bmbtblock]
> > 44:[341,2,-9,0,0,0,0]
> > 
> > The last log record in pre_mount.log (generated before mount) has the following ops
> > which are related with inode 137:
> > 
> > # from pre_mount.log
> > # initiate by xfs_reflink_end_cow()
> > # decrease ref: [0x52, +2]
> > # free cow: [0x155, +2]
> > ----------------------------------------------------------------------------
> > Oper (160): tid: c2350d60  len: 48  clientid: TRANS  flags: none
> > CUI:  #regs: 1  num_extents: 2  id: 0xffff88011b37c680
> > (s: 0x52, l: 2, f: 0x2)
> > (s: 0x155, l: 2, f: 0x4)
> > ----------------------------------------------------------------------------
> > 
> > # map shared: [0x155, +2] to inode 137
> > ----------------------------------------------------------------------------
> > Oper (161): tid: c2350d60  len: 48  clientid: TRANS  flags: none
> > BUI:  #regs: 1  num_extents: 1  id: 0xffff880117f0c300
> > (s: 0x155, l: 2, own: 137, off: 58, f: 0x1)
> > ----------------------------------------------------------------------------
> > 
> > # for CUI Oper (160)
> > ----------------------------------------------------------------------------
> > Oper (194): tid: c2350d60  len: 16  clientid: TRANS  flags: none
> > CUD:  #regs: 1                   id: 0xffff88011b37c680
> > ----------------------------------------------------------------------------
> > 
> > # free rmap: [0x155, +2]
> > ----------------------------------------------------------------------------
> > Oper (197): tid: c2350d60  len: 48  clientid: TRANS  flags: none
> > RUI:  #regs: 1  num_extents: 1  id: 0xffff880124a3ea00
> > (s: 0x155, l: 2, own: -9, off: 0, f: 0x8)
> > ----------------------------------------------------------------------------
> > Oper (198): tid: c2350d60  len: 0  clientid: TRANS  flags: COMMIT
> > ============================================================================
> > 
> > According to the implementation of xfs_defer_finish(), these ops should be completed
> > in the following sequence:
> > 
> > Have been done:
> > (1) CUI: Oper (160)
> > (2) BUI: Oper (161)
> > (3) CUD: Oper (194), for CUI Oper (160)
> > (4) RUI A: Oper (197), free rmap [0x155, 2, -9]
> > 
> > Should be done:
> > (5) BUD: for BUI Oper (161)
> > (6) RUI B: add rmap [0x155, 2, 137]
> > (7) RUD: for RUI A
> > (8) RUD: for RUI B
> > 
> > Actually be done by xlog_recover_process_intents()
> > (5) BUD: for BUI Oper (161)
> > (6) RUI B: add rmap [0x155, 2, 137]
> > (7) RUD: for RUI B
> > (8) RUD: for RUI A
> > 
> > So the rmap entry [0x155, 2, -9] for COW should be freed firstly,
> > then a new rmap entry [0x155, 2, 137] will be added. However, as we can see
> > from the log record in post_mount.log (generated after umount) and the trace
> > print, the new rmap entry [0x155, 2, 137] are added firstly, then the rmap
> > entry [0x155, 2, -9] are freed.
> 
> Hmm, yes, the BUI replay causes RUI B to be created and attached to the
> transaction it created to replay the BUI... and then we make the fatal
> mistake to finish the BUI's accumulated defer_ops before replaying RUI
> A.  If we were /not/ in log recovery we'd append the new intent items to
> the end of the defer_ops that's coordinating all the intent items, but
> in log recovery each intent starts its own transaction and manages its
> own defer_ops, so everything runs out of order.
> 
> A way to solve this is if xlog_recover_process_intents instead creates
> its own defer_ops and passes that into xlog_recover_process_*ui to
> collect all the /new/ intent items.  Once we're done dealing with the
> intent items that were already in the log, we can create another
> transaction and xfs_defer_finish all the new defer_ops.
> 
> Hard questions: We'd also have to collect block usage requirements for
> the transaction and it's not clear if it's problematic to be collecting
> all these follow-on intents (which may have come from totally different
> defer_ops pre-crash) in a single defer_ops.  I'll have to go have a
> think about this one.

Sorry for letting this one drag on for a month.  It occurred to me that
log recovery is the only user of the filesystem at this point, so we
could simply throw all the free space at the defer_ops cleanup
transaction.  I've come up with a RFC-quality patch that restructures
the intent item replay code and will start a new thread to discuss it.

--D

> > # from post_mount.log
> > ----------------------------------------------------------------------------
> > Oper (2): tid: 3c5adcc1  len: 16  clientid: TRANS  flags: none
> > BUD:  #regs: 1                   id: 0xffff880117f0c300
> > ----------------------------------------------------------------------------
> > 
> > # add rmap [0x155, 2, 137]
> > ----------------------------------------------------------------------------
> > Oper (6): tid: 3c5adcc1  len: 48  clientid: TRANS  flags: none
> > RUI:  #regs: 1  num_extents: 1  id: 0xffff880125bdc2a0
> > (s: 0x155, l: 2, own: 137, off: 58, f: 0x2)
> > ----------------------------------------------------------------------------
> > Oper (7): tid: 3c5adcc1  len: 16  clientid: TRANS  flags: none
> > RUD:  #regs: 1                   id: 0xffff880125bdc2a0
> > ----------------------------------------------------------------------------
> > 
> > # remove rmap [0x155, 2, -9]
> > ----------------------------------------------------------------------------
> > Oper (20): tid: 3c5adcc1  len: 16  clientid: TRANS  flags: none
> > RUD:  #regs: 1                   id: 0xffff880124a3ea00
> > ----------------------------------------------------------------------------
> > 
> > # from trace print
> > 
> > # add rmap [0x155, 2, 137], find [0x12d, 46, -9], and insert [0x155, 2, 137]
> > xfs_rmap_defer: dev 253:48 op 1 agno 0 agbno 341 owner 137 data offset 58 len 2 state 0
> > xfs_rmap_deferred: dev 253:48 op 1 agno 0 agbno 341 owner 137 data offset 58 len 2 state 0
> > xfs_rmap_map: dev 253:48 agno 0 agbno 341 len 2 owner 137 offset 58 flags 0x0
> > xfs_rmap_find_left_neighbor_query: dev 253:48 agno 0 agbno 341 len 0 owner 137 offset 58 flags 0x0
> > xfs_rmap_find_left_neighbor_candidate: dev 253:48 agno 0 agbno 301 len 46 owner -9 offset 0 flags 0x0
> > xfs_rmap_insert: dev 253:48 agno 0 agbno 341 len 2 owner 137 offset 58 flags 0x0
> > xfs_rmap_map_done: dev 253:48 agno 0 agbno 341 len 2 owner 137 offset 58 flags 0x0
> > 
> > # del rmap [0x155, 2, -9], find [0x155, 2, 137], and remove [0x155, 2, 137]
> > xfs_rmap_deferred: dev 253:48 op 7 agno 0 agbno 341 owner -9 data offset 0 len 2 state 0
> > xfs_rmap_unmap: dev 253:48 agno 0 agbno 341 len 2 owner -9 offset 0 flags 0x0
> > xfs_rmap_lookup_le_range_result: dev 253:48 agno 0 agbno 341 len 2 owner 137 offset 58 flags 0x0
> 
> So we do a non-overlapped lookup for [0x155, 2, -9] and we get [0x155,
> 2, 137] because the non-overlapped lookups stop when they get to the
> first record for the block number, and -9 comes after 137.  We /ought/
> at least to bail out of _rmap_unmap except for this garbled check:
> 
> /* Make sure the owner matches what we expect to find in the tree. */
> XFS_WANT_CORRUPTED_GOTO(mp, owner == ltrec.rm_owner ||
> 			    XFS_RMAP_NON_INODE_OWNER(owner), out_error);
> 
> I think that really ought to be "owner == ltrec.rm_owner || owner ==
> XFS_RMAP_OWN_UNKNOWN" instead of what's there.
> 
> Ofc that will just make it so that recovery blows up and the fs is
> unmountable, which doesn't get us very far if we don't fix the problem
> of the misordered intent item replays.
> 
> --D
> 
> > xfs_rmap_delete: dev 253:48 agno 0 agbno 341 len 2 owner 137 offset 58 flags 0x0
> > xfs_rmap_unmap_done: dev 253:48 agno 0 agbno 341 len 2 owner -9 offset 0 flags 0x0
> > 
> > # del rmap [0x12d, 40, -9]
> > xfs_rmap_unmap: dev 253:48 agno 0 agbno 301 len 40 owner -9 offset 0 flags 0x0
> > # del rmap [0x157, 4, -9]
> > xfs_rmap_unmap: dev 253:48 agno 0 agbno 343 len 4 owner -9 offset 0 flags 0x0
> > xfs_rmap_update: dev 253:48 agno 0 agbno 341 len 2 owner -9 offset 0 flags 0x0
> > 
> > Regards,
> > 
> > Tao
> > 
> > > Let's have a look at the log...
> > > 
> > > # xfs_logprint bugs.img | grep x178 -B3
> > > Oper (426): tid: 5c75a3f  len: 48  clientid: TRANS  flags: none
> > > CUI:  #regs: 1  num_extents: 2  id: 0xffff8f0e842f74b8
> > > (s: 0x2b, l: 1, f: 0x2) 
> > > (s: 0x178, l: 1, f: 0x4) 
> > > 
> > > ...ok, so we XFS_REFCOUNT_FREE_COW block 376, which removes it from from
> > > the refcountbt...
> > > 
> > > ----------------------------------------------------------------------------
> > > Oper (427): tid: 5c75a3f  len: 48  clientid: TRANS  flags: none
> > > BUI:  #regs: 1  num_extents: 1  id: 0xffff8f0e81f39000
> > > (s: 0x178, l: 1, own: 134, off: 19, f: 0x1) 
> > > 
> > > ...now we map block 376 into the data fork of inode 134 at offset 19...
> > > 
> > > --
> > > ----------------------------------------------------------------------------
> > > Oper (447): tid: 5c75a3f  len: 48  clientid: TRANS  flags: none
> > > RUI:  #regs: 1  num_extents: 1  id: 0xffff8f0e48254000
> > > (s: 0x178, l: 1, own: -9, off: 0, f: 0x8) 
> > > 
> > > ...now we XFS_RMAP_EXTENT_FREE the CoW extent @ block 376 from the rmap,
> > > which ought to remove the record...
> > > 
> > > --
> > > ----------------------------------------------------------------------------
> > > Oper (453): tid: e0ecea6a  len: 48  clientid: TRANS  flags: none
> > > RUI:  #regs: 1  num_extents: 1  id: 0xffff8f0e48254530
> > > (s: 0x178, l: 1, own: 134, off: 19, f: 0x2) 
> > > 
> > > ...and XFS_RMAP_EXTENT_MAP_SHARED a new rmapping for block 376, which
> > > should add the [376,1,134,19,0,0,0] that we'd expect to be there...
> > > 
> > > --
> > > ----------------------------------------------------------------------------
> > > Oper (13): tid: 17c5790c  len: 48  clientid: TRANS  flags: none
> > > RUI:  #regs: 1  num_extents: 1  id: 0xffff925955341f20
> > > (s: 0x178, l: 1, own: 134, off: 19, f: 0x2) 
> > > 
> > > ...strange, another rmap add?  Did the RUIs fail to complete?
> > > 
> > > # xfs_logprint bugs.img | egrep '(0xffff8f0e48254530|0xffff925955341f20)'
> > > RUI:  #regs: 1  num_extents: 1  id: 0xffff8f0e48254530
> > > RUD:  #regs: 1                   id: 0xffff8f0e48254530
> > > RUI:  #regs: 1  num_extents: 1  id: 0xffff925955341f20
> > > RUD:  #regs: 1                   id: 0xffff925955341f20
> > > 
> > > ...the first RUI is logged as operation 453 in tid e0ecea6a, and it has
> > > an RUD at 460.  However, I don't see a COMMIT, so I gather that log
> > > recovery tried to replay the RUI as operation 13 in tid 17c5790c.
> > > 
> > > ----------------------------------------------------------------------------
> > > Oper (13): tid: 17c5790c  len: 48  clientid: TRANS  flags: none
> > > RUI:  #regs: 1  num_extents: 1  id: 0xffff925955341f20
> > > (s: 0x178, l: 1, own: 134, off: 19, f: 0x2) 
> > > ----------------------------------------------------------------------------
> > > Oper (14): tid: 17c5790c  len: 16  clientid: TRANS  flags: none
> > > RUD:  #regs: 1                   id: 0xffff925955341f20
> > > ----------------------------------------------------------------------------
> > > Oper (15): tid: 17c5790c  len: 16  clientid: TRANS  flags: none
> > > RUD:  #regs: 1                   id: 0xffff8f0e48254000
> > > ----------------------------------------------------------------------------
> > > Oper (16): tid: 17c5790c  len: 24  clientid: TRANS  flags: none
> > > BUF:  #regs: 3   start blkno: 40 (0x28)  len: 8  bmap size: 1  flags: 0x2000
> > > Oper (17): tid: 17c5790c  len: 128  clientid: TRANS  flags: none
> > > BUF DATA
> > >  0 33424d52 31000000 ffffffff ffffffff        0 28000000  1000000 38000000 
> > >  8 3674196c 564cfddf 7798e496 591b6410        0 42249385        0  1000000 
> > > 10 ffffffff fdffffff        0        0  1000000  2000000 ffffffff fbffffff 
> > > 18        0        0  3000000  2000000 ffffffff faffffff        0        0 
> > > 
> > > Oper (18): tid: 17c5790c  len: 128  clientid: TRANS  flags: none
> > > BUF DATA
> > >  0        0        0 4d010000  9000000        0 8a000000        0 2a000000 
> > >  8 56010000  d000000 ffffffff f7ffffff        0        0 65010000 6b000000 
> > > 10 ffffffff f7ffffff        0        0 78010000  1000000        0 86000000 
> > > 18        0 13000000        0        0        0        0        0        0 
> > > 
> > > ----------------------------------------------------------------------------
> > > Oper (19): tid: 17c5790c  len: 0  clientid: TRANS  flags: COMMIT 
> > > 
> > > Ok, so those last two lines in operation 16 show that we've indeed added the
> > > new record to the rmapbt... <shrug> <confused> <tired>??
> > > 
> > > Are we sure that the log recovery actions actually get checkpointed
> > > correctly?  And just to check my own sanity, we generate a snapshot and
> > > then recover it and run xfs_repair, right?
> > > 
> > > --D
> > > 
> > >>
> > >>>
> > >>> (Normally I'd say send a metadump too for us mere mortals to work with,
> > >>> though I'm about to plunge into weddingland so I likely won't be able to
> > >>> do much until the 18th.)
> > >>>
> > >>
> > >> Attached (used xfs_metadump -ao)
> > >> Soon we will all be gods with powers to replay history ;)
> > >>
> > >>> ((Eric: If this doesn't turn out to be a totally garbage patch, feel
> > >>> free to add it to xfsprogs.))
> > >>>
> > >>> --D
> > >>>
> > > 
> > > 
> > > --
> > > To unsubscribe from this list: send the line "unsubscribe linux-xfs" in
> > > the body of a message to majordomo@xxxxxxxxxxxxxxx
> > > More majordomo info at  http://vger.kernel.org/majordomo-info.html
> > > 
> > > .
> > > 
> > 
> > --
> > To unsubscribe from this list: send the line "unsubscribe linux-xfs" in
> > the body of a message to majordomo@xxxxxxxxxxxxxxx
> > More majordomo info at  http://vger.kernel.org/majordomo-info.html
> --
> To unsubscribe from this list: send the line "unsubscribe linux-xfs" in
> the body of a message to majordomo@xxxxxxxxxxxxxxx
> More majordomo info at  http://vger.kernel.org/majordomo-info.html
--
To unsubscribe from this list: send the line "unsubscribe linux-xfs" in
the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at  http://vger.kernel.org/majordomo-info.html