On Tue, Oct 10, 2017 at 12:18:25PM -0700, Darrick J. Wong wrote: > On Mon, Oct 09, 2017 at 08:48:34PM +0800, Hou Tao wrote: > > Hi, > > > > On 2017/9/19 13:38, Darrick J. Wong wrote: > > > On Fri, Sep 08, 2017 at 11:34:58AM +0300, Amir Goldstein wrote: > > >> On Thu, Sep 7, 2017 at 7:13 PM, Darrick J. Wong <darrick.wong@xxxxxxxxxx> wrote: > > >>> On Thu, Sep 07, 2017 at 03:58:56PM +0300, Amir Goldstein wrote: > > >>>> Hi guys, > > >>>> > > >>>> I am getting these errors often when running the crash tests > > >>>> with cloned files (generic/502 in my xfstests patches). > > >>>> > > >>>> Hitting these errors requires first fixing 2 other issues > > >>>> that shadow over this issue: > > >>>> "xfs: fix incorrect log_flushed on fsync" (in master) > > >>>> "xfs: fix leftover CoW extent after truncate" > > >>>> available on my tree based on Darrick's simple fix: > > >>>> https://github.com/amir73il/linux/commits/xfs-fsync > > >>>> > > >>>> I get the errors more often (1 out of 5) on a 100G fs on spinning disk. > > >>>> On a 10G fs on SSD they are less frequent. > > >>>> The log in this email was captured on patched stable 4.9.47 kernel, > > >>>> but I am getting the same errors on patched upstream kernel. > > >>>> > > >>>> I wasn't able to create a deterministic reproducer, so attaching > > >>>> the full log from a failed test along with an IO log that can be > > >>>> replayed on your disk to examine the outcome. > > >>>> > > >>>> Following is the output of fsx process #5, which is the process > > >>>> that wrote the problematic testfile5.mark0 to the log. > > >>>> This process performs only read,zero,fsync before creating > > >>>> the log mark. > > >>>> The file testfile5 was cloned from an origin 256K file before > > >>>> running fsx. > > >>>> Later, I used the random seed 35484 in this log for all > > >>>> processes and it seemed to increase the probability for failure. > > >>>> > > >>>> # /old/home/amir/src/xfstests-dev/ltp/fsx -N 100 -d -k -P > > >>>> /mnt/test/fsxtests -i /dev/mapper/logwrites-test -S 0 -j 5 > > >>>> /mnt/scratch/testfile5 > > >>>> Seed set to 35484 > > >>>> file_size=262144 > > >>>> 5: 1 read 0x3f959 thru 0x3ffff (0x6a7 bytes) > > >>>> 5: 2 zero from 0x3307e to 0x34f74, (0x1ef6 bytes) > > >>>> 5: 3 fsync > > >>>> 5: Dumped fsync buffer to testfile5.mark0 > > >>>> > > >>>> In order to get to the crash state you need to get my > > >>>> xfstests replay-log patches and replay the attached log > > >>>> on a >= 100G scratch device: > > >>>> > > >>>> # ./src/log-writes/replay-log --log log.xfs.testfile5.mark0 --replay > > >>>> $SCRATCH_DEV --end-mark testfile5.mark0 > > >>>> # mount $SCRATCH_DEV $SCRATCH_MNT > > >>>> # umount $SCRATCH_MNT > > >>>> # xfs_repair -n $SCRATCH_DEV > > >>>> Phase 1 - find and verify superblock... > > >>>> Phase 2 - using internal log > > >>>> - zero log... > > >>>> - scan filesystem freespace and inode maps... > > >>>> - found root inode chunk > > >>>> Phase 3 - for each AG... > > >>>> - scan (but don't clear) agi unlinked lists... > > >>>> - process known inodes and perform inode discovery... > > >>>> - agno = 0 > > >>>> > > >>>> fatal error -- illegal state 13 in block map 376 > > >>>> > > >>>> Can anyone provide some insight? > > >>> > > >>> Looks like I missed a couple of extent states in process_bmbt_reclist_int. > > >>> > > >>> What happens if you add the following (only compile tested) patch to > > >>> xfsprogs? > > >> > > >> This is what happens: > > >> > > >> Phase 1 - find and verify superblock... > > >> Phase 2 - using internal log > > >> - zero log... > > >> - scan filesystem freespace and inode maps... > > >> - found root inode chunk > > >> Phase 3 - for each AG... > > >> - scan (but don't clear) agi unlinked lists... > > >> - process known inodes and perform inode discovery... > > >> - agno = 0 > > >> data fork in regular inode 134 claims CoW block 376 > > >> correcting nextents for inode 134 > > >> bad data fork in inode 134 > > >> would have cleared inode 134 > > >> - agno = 1 > > >> - agno = 2 > > >> - agno = 3 > > >> - process newly discovered inodes... > > >> Phase 4 - check for duplicate blocks... > > >> - setting up duplicate extent list... > > >> unknown block state, ag 0, block 376 > > >> unknown block state, ag 1, block 16 > > >> - check for inodes claiming duplicate blocks... > > >> - agno = 0 > > >> - agno = 1 > > >> - agno = 2 > > >> entry "testfile2" in shortform directory 128 references free inode 134 > > >> - agno = 3 > > >> would have junked entry "testfile2" in directory inode 128 > > >> imap claims in-use inode 134 is free, would correct imap > > >> Missing reverse-mapping record for (0/376) len 1 owner 134 off 19 > > >> No modify flag set, skipping phase 5 > > >> Phase 6 - check inode connectivity... > > >> - traversing filesystem ... > > >> - traversal finished ... > > >> - moving disconnected inodes to lost+found ... > > >> Phase 7 - verify link counts... > > >> No modify flag set, skipping filesystem flush and exiting. > > > > > > Hmm. > > > > > > # xfs_db bugs.img > > > xfs_db> inode 134 > > > xfs_db> bmap > > > data offset 0 startblock 24 (0/24) count 19 flag 0 > > > data offset 19 startblock 376 (0/376) count 1 flag 0 > > > data offset 25 startblock 8388624 (1/16) count 4 flag 1 > > > data offset 29 startblock 53 (0/53) count 35 flag 0 > > > > > > Ok, so inode 134 thinks it owns block 376. > > > > > > xfs_db> agf 0 > > > xfs_db> addr refcntroot > > > xfs_db> p > > > ... > > > recs[1-22] = [startblock,blockcount,refcount,cowflag] > > > 1:[24,1,11,0] > > > ... > > > 22:[87,1,6,0] > > > xfs_db> > > > > > > Ok, so the refcountbt doesn't know about block 376, so it's neither > > > shared nor a cow staging extent. > > > > > > xfs_db> agf 0 > > > xfs_db> addr rmaproot > > > xfs_db> p > > > ... > > > recs[1-38] = [startblock,blockcount,owner,offset,extentflag,attrfork,bmbtblock] > > > 1:[0,1,-3,0,0,0,0] > > > ... > > > 38:[376,1,-9,0,0,0,0] > > > xfs_db> > > > > > > So it seems that the rmapbt is behind the times, since record 38 should > > > be [376,1,134,19,0,0,0]. Either that's because there's something wrong > > > with the regular deferred ops mechanism, or log recovery of deferred ops. > > I think the log recovery of deferred ops is incorrect, and we also need to > > add more checks when manipulating the b-tree of rmap. > > > > I had reproduced the problem on my KVM VM, and i will explain the incorrectness of > > log recovery of deferred ops by using my own example. If addition information is > > needed, please just ask. > > > > The phenomena is the same: a block extent had been assigned to a file, but in the > > rmap b-tree the block extent belongs to COW procedure. > > > > We can get the following output from xfs_db after mounting and umounting the device: > > > > ./src/log-writes/replay-log --replay /dev/vdd --end-mark testfile7.mark0 --log /dev/vde > > > > xfs_logprint -o /dev/vdd > pre_mount.log > > mount /dev/vdd /tmp/vdd > > umount /tmp/vdd > > xfs_logprint -o /dev/vdd > post_mount.log > > > > xfs_db /dev/vdd > > xfs_db> inode 137 > > xfs_db> p u3.bmx[3] > > u3.bmx[3] = [startoff,startblock,blockcount,extentflag] > > 3:[58,341,2,0] > > > > xfs_db> agf 0 > > xfs_db> addr rmaproot > > xfs_db> p recs[44] > > recs[44] = [startblock,blockcount,owner,offset,extentflag,attrfork,bmbtblock] > > 44:[341,2,-9,0,0,0,0] > > > > The last log record in pre_mount.log (generated before mount) has the following ops > > which are related with inode 137: > > > > # from pre_mount.log > > # initiate by xfs_reflink_end_cow() > > # decrease ref: [0x52, +2] > > # free cow: [0x155, +2] > > ---------------------------------------------------------------------------- > > Oper (160): tid: c2350d60 len: 48 clientid: TRANS flags: none > > CUI: #regs: 1 num_extents: 2 id: 0xffff88011b37c680 > > (s: 0x52, l: 2, f: 0x2) > > (s: 0x155, l: 2, f: 0x4) > > ---------------------------------------------------------------------------- > > > > # map shared: [0x155, +2] to inode 137 > > ---------------------------------------------------------------------------- > > Oper (161): tid: c2350d60 len: 48 clientid: TRANS flags: none > > BUI: #regs: 1 num_extents: 1 id: 0xffff880117f0c300 > > (s: 0x155, l: 2, own: 137, off: 58, f: 0x1) > > ---------------------------------------------------------------------------- > > > > # for CUI Oper (160) > > ---------------------------------------------------------------------------- > > Oper (194): tid: c2350d60 len: 16 clientid: TRANS flags: none > > CUD: #regs: 1 id: 0xffff88011b37c680 > > ---------------------------------------------------------------------------- > > > > # free rmap: [0x155, +2] > > ---------------------------------------------------------------------------- > > Oper (197): tid: c2350d60 len: 48 clientid: TRANS flags: none > > RUI: #regs: 1 num_extents: 1 id: 0xffff880124a3ea00 > > (s: 0x155, l: 2, own: -9, off: 0, f: 0x8) > > ---------------------------------------------------------------------------- > > Oper (198): tid: c2350d60 len: 0 clientid: TRANS flags: COMMIT > > ============================================================================ > > > > According to the implementation of xfs_defer_finish(), these ops should be completed > > in the following sequence: > > > > Have been done: > > (1) CUI: Oper (160) > > (2) BUI: Oper (161) > > (3) CUD: Oper (194), for CUI Oper (160) > > (4) RUI A: Oper (197), free rmap [0x155, 2, -9] > > > > Should be done: > > (5) BUD: for BUI Oper (161) > > (6) RUI B: add rmap [0x155, 2, 137] > > (7) RUD: for RUI A > > (8) RUD: for RUI B > > > > Actually be done by xlog_recover_process_intents() > > (5) BUD: for BUI Oper (161) > > (6) RUI B: add rmap [0x155, 2, 137] > > (7) RUD: for RUI B > > (8) RUD: for RUI A > > > > So the rmap entry [0x155, 2, -9] for COW should be freed firstly, > > then a new rmap entry [0x155, 2, 137] will be added. However, as we can see > > from the log record in post_mount.log (generated after umount) and the trace > > print, the new rmap entry [0x155, 2, 137] are added firstly, then the rmap > > entry [0x155, 2, -9] are freed. > > Hmm, yes, the BUI replay causes RUI B to be created and attached to the > transaction it created to replay the BUI... and then we make the fatal > mistake to finish the BUI's accumulated defer_ops before replaying RUI > A. If we were /not/ in log recovery we'd append the new intent items to > the end of the defer_ops that's coordinating all the intent items, but > in log recovery each intent starts its own transaction and manages its > own defer_ops, so everything runs out of order. > > A way to solve this is if xlog_recover_process_intents instead creates > its own defer_ops and passes that into xlog_recover_process_*ui to > collect all the /new/ intent items. Once we're done dealing with the > intent items that were already in the log, we can create another > transaction and xfs_defer_finish all the new defer_ops. > > Hard questions: We'd also have to collect block usage requirements for > the transaction and it's not clear if it's problematic to be collecting > all these follow-on intents (which may have come from totally different > defer_ops pre-crash) in a single defer_ops. I'll have to go have a > think about this one. Sorry for letting this one drag on for a month. It occurred to me that log recovery is the only user of the filesystem at this point, so we could simply throw all the free space at the defer_ops cleanup transaction. I've come up with a RFC-quality patch that restructures the intent item replay code and will start a new thread to discuss it. --D > > # from post_mount.log > > ---------------------------------------------------------------------------- > > Oper (2): tid: 3c5adcc1 len: 16 clientid: TRANS flags: none > > BUD: #regs: 1 id: 0xffff880117f0c300 > > ---------------------------------------------------------------------------- > > > > # add rmap [0x155, 2, 137] > > ---------------------------------------------------------------------------- > > Oper (6): tid: 3c5adcc1 len: 48 clientid: TRANS flags: none > > RUI: #regs: 1 num_extents: 1 id: 0xffff880125bdc2a0 > > (s: 0x155, l: 2, own: 137, off: 58, f: 0x2) > > ---------------------------------------------------------------------------- > > Oper (7): tid: 3c5adcc1 len: 16 clientid: TRANS flags: none > > RUD: #regs: 1 id: 0xffff880125bdc2a0 > > ---------------------------------------------------------------------------- > > > > # remove rmap [0x155, 2, -9] > > ---------------------------------------------------------------------------- > > Oper (20): tid: 3c5adcc1 len: 16 clientid: TRANS flags: none > > RUD: #regs: 1 id: 0xffff880124a3ea00 > > ---------------------------------------------------------------------------- > > > > # from trace print > > > > # add rmap [0x155, 2, 137], find [0x12d, 46, -9], and insert [0x155, 2, 137] > > xfs_rmap_defer: dev 253:48 op 1 agno 0 agbno 341 owner 137 data offset 58 len 2 state 0 > > xfs_rmap_deferred: dev 253:48 op 1 agno 0 agbno 341 owner 137 data offset 58 len 2 state 0 > > xfs_rmap_map: dev 253:48 agno 0 agbno 341 len 2 owner 137 offset 58 flags 0x0 > > xfs_rmap_find_left_neighbor_query: dev 253:48 agno 0 agbno 341 len 0 owner 137 offset 58 flags 0x0 > > xfs_rmap_find_left_neighbor_candidate: dev 253:48 agno 0 agbno 301 len 46 owner -9 offset 0 flags 0x0 > > xfs_rmap_insert: dev 253:48 agno 0 agbno 341 len 2 owner 137 offset 58 flags 0x0 > > xfs_rmap_map_done: dev 253:48 agno 0 agbno 341 len 2 owner 137 offset 58 flags 0x0 > > > > # del rmap [0x155, 2, -9], find [0x155, 2, 137], and remove [0x155, 2, 137] > > xfs_rmap_deferred: dev 253:48 op 7 agno 0 agbno 341 owner -9 data offset 0 len 2 state 0 > > xfs_rmap_unmap: dev 253:48 agno 0 agbno 341 len 2 owner -9 offset 0 flags 0x0 > > xfs_rmap_lookup_le_range_result: dev 253:48 agno 0 agbno 341 len 2 owner 137 offset 58 flags 0x0 > > So we do a non-overlapped lookup for [0x155, 2, -9] and we get [0x155, > 2, 137] because the non-overlapped lookups stop when they get to the > first record for the block number, and -9 comes after 137. We /ought/ > at least to bail out of _rmap_unmap except for this garbled check: > > /* Make sure the owner matches what we expect to find in the tree. */ > XFS_WANT_CORRUPTED_GOTO(mp, owner == ltrec.rm_owner || > XFS_RMAP_NON_INODE_OWNER(owner), out_error); > > I think that really ought to be "owner == ltrec.rm_owner || owner == > XFS_RMAP_OWN_UNKNOWN" instead of what's there. > > Ofc that will just make it so that recovery blows up and the fs is > unmountable, which doesn't get us very far if we don't fix the problem > of the misordered intent item replays. > > --D > > > xfs_rmap_delete: dev 253:48 agno 0 agbno 341 len 2 owner 137 offset 58 flags 0x0 > > xfs_rmap_unmap_done: dev 253:48 agno 0 agbno 341 len 2 owner -9 offset 0 flags 0x0 > > > > # del rmap [0x12d, 40, -9] > > xfs_rmap_unmap: dev 253:48 agno 0 agbno 301 len 40 owner -9 offset 0 flags 0x0 > > # del rmap [0x157, 4, -9] > > xfs_rmap_unmap: dev 253:48 agno 0 agbno 343 len 4 owner -9 offset 0 flags 0x0 > > xfs_rmap_update: dev 253:48 agno 0 agbno 341 len 2 owner -9 offset 0 flags 0x0 > > > > Regards, > > > > Tao > > > > > Let's have a look at the log... > > > > > > # xfs_logprint bugs.img | grep x178 -B3 > > > Oper (426): tid: 5c75a3f len: 48 clientid: TRANS flags: none > > > CUI: #regs: 1 num_extents: 2 id: 0xffff8f0e842f74b8 > > > (s: 0x2b, l: 1, f: 0x2) > > > (s: 0x178, l: 1, f: 0x4) > > > > > > ...ok, so we XFS_REFCOUNT_FREE_COW block 376, which removes it from from > > > the refcountbt... > > > > > > ---------------------------------------------------------------------------- > > > Oper (427): tid: 5c75a3f len: 48 clientid: TRANS flags: none > > > BUI: #regs: 1 num_extents: 1 id: 0xffff8f0e81f39000 > > > (s: 0x178, l: 1, own: 134, off: 19, f: 0x1) > > > > > > ...now we map block 376 into the data fork of inode 134 at offset 19... > > > > > > -- > > > ---------------------------------------------------------------------------- > > > Oper (447): tid: 5c75a3f len: 48 clientid: TRANS flags: none > > > RUI: #regs: 1 num_extents: 1 id: 0xffff8f0e48254000 > > > (s: 0x178, l: 1, own: -9, off: 0, f: 0x8) > > > > > > ...now we XFS_RMAP_EXTENT_FREE the CoW extent @ block 376 from the rmap, > > > which ought to remove the record... > > > > > > -- > > > ---------------------------------------------------------------------------- > > > Oper (453): tid: e0ecea6a len: 48 clientid: TRANS flags: none > > > RUI: #regs: 1 num_extents: 1 id: 0xffff8f0e48254530 > > > (s: 0x178, l: 1, own: 134, off: 19, f: 0x2) > > > > > > ...and XFS_RMAP_EXTENT_MAP_SHARED a new rmapping for block 376, which > > > should add the [376,1,134,19,0,0,0] that we'd expect to be there... > > > > > > -- > > > ---------------------------------------------------------------------------- > > > Oper (13): tid: 17c5790c len: 48 clientid: TRANS flags: none > > > RUI: #regs: 1 num_extents: 1 id: 0xffff925955341f20 > > > (s: 0x178, l: 1, own: 134, off: 19, f: 0x2) > > > > > > ...strange, another rmap add? Did the RUIs fail to complete? > > > > > > # xfs_logprint bugs.img | egrep '(0xffff8f0e48254530|0xffff925955341f20)' > > > RUI: #regs: 1 num_extents: 1 id: 0xffff8f0e48254530 > > > RUD: #regs: 1 id: 0xffff8f0e48254530 > > > RUI: #regs: 1 num_extents: 1 id: 0xffff925955341f20 > > > RUD: #regs: 1 id: 0xffff925955341f20 > > > > > > ...the first RUI is logged as operation 453 in tid e0ecea6a, and it has > > > an RUD at 460. However, I don't see a COMMIT, so I gather that log > > > recovery tried to replay the RUI as operation 13 in tid 17c5790c. > > > > > > ---------------------------------------------------------------------------- > > > Oper (13): tid: 17c5790c len: 48 clientid: TRANS flags: none > > > RUI: #regs: 1 num_extents: 1 id: 0xffff925955341f20 > > > (s: 0x178, l: 1, own: 134, off: 19, f: 0x2) > > > ---------------------------------------------------------------------------- > > > Oper (14): tid: 17c5790c len: 16 clientid: TRANS flags: none > > > RUD: #regs: 1 id: 0xffff925955341f20 > > > ---------------------------------------------------------------------------- > > > Oper (15): tid: 17c5790c len: 16 clientid: TRANS flags: none > > > RUD: #regs: 1 id: 0xffff8f0e48254000 > > > ---------------------------------------------------------------------------- > > > Oper (16): tid: 17c5790c len: 24 clientid: TRANS flags: none > > > BUF: #regs: 3 start blkno: 40 (0x28) len: 8 bmap size: 1 flags: 0x2000 > > > Oper (17): tid: 17c5790c len: 128 clientid: TRANS flags: none > > > BUF DATA > > > 0 33424d52 31000000 ffffffff ffffffff 0 28000000 1000000 38000000 > > > 8 3674196c 564cfddf 7798e496 591b6410 0 42249385 0 1000000 > > > 10 ffffffff fdffffff 0 0 1000000 2000000 ffffffff fbffffff > > > 18 0 0 3000000 2000000 ffffffff faffffff 0 0 > > > > > > Oper (18): tid: 17c5790c len: 128 clientid: TRANS flags: none > > > BUF DATA > > > 0 0 0 4d010000 9000000 0 8a000000 0 2a000000 > > > 8 56010000 d000000 ffffffff f7ffffff 0 0 65010000 6b000000 > > > 10 ffffffff f7ffffff 0 0 78010000 1000000 0 86000000 > > > 18 0 13000000 0 0 0 0 0 0 > > > > > > ---------------------------------------------------------------------------- > > > Oper (19): tid: 17c5790c len: 0 clientid: TRANS flags: COMMIT > > > > > > Ok, so those last two lines in operation 16 show that we've indeed added the > > > new record to the rmapbt... <shrug> <confused> <tired>?? > > > > > > Are we sure that the log recovery actions actually get checkpointed > > > correctly? And just to check my own sanity, we generate a snapshot and > > > then recover it and run xfs_repair, right? > > > > > > --D > > > > > >> > > >>> > > >>> (Normally I'd say send a metadump too for us mere mortals to work with, > > >>> though I'm about to plunge into weddingland so I likely won't be able to > > >>> do much until the 18th.) > > >>> > > >> > > >> Attached (used xfs_metadump -ao) > > >> Soon we will all be gods with powers to replay history ;) > > >> > > >>> ((Eric: If this doesn't turn out to be a totally garbage patch, feel > > >>> free to add it to xfsprogs.)) > > >>> > > >>> --D > > >>> > > > > > > > > > -- > > > To unsubscribe from this list: send the line "unsubscribe linux-xfs" in > > > the body of a message to majordomo@xxxxxxxxxxxxxxx > > > More majordomo info at http://vger.kernel.org/majordomo-info.html > > > > > > . > > > > > > > -- > > To unsubscribe from this list: send the line "unsubscribe linux-xfs" in > > the body of a message to majordomo@xxxxxxxxxxxxxxx > > More majordomo info at http://vger.kernel.org/majordomo-info.html > -- > To unsubscribe from this list: send the line "unsubscribe linux-xfs" in > the body of a message to majordomo@xxxxxxxxxxxxxxx > More majordomo info at http://vger.kernel.org/majordomo-info.html -- To unsubscribe from this list: send the line "unsubscribe linux-xfs" in the body of a message to majordomo@xxxxxxxxxxxxxxx More majordomo info at http://vger.kernel.org/majordomo-info.html