On Fri, Sep 08, 2017 at 11:34:58AM +0300, Amir Goldstein wrote:
> On Thu, Sep 7, 2017 at 7:13 PM, Darrick J. Wong <darrick.wong@xxxxxxxxxx> wrote:
> > On Thu, Sep 07, 2017 at 03:58:56PM +0300, Amir Goldstein wrote:
> >> Hi guys,
> >>
> >> I am getting these errors often when running the crash tests
> >> with cloned files (generic/502 in my xfstests patches).
> >>
> >> Hitting these errors requires first fixing 2 other issues
> >> that shadow over this issue:
> >> "xfs: fix incorrect log_flushed on fsync" (in master)
> >> "xfs: fix leftover CoW extent after truncate"
> >> available on my tree based on Darrick's simple fix:
> >> https://github.com/amir73il/linux/commits/xfs-fsync
> >>
> >> I get the errors more often (1 out of 5) on a 100G fs on spinning disk.
> >> On a 10G fs on SSD they are less frequent.
> >> The log in this email was captured on patched stable 4.9.47 kernel,
> >> but I am getting the same errors on patched upstream kernel.
> >>
> >> I wasn't able to create a deterministic reproducer, so attaching
> >> the full log from a failed test along with an IO log that can be
> >> replayed on your disk to examine the outcome.
> >>
> >> Following is the output of fsx process #5, which is the process
> >> that wrote the problematic testfile5.mark0 to the log.
> >> This process performs only read,zero,fsync before creating
> >> the log mark.
> >> The file testfile5 was cloned from an origin 256K file before
> >> running fsx.
> >> Later, I used the random seed 35484 in this log for all
> >> processes and it seemed to increase the probability for failure.
> >>
> >> # /old/home/amir/src/xfstests-dev/ltp/fsx -N 100 -d -k -P
> >> /mnt/test/fsxtests -i /dev/mapper/logwrites-test -S 0 -j 5
> >> /mnt/scratch/testfile5
> >> Seed set to 35484
> >> file_size=262144
> >> 5: 1 read 0x3f959 thru 0x3ffff (0x6a7 bytes)
> >> 5: 2 zero from 0x3307e to 0x34f74, (0x1ef6 bytes)
> >> 5: 3 fsync
> >> 5: Dumped fsync buffer to testfile5.mark0
> >>
> >> In order to get to the crash state you need to get my
> >> xfstests replay-log patches and replay the attached log
> >> on a >= 100G scratch device:
> >>
> >> # ./src/log-writes/replay-log --log log.xfs.testfile5.mark0 --replay
> >> $SCRATCH_DEV --end-mark testfile5.mark0
> >> # mount $SCRATCH_DEV $SCRATCH_MNT
> >> # umount $SCRATCH_MNT
> >> # xfs_repair -n $SCRATCH_DEV
> >> Phase 1 - find and verify superblock...
> >> Phase 2 - using internal log
> >>         - zero log...
> >>         - scan filesystem freespace and inode maps...
> >>         - found root inode chunk
> >> Phase 3 - for each AG...
> >>         - scan (but don't clear) agi unlinked lists...
> >>         - process known inodes and perform inode discovery...
> >>         - agno = 0
> >>
> >> fatal error -- illegal state 13 in block map 376
> >>
> >> Can anyone provide some insight?
> >
> > Looks like I missed a couple of extent states in process_bmbt_reclist_int.
> >
> > What happens if you add the following (only compile tested) patch to
> > xfsprogs?
> This is what happens:
>
> Phase 1 - find and verify superblock...
> Phase 2 - using internal log
>         - zero log...
>         - scan filesystem freespace and inode maps...
>         - found root inode chunk
> Phase 3 - for each AG...
>         - scan (but don't clear) agi unlinked lists...
>         - process known inodes and perform inode discovery...
>         - agno = 0
> data fork in regular inode 134 claims CoW block 376
> correcting nextents for inode 134
> bad data fork in inode 134
> would have cleared inode 134
>         - agno = 1
>         - agno = 2
>         - agno = 3
>         - process newly discovered inodes...
> Phase 4 - check for duplicate blocks...
>         - setting up duplicate extent list...
> unknown block state, ag 0, block 376
> unknown block state, ag 1, block 16
>         - check for inodes claiming duplicate blocks...
>         - agno = 0
>         - agno = 1
>         - agno = 2
> entry "testfile2" in shortform directory 128 references free inode 134
>         - agno = 3
> would have junked entry "testfile2" in directory inode 128
> imap claims in-use inode 134 is free, would correct imap
> Missing reverse-mapping record for (0/376) len 1 owner 134 off 19
> No modify flag set, skipping phase 5
> Phase 6 - check inode connectivity...
>         - traversing filesystem ...
>         - traversal finished ...
>         - moving disconnected inodes to lost+found ...
> Phase 7 - verify link counts...
> No modify flag set, skipping filesystem flush and exiting.

Hmm.

# xfs_db bugs.img
xfs_db> inode 134
xfs_db> bmap
data offset 0 startblock 24 (0/24) count 19 flag 0
data offset 19 startblock 376 (0/376) count 1 flag 0
data offset 25 startblock 8388624 (1/16) count 4 flag 1
data offset 29 startblock 53 (0/53) count 35 flag 0

Ok, so inode 134 thinks it owns block 376.

xfs_db> agf 0
xfs_db> addr refcntroot
xfs_db> p
...
recs[1-22] = [startblock,blockcount,refcount,cowflag]
1:[24,1,11,0] ... 22:[87,1,6,0]
xfs_db>

Ok, so the refcountbt doesn't know about block 376, so it's neither shared
nor a cow staging extent.

xfs_db> agf 0
xfs_db> addr rmaproot
xfs_db> p
...
recs[1-38] = [startblock,blockcount,owner,offset,extentflag,attrfork,bmbtblock]
1:[0,1,-3,0,0,0,0] ... 38:[376,1,-9,0,0,0,0]
xfs_db>

So it seems that the rmapbt is behind the times, since record 38 should be
[376,1,134,19,0,0,0].  Either that's because there's something wrong with
the regular deferred ops mechanism, or log recovery of deferred ops.
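(Side note, for anyone who wants to script this kind of check: the same
queries can be run non-interactively with xfs_db -c, e.g.

# xfs_db -r -c 'inode 134' -c 'bmap' bugs.img
# xfs_db -r -c 'agf 0' -c 'addr refcntroot' -c 'p' bugs.img
# xfs_db -r -c 'agf 0' -c 'addr rmaproot' -c 'p' bugs.img

These are just the interactive commands above strung together; -r opens the
image read-only.)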
Let's have a look at the log...

# xfs_logprint bugs.img | grep x178 -B3
Oper (426): tid: 5c75a3f len: 48 clientid: TRANS flags: none
CUI: #regs: 1 num_extents: 2 id: 0xffff8f0e842f74b8
(s: 0x2b, l: 1, f: 0x2)
(s: 0x178, l: 1, f: 0x4)

...ok, so we XFS_REFCOUNT_FREE_COW block 376, which removes it from the
refcountbt...

----------------------------------------------------------------------------
Oper (427): tid: 5c75a3f len: 48 clientid: TRANS flags: none
BUI: #regs: 1 num_extents: 1 id: 0xffff8f0e81f39000
(s: 0x178, l: 1, own: 134, off: 19, f: 0x1)

...now we map block 376 into the data fork of inode 134 at offset 19...

--
----------------------------------------------------------------------------
Oper (447): tid: 5c75a3f len: 48 clientid: TRANS flags: none
RUI: #regs: 1 num_extents: 1 id: 0xffff8f0e48254000
(s: 0x178, l: 1, own: -9, off: 0, f: 0x8)

...now we XFS_RMAP_EXTENT_FREE the CoW extent @ block 376 from the rmap,
which ought to remove the record...

--
----------------------------------------------------------------------------
Oper (453): tid: e0ecea6a len: 48 clientid: TRANS flags: none
RUI: #regs: 1 num_extents: 1 id: 0xffff8f0e48254530
(s: 0x178, l: 1, own: 134, off: 19, f: 0x2)

...and XFS_RMAP_EXTENT_MAP_SHARED a new rmapping for block 376, which should
add the [376,1,134,19,0,0,0] that we'd expect to be there...

--
----------------------------------------------------------------------------
Oper (13): tid: 17c5790c len: 48 clientid: TRANS flags: none
RUI: #regs: 1 num_extents: 1 id: 0xffff925955341f20
(s: 0x178, l: 1, own: 134, off: 19, f: 0x2)

...strange, another rmap add?  Did the RUIs fail to complete?
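(To answer that more systematically, here's a rough sketch of a whole-log
intent/done pairing check.  It assumes every intent/done item that
xfs_logprint emits carries an "id:" field formatted like the RUI/RUD lines
above -- the CUD/BUD/EFI/EFD formats aren't shown in this thread, and the ids
are just kernel pointers that can be reused over a long log, so treat it as a
starting point rather than a polished tool:

# xfs_logprint bugs.img | awk '
    /(RUI|BUI|CUI|EFI):/ { for (i = 1; i <= NF; i++) if ($i == "id:") intent[$(i+1)] = $1 }
    /(RUD|BUD|CUD|EFD):/ { for (i = 1; i <= NF; i++) if ($i == "id:") delete intent[$(i+1)] }
    END { for (id in intent) print "intent with no matching done item:", intent[id], id }
'

That said, for this particular question grepping for the two RUI ids
directly is enough.)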
# xfs_logprint bugs.img | egrep '(0xffff8f0e48254530|0xffff925955341f20)'
RUI: #regs: 1 num_extents: 1 id: 0xffff8f0e48254530
RUD: #regs: 1 id: 0xffff8f0e48254530
RUI: #regs: 1 num_extents: 1 id: 0xffff925955341f20
RUD: #regs: 1 id: 0xffff925955341f20

...the first RUI is logged as operation 453 in tid e0ecea6a, and it has an
RUD at 460.  However, I don't see a COMMIT, so I gather that log recovery
tried to replay the RUI as operation 13 in tid 17c5790c.

----------------------------------------------------------------------------
Oper (13): tid: 17c5790c len: 48 clientid: TRANS flags: none
RUI: #regs: 1 num_extents: 1 id: 0xffff925955341f20
(s: 0x178, l: 1, own: 134, off: 19, f: 0x2)
----------------------------------------------------------------------------
Oper (14): tid: 17c5790c len: 16 clientid: TRANS flags: none
RUD: #regs: 1 id: 0xffff925955341f20
----------------------------------------------------------------------------
Oper (15): tid: 17c5790c len: 16 clientid: TRANS flags: none
RUD: #regs: 1 id: 0xffff8f0e48254000
----------------------------------------------------------------------------
Oper (16): tid: 17c5790c len: 24 clientid: TRANS flags: none
BUF: #regs: 3 start blkno: 40 (0x28) len: 8 bmap size: 1 flags: 0x2000
Oper (17): tid: 17c5790c len: 128 clientid: TRANS flags: none
BUF DATA
 0 33424d52 31000000 ffffffff ffffffff 0 28000000 1000000 38000000
 8 3674196c 564cfddf 7798e496 591b6410 0 42249385 0 1000000
10 ffffffff fdffffff 0 0 1000000 2000000 ffffffff fbffffff
18 0 0 3000000 2000000 ffffffff faffffff 0 0
Oper (18): tid: 17c5790c len: 128 clientid: TRANS flags: none
BUF DATA
 0 0 0 4d010000 9000000 0 8a000000 0 2a000000
 8 56010000 d000000 ffffffff f7ffffff 0 0 65010000 6b000000
10 ffffffff f7ffffff 0 0 78010000 1000000 0 86000000
18 0 13000000 0 0 0 0 0 0
----------------------------------------------------------------------------
Oper (19): tid: 17c5790c len: 0 clientid: TRANS flags: COMMIT

Ok, so those last two lines in operation 16 show that we've indeed added the
new record to the rmapbt...

<shrug> <confused> <tired>??

Are we sure that the log recovery actions actually get checkpointed
correctly?

And just to check my own sanity, we generate a snapshot and then recover it
and run xfs_repair, right?

--D

> >
> > (Normally I'd say send a metadump too for us mere mortals to work with,
> > though I'm about to plunge into weddingland so I likely won't be able to
> > do much until the 18th.)
> >
>
> Attached (used xfs_metadump -ao)
> Soon we will all be gods with powers to replay history ;)
>
> > ((Eric: If this doesn't turn out to be a totally garbage patch, feel
> > free to add it to xfsprogs.))
> >
> > --D
> >
--
To unsubscribe from this list: send the line "unsubscribe linux-xfs" in
the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at  http://vger.kernel.org/majordomo-info.html