On Tue, Sep 19, 2017 at 8:38 AM, Darrick J. Wong <darrick.wong@xxxxxxxxxx> wrote:
> On Fri, Sep 08, 2017 at 11:34:58AM +0300, Amir Goldstein wrote:
>> On Thu, Sep 7, 2017 at 7:13 PM, Darrick J. Wong <darrick.wong@xxxxxxxxxx> wrote:
>> > On Thu, Sep 07, 2017 at 03:58:56PM +0300, Amir Goldstein wrote:
>> >> Hi guys,
>> >>
>> >> I am getting these errors often when running the crash tests
>> >> with cloned files (generic/502 in my xfstests patches).
>> >>
>> >> Hitting these errors requires first fixing 2 other issues
>> >> that shadow over this issue:
>> >> "xfs: fix incorrect log_flushed on fsync" (in master)
>> >> "xfs: fix leftover CoW extent after truncate"
>> >> available on my tree based on Darrick's simple fix:
>> >> https://github.com/amir73il/linux/commits/xfs-fsync
>> >>
>> >> I get the errors more often (1 out of 5) on a 100G fs on spinning disk.
>> >> On a 10G fs on SSD they are less frequent.
>> >> The log in this email was captured on patched stable 4.9.47 kernel,
>> >> but I am getting the same errors on patched upstream kernel.
>> >>
>> >> I wasn't able to create a deterministic reproducer, so attaching
>> >> the full log from a failed test along with an IO log that can be
>> >> replayed on your disk to examine the outcome.
>> >>
>> >> Following is the output of fsx process #5, which is the process
>> >> that wrote the problematic testfile5.mark0 to the log.
>> >> This process performs only read,zero,fsync before creating
>> >> the log mark.
>> >> The file testfile5 was cloned from an origin 256K file before
>> >> running fsx.
>> >> Later, I used the random seed 35484 in this log for all
>> >> processes and it seemed to increase the probability for failure.
>> >>
>> >> # /old/home/amir/src/xfstests-dev/ltp/fsx -N 100 -d -k -P
>> >> /mnt/test/fsxtests -i /dev/mapper/logwrites-test -S 0 -j 5
>> >> /mnt/scratch/testfile5
>> >> Seed set to 35484
>> >> file_size=262144
>> >> 5: 1 read 0x3f959 thru 0x3ffff (0x6a7 bytes)
>> >> 5: 2 zero from 0x3307e to 0x34f74, (0x1ef6 bytes)
>> >> 5: 3 fsync
>> >> 5: Dumped fsync buffer to testfile5.mark0
>> >>
>> >> In order to get to the crash state you need to get my
>> >> xfstests replay-log patches and replay the attached log
>> >> on a >= 100G scratch device:
>> >>
>> >> # ./src/log-writes/replay-log --log log.xfs.testfile5.mark0 --replay
>> >> $SCRATCH_DEV --end-mark testfile5.mark0
>> >> # mount $SCRATCH_DEV $SCRATCH_MNT
>> >> # umount $SCRATCH_MNT
>> >> # xfs_repair -n $SCRATCH_DEV
>> >> Phase 1 - find and verify superblock...
>> >> Phase 2 - using internal log
>> >>         - zero log...
>> >>         - scan filesystem freespace and inode maps...
>> >>         - found root inode chunk
>> >> Phase 3 - for each AG...
>> >>         - scan (but don't clear) agi unlinked lists...
>> >>         - process known inodes and perform inode discovery...
>> >>         - agno = 0
>> >>
>> >> fatal error -- illegal state 13 in block map 376
>> >>
>> >> Can anyone provide some insight?
>> >
>> > Looks like I missed a couple of extent states in process_bmbt_reclist_int.
>> >
>> > What happens if you add the following (only compile tested) patch to
>> > xfsprogs?
>>
>> This is what happens:
>>
>> Phase 1 - find and verify superblock...
>> Phase 2 - using internal log
>>         - zero log...
>>         - scan filesystem freespace and inode maps...
>>         - found root inode chunk
>> Phase 3 - for each AG...
>>         - scan (but don't clear) agi unlinked lists...
>>         - process known inodes and perform inode discovery...
>>         - agno = 0
>> data fork in regular inode 134 claims CoW block 376
>> correcting nextents for inode 134
>> bad data fork in inode 134
>> would have cleared inode 134
>>         - agno = 1
>>         - agno = 2
>>         - agno = 3
>>         - process newly discovered inodes...
>> Phase 4 - check for duplicate blocks...
>>         - setting up duplicate extent list...
>> unknown block state, ag 0, block 376
>> unknown block state, ag 1, block 16
>>         - check for inodes claiming duplicate blocks...
>>         - agno = 0
>>         - agno = 1
>>         - agno = 2
>> entry "testfile2" in shortform directory 128 references free inode 134
>>         - agno = 3
>> would have junked entry "testfile2" in directory inode 128
>> imap claims in-use inode 134 is free, would correct imap
>> Missing reverse-mapping record for (0/376) len 1 owner 134 off 19
>> No modify flag set, skipping phase 5
>> Phase 6 - check inode connectivity...
>>         - traversing filesystem ...
>>         - traversal finished ...
>>         - moving disconnected inodes to lost+found ...
>> Phase 7 - verify link counts...
>> No modify flag set, skipping filesystem flush and exiting.
>
> Hmm.
>
> # xfs_db bugs.img
> xfs_db> inode 134
> xfs_db> bmap
> data offset 0 startblock 24 (0/24) count 19 flag 0
> data offset 19 startblock 376 (0/376) count 1 flag 0
> data offset 25 startblock 8388624 (1/16) count 4 flag 1
> data offset 29 startblock 53 (0/53) count 35 flag 0
>
> Ok, so inode 134 thinks it owns block 376.
>
> xfs_db> agf 0
> xfs_db> addr refcntroot
> xfs_db> p
> ...
> recs[1-22] = [startblock,blockcount,refcount,cowflag]
> 1:[24,1,11,0]
> ...
> 22:[87,1,6,0]
> xfs_db>
>
> Ok, so the refcountbt doesn't know about block 376, so it's neither
> shared nor a cow staging extent.
>
> xfs_db> agf 0
> xfs_db> addr rmaproot
> xfs_db> p
> ...
> recs[1-38] = [startblock,blockcount,owner,offset,extentflag,attrfork,bmbtblock]
> 1:[0,1,-3,0,0,0,0]
> ...
> 38:[376,1,-9,0,0,0,0]
> xfs_db>
>
> So it seems that the rmapbt is behind the times, since record 38 should
> be [376,1,134,19,0,0,0]. Either that's because there's something wrong
> with the regular deferred ops mechanism, or log recovery of deferred ops.
> Let's have a look at the log...
>
> # xfs_logprint bugs.img | grep x178 -B3
> Oper (426): tid: 5c75a3f len: 48 clientid: TRANS flags: none
> CUI: #regs: 1 num_extents: 2 id: 0xffff8f0e842f74b8
> (s: 0x2b, l: 1, f: 0x2)
> (s: 0x178, l: 1, f: 0x4)
>
> ...ok, so we XFS_REFCOUNT_FREE_COW block 376, which removes it from
> the refcountbt...
>
> ----------------------------------------------------------------------------
> Oper (427): tid: 5c75a3f len: 48 clientid: TRANS flags: none
> BUI: #regs: 1 num_extents: 1 id: 0xffff8f0e81f39000
> (s: 0x178, l: 1, own: 134, off: 19, f: 0x1)
>
> ...now we map block 376 into the data fork of inode 134 at offset 19...
>
> --
> ----------------------------------------------------------------------------
> Oper (447): tid: 5c75a3f len: 48 clientid: TRANS flags: none
> RUI: #regs: 1 num_extents: 1 id: 0xffff8f0e48254000
> (s: 0x178, l: 1, own: -9, off: 0, f: 0x8)
>
> ...now we XFS_RMAP_EXTENT_FREE the CoW extent @ block 376 from the rmap,
> which ought to remove the record...
>
> --
> ----------------------------------------------------------------------------
> Oper (453): tid: e0ecea6a len: 48 clientid: TRANS flags: none
> RUI: #regs: 1 num_extents: 1 id: 0xffff8f0e48254530
> (s: 0x178, l: 1, own: 134, off: 19, f: 0x2)
>
> ...and XFS_RMAP_EXTENT_MAP_SHARED a new rmapping for block 376, which
> should add the [376,1,134,19,0,0,0] that we'd expect to be there...
>
> --
> ----------------------------------------------------------------------------
> Oper (13): tid: 17c5790c len: 48 clientid: TRANS flags: none
> RUI: #regs: 1 num_extents: 1 id: 0xffff925955341f20
> (s: 0x178, l: 1, own: 134, off: 19, f: 0x2)
>
> ...strange, another rmap add? Did the RUIs fail to complete?
>
> # xfs_logprint bugs.img | egrep '(0xffff8f0e48254530|0xffff925955341f20)'
> RUI: #regs: 1 num_extents: 1 id: 0xffff8f0e48254530
> RUD: #regs: 1 id: 0xffff8f0e48254530
> RUI: #regs: 1 num_extents: 1 id: 0xffff925955341f20
> RUD: #regs: 1 id: 0xffff925955341f20
>
> ...the first RUI is logged as operation 453 in tid e0ecea6a, and it has
> an RUD at 460. However, I don't see a COMMIT, so I gather that log
> recovery tried to replay the RUI as operation 13 in tid 17c5790c.
>
> ----------------------------------------------------------------------------
> Oper (13): tid: 17c5790c len: 48 clientid: TRANS flags: none
> RUI: #regs: 1 num_extents: 1 id: 0xffff925955341f20
> (s: 0x178, l: 1, own: 134, off: 19, f: 0x2)
> ----------------------------------------------------------------------------
> Oper (14): tid: 17c5790c len: 16 clientid: TRANS flags: none
> RUD: #regs: 1 id: 0xffff925955341f20
> ----------------------------------------------------------------------------
> Oper (15): tid: 17c5790c len: 16 clientid: TRANS flags: none
> RUD: #regs: 1 id: 0xffff8f0e48254000
> ----------------------------------------------------------------------------
> Oper (16): tid: 17c5790c len: 24 clientid: TRANS flags: none
> BUF: #regs: 3 start blkno: 40 (0x28) len: 8 bmap size: 1 flags: 0x2000
> Oper (17): tid: 17c5790c len: 128 clientid: TRANS flags: none
> BUF DATA
> 0 33424d52 31000000 ffffffff ffffffff 0 28000000 1000000 38000000
> 8 3674196c 564cfddf 7798e496 591b6410 0 42249385 0 1000000
> 10 ffffffff fdffffff 0 0 1000000 2000000 ffffffff fbffffff
> 18 0 0 3000000 2000000 ffffffff faffffff 0 0
>
> Oper (18): tid: 17c5790c len: 128 clientid: TRANS flags: none
> BUF DATA
> 0 0 0 4d010000 9000000 0 8a000000 0 2a000000
> 8 56010000 d000000 ffffffff f7ffffff 0 0 65010000 6b000000
> 10 ffffffff f7ffffff 0 0 78010000 1000000 0 86000000
> 18 0 13000000 0 0 0 0 0 0
>
> ----------------------------------------------------------------------------
> Oper (19): tid: 17c5790c len: 0 clientid: TRANS flags: COMMIT
>
> Ok, so those last two lines in operation 16 show that we've indeed added the
> new record to the rmapbt... <shrug> <confused> <tired>??
>
> Are we sure that the log recovery actions actually get checkpointed
> correctly?

I'm not sure of anything.
dm-log-writes may have bugs, but so far every suspicion has bounced back
to a file system bug.
I'm pretty sure, though, that dm-log-writes does not treat log recovery
actions any differently from other IO; it is simply not smart enough for that.

To better understand what happened at the block level, I added the range
filter options --start-sector and --end-sector to the replay-log tool.
If you have narrowed the problem down to a physical block whose content is
in dispute (which was the case with the data corruption bug), you can run
replay-log --vv on that physical range to study the case.
replay-log is now in upstream xfstests, and so is test generic/457, which
reproduces this error.
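For illustration, an invocation could look something like this (the sector
numbers are made up; with a 4K block fs, AG 0 block 376 would sit around
sector 3008, and I'm assuming the new range options and the --vv verbosity
just combine with the usual --log/--replay/--end-mark flags):

# ./src/log-writes/replay-log --log log.xfs.testfile5.mark0 --replay $SCRATCH_DEV \
	--end-mark testfile5.mark0 --start-sector 3008 --end-sector 3016 --vv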

> And just to check my own sanity, we generate a snapshot and
> then recover it and run xfs_repair, right?
>

Let me rephrase that for clarity of terms:
- We record all IO, starting from before mkfs
- We write a snapshot mark into the recording at certain points (post fsync in this test)
Then
- We replay the recorded IO onto the device up to a mark (i.e. restore from snapshot)
- We mount to recover the xfs log
- We verify the content of the fsynced file
- We umount and check the fs
(the P.S. below sketches this sequence in command form)

I sent you the metadump from this point in time.
I can also send you a metadump from before the mount, so you can examine
that state as well.
Or you can reproduce it yourself on a >= 100G scratch device:

# ./src/log-writes/replay-log --log log.xfs.testfile5.mark0 --replay $SCRATCH_DEV --end-mark testfile5.mark0

Let me know if you need anything else.

Amir.
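P.S. In command form, the sequence above is roughly the sketch below.
This is only a sketch: the dmsetup lines are from memory of the dm-log-writes
target interface, $LOGWRITES_DEV is a placeholder for the device that holds
the recording, and the cmp step is just illustrative (the real test checks
the replayed file against the fsync buffer that fsx dumps with -P).

(record all IO on a log-writes target created before mkfs,
 write a mark after fsync)
# dmsetup create logwrites-test --table \
	"0 $(blockdev --getsz $SCRATCH_DEV) log-writes $SCRATCH_DEV $LOGWRITES_DEV"
# mkfs.xfs -m reflink=1 /dev/mapper/logwrites-test
# mount /dev/mapper/logwrites-test $SCRATCH_MNT
# <run fsx with -i /dev/mapper/logwrites-test -j 5; it fsyncs and then
#  writes the testfile5.mark0 mark into the recording>
# umount $SCRATCH_MNT
# dmsetup remove logwrites-test

(replay up to the mark = restore from snapshot, mount to recover the xfs log,
 verify the fsynced file, then umount and check the fs)
# ./src/log-writes/replay-log --log $LOGWRITES_DEV --replay $SCRATCH_DEV \
	--end-mark testfile5.mark0
# mount $SCRATCH_DEV $SCRATCH_MNT
# cmp $SCRATCH_MNT/testfile5 /mnt/test/fsxtests/testfile5.mark0
# umount $SCRATCH_MNT
# xfs_repair -n $SCRATCH_DEV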