Re: xfs clones crash issue - illegal state 13 in block map

On Thu, Sep 7, 2017 at 7:13 PM, Darrick J. Wong <darrick.wong@xxxxxxxxxx> wrote:
> On Thu, Sep 07, 2017 at 03:58:56PM +0300, Amir Goldstein wrote:
>> Hi guys,
>>
>> I am getting these errors often when running the crash tests
>> with cloned files (generic/502 in my xfstests patches).
>>
>> Hitting these errors requires first fixing 2 other issues
>> that mask this one:
>> "xfs: fix incorrect log_flushed on fsync" (in master)
>> "xfs: fix leftover CoW extent after truncate"
>> available on my tree based on Darrick's simple fix:
>> https://github.com/amir73il/linux/commits/xfs-fsync
>>
>> I get the errors more often (1 out of 5) on a 100G fs on spinning disk.
>> On a 10G fs on SSD they are less frequent.
>> The log in this email was captured on a patched stable 4.9.47 kernel,
>> but I am getting the same errors on a patched upstream kernel.
>>
>> I wasn't able to create a deterministic reproducer, so I am attaching
>> the full log from a failed test along with an IO log that can be
>> replayed on your disk to examine the outcome.
>>
>> Following is the output of fsx process #5, which is the process
>> that wrote the problematic testfile5.mark0 to the log.
>> This process performs only read, zero and fsync operations before
>> creating the log mark.
>> The file testfile5 was cloned from an origin 256K file before
>> running fsx.
>> Later, I used the random seed 35484 from this log for all
>> processes and it seemed to increase the probability of failure.
>>
>> # /old/home/amir/src/xfstests-dev/ltp/fsx -N 100 -d -k -P \
>>     /mnt/test/fsxtests -i /dev/mapper/logwrites-test -S 0 -j 5 \
>>     /mnt/scratch/testfile5
>> Seed set to 35484
>> file_size=262144
>> 5: 1 read 0x3f959 thru 0x3ffff (0x6a7 bytes)
>> 5: 2 zero from 0x3307e to 0x34f74, (0x1ef6 bytes)
>> 5: 3 fsync
>> 5: Dumped fsync buffer to testfile5.mark0
>>
>> In order to get to the crash state you need to get my
>> xfstests replay-log patches and replay the attached log
>> on a >= 100G scratch device:
>>
>> # ./src/log-writes/replay-log --log log.xfs.testfile5.mark0 --replay \
>>     $SCRATCH_DEV --end-mark testfile5.mark0
>> # mount $SCRATCH_DEV $SCRATCH_MNT
>> # umount $SCRATCH_MNT
>> # xfs_repair -n $SCRATCH_DEV
>> Phase 1 - find and verify superblock...
>> Phase 2 - using internal log
>>         - zero log...
>>         - scan filesystem freespace and inode maps...
>>         - found root inode chunk
>> Phase 3 - for each AG...
>>         - scan (but don't clear) agi unlinked lists...
>>         - process known inodes and perform inode discovery...
>>         - agno = 0
>>
>> fatal error -- illegal state 13 in block map 376
>>
>> Can anyone provide some insight?
>
> Looks like I missed a couple of extent states in process_bmbt_reclist_int.
>
> What happens if you add the following (only compile tested) patch to
> xfsprogs?

This is what happens:

Phase 1 - find and verify superblock...
Phase 2 - using internal log
        - zero log...
        - scan filesystem freespace and inode maps...
        - found root inode chunk
Phase 3 - for each AG...
        - scan (but don't clear) agi unlinked lists...
        - process known inodes and perform inode discovery...
        - agno = 0
data fork in regular inode 134 claims CoW block 376
correcting nextents for inode 134
bad data fork in inode 134
would have cleared inode 134
        - agno = 1
        - agno = 2
        - agno = 3
        - process newly discovered inodes...
Phase 4 - check for duplicate blocks...
        - setting up duplicate extent list...
unknown block state, ag 0, block 376
unknown block state, ag 1, block 16
        - check for inodes claiming duplicate blocks...
        - agno = 0
        - agno = 1
        - agno = 2
entry "testfile2" in shortform directory 128 references free inode 134
        - agno = 3
would have junked entry "testfile2" in directory inode 128
imap claims in-use inode 134 is free, would correct imap
Missing reverse-mapping record for (0/376) len 1 owner 134 off 19
No modify flag set, skipping phase 5
Phase 6 - check inode connectivity...
        - traversing filesystem ...
        - traversal finished ...
        - moving disconnected inodes to lost+found ...
Phase 7 - verify link counts...
No modify flag set, skipping filesystem flush and exiting.
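
For anyone who wants to repeat this, the replay-and-check sequence quoted
earlier can be wrapped in a small script. This is only a sketch: the names
replay_and_check, run and DRY_RUN are made up for illustration and are not
part of xfstests. The replay-log path and its --log, --replay and --end-mark
options are copied from the commands above. It assumes the replay-log patches
are applied and that it is run as root from the xfstests directory.

```shell
#!/bin/bash
# Sketch only: replay_and_check, run and DRY_RUN are illustrative names,
# not part of xfstests.

# Print the command when DRY_RUN is set, otherwise execute it.
run() {
    if [ -n "${DRY_RUN:-}" ]; then
        echo "$*"
    else
        "$@"
    fi
}

# usage: replay_and_check <log file> <scratch dev> <scratch mnt> <end mark>
replay_and_check() {
    local log=$1 dev=$2 mnt=$3 mark=$4

    # Replay the logged writes up to the given mark.
    run ./src/log-writes/replay-log --log "$log" --replay "$dev" \
        --end-mark "$mark" || return 1
    # A mount/umount cycle lets XFS run log recovery.
    run mount "$dev" "$mnt" || return 1
    run umount "$mnt" || return 1
    # Check only; -n makes no modifications to the filesystem.
    run xfs_repair -n "$dev"
}
```

With DRY_RUN=1 set, a call such as
replay_and_check log.xfs.testfile5.mark0 "$SCRATCH_DEV" "$SCRATCH_MNT" testfile5.mark0
only prints the four commands, which is a cheap way to check the arguments
before touching the scratch device.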

>
> (Normally I'd say send a metadump too for us mere mortals to work with,
> though I'm about to plunge into weddingland so I likely won't be able to
> do much until the 18th.)
>

Attached (created with xfs_metadump -ao).
Soon we will all be gods with powers to replay history ;)
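
For reference, a dump like the attached one can be produced and unpacked
roughly as follows. Treat this as a sketch: the function names, the run
helper, DRY_RUN and the file names passed in are placeholders. -a and -o are
the xfs_metadump options mentioned above (copy entire metadata blocks, do not
obfuscate names). Restoring with xfs_mdrestore and checking the resulting
image file with xfs_repair -f is an assumed workflow, and the device is
assumed to be unmounted when the dump is taken.

```shell
#!/bin/bash
# Sketch only: make_metadump, check_metadump, run and DRY_RUN are
# illustrative names.

# Print the command when DRY_RUN is set, otherwise execute it.
run() {
    if [ -n "${DRY_RUN:-}" ]; then
        echo "$*"
    else
        "$@"
    fi
}

# usage: make_metadump <unmounted dev> <output file>
# Leaves the compressed dump in <output file>.bz2.
make_metadump() {
    local dev=$1 out=$2
    run xfs_metadump -a -o "$dev" "$out" || return 1
    run bzip2 "$out"
}

# usage: check_metadump <dump.bz2> <image file>
# Restores the dump into an image file and checks it without modifying it.
check_metadump() {
    local dump=$1 img=$2
    run bunzip2 -k "$dump" || return 1
    run xfs_mdrestore "${dump%.bz2}" "$img" || return 1
    # -f tells xfs_repair the target is a regular file, not a block device.
    run xfs_repair -n -f "$img"
}
```

For example, check_metadump metadump.xfs.testfile5.mark0.bz2 testfile5.img
would restore the attachment into testfile5.img (a name chosen here) and run
the read-only check against it.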

> ((Eric: If this doesn't turn out to be a totally garbage patch, feel
> free to add it to xfsprogs.))
>
> --D
>

Attachment: metadump.xfs.testfile5.mark0.bz2
Description: BZip2 compressed data

