On Sun, Jul 24, 2011 at 7:04 AM, Dave Chinner <david@xxxxxxxxxxxxx> wrote:
> On Fri, Jul 22, 2011 at 04:03:53PM +0530, Amit Sahrawat wrote:
>> On Fri, Jul 22, 2011 at 2:04 PM, Dave Chinner <david@xxxxxxxxxxxxx> wrote:
>> > [<c0023000>] (dump_backtrace+0x0/0x110)
>> > [<c02dd668>] (dump_stack+0x0/0x1c)
>> > [<c0176b84>] (xfs_error_report+0x0/0x5c)
>> > [<c0150cd4>] (xfs_free_ag_extent+0x0/0x600)
>> > [<c0152b78>] (xfs_free_extent+0x0/0xa4)
>> > [<c015fea0>] (xfs_bmap_finish+0x0/0x194)
>> > [<c017e664>] (xfs_itruncate_finish+0x0/0x30c)
>> > [<c0197bbc>] (xfs_inactive+0x0/0x40c)
>> > [<c01a3d50>] (xfs_fs_clear_inode+0x0/0x60)
>> > [<c00bdcf8>] (clear_inode+0x0/0xe8)
>> > [<c00be4a8>] (generic_delete_inode+0x0/0x178)
>> > [<c00be620>] (generic_drop_inode+0x0/0x68)
>> > [<c00bd2fc>] (iput+0x0/0x7c)
>> > [<c00b4b44>] (do_unlinkat+0x0/0x154)
>> > [<c00b4c98>] (sys_unlink+0x0/0x1c)
>> >
>> > So, you powered off an active machine while writing to it, and after
>> > it started back up it hit a free space btree corruption. And then
>> > you couldn't mount it because log replay was trying to replay the
>> > last committed transaction to the log. That transaction shows inode
>> > 132 being unlinked, added to the AGI unlinked list, and then being
>> > inactivated. There is an EFI committed for 1 extent. There is no EFD
>> > committed, so the shutdown occurred during that operation. Log
>> > replay then hits the corruption repeatedly by trying to replay the
>> > EFI to complete the extent free operation.
>> >
>> Yes, it happened exactly as you mentioned above. But the problem
>> happened much earlier.
>
> Obviously.
>
>> While writing the 'reset' happened. At the next reboot - the
>> filesystem mounted even though there
>> was corruption - the file and journal did not match.
>
> You found this how?

Just before removing the file on the next 'reboot', I checked the file's
extents and the free-extent lengths in the 'by-block' btree. The file's
number of blocks was correct, but its neighbouring blocks did not match
the blocks accounted as occupied in the allocation group. If a file of
'N' blocks is present in an allocation group, then the allocation group
should have (X - N) free blocks, where 'X' is the size of the allocation
group in blocks. But in this case the free space of the allocation group
was (X - M), where M is greater than 'N'. That means that when a free
request is issued for that file and the corresponding blocks are checked
in the allocation group, some of them will already appear free, which
points to the corruption. Is the above observation correct? (A rough
sketch of how I checked this follows the quoted text below.)

>
>> And it did not
>> show any error either.
>
> Well, no. If the journal format is intact, then its contents are
> trusted.
>
>> The moment the issue happened, if checked through xfs_logprint,
>> the free blocks in the allocation group do not match up properly.
>> I checked by taking the print from xfs_free_ag_extent - it showed the
>> wrong blocks contiguous to that file.
>
> Yes, which is why I suspect that the journal contained something it
> shouldn't have.
>
>> The original file which was to be deleted is of '255' blocks and started
>> from physical block number - 12.
>
> Yes, I saw that in the EFI record in the logprint output I quoted.
>
>> While the free space tree has more free blocks than it should have - it
>> returns the starting block number and the length that were last
>> committed to the log.
>>
>> As I mentioned in the last mail, the details of the observation are
>> above - please check.
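To make the check described above concrete, this is roughly the sequence
I used (a sketch only - the device name matches my setup, the file path is
illustrative, and the exact xfs_db commands and field names may differ
between xfsprogs versions; the device should be unmounted or quiesced):

  # extents actually allocated to the file - gives 'N' and the physical block ranges
  xfs_bmap -v /media/b/<file>

  # free-block count the allocator thinks AG 0 has - should be 'X - N'
  xfs_db -r -c "agf 0" -c "print freeblks" /dev/sdb1

  # summary of the free-space btree contents for AG 0
  xfs_db -r -c "freesp -s -a 0" /dev/sdb1

Comparing the file's extents from xfs_bmap against the AGF free-block
count and the free-space summary shows whether more blocks are marked
free than (X - N) - the (X - M) situation described above.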
>
> Going back to the transaction you quoted first: it doesn't match the
> entries in the log print you attached in your second email. I can't
> comment on it at all.

In the last mail I attached the log from a run where the 'reset' happened
but there was no corruption, just to check what exactly gets recorded in
the journal; this time the journal matched the data on the disk.

>
>> > So, the log and the repair output are useless for determining what
>> > caused the problem - you need the log from the mount *before* the
>> > first shutdown occurred, and to have run repair *before* you
>> > tried to unlink anything.
>>
>> It is only after the 'first shutdown' that I came to know about the
>> corruption. Since there was no error shown at mount time, it did not
>> seem reasonable to run "repair" on the disk.
>
> It's standard practice for power failure integrity testing.

Yes, I agree with that. But why did the mount succeed in the first place?
If it had returned some error, xfs_repair could have been run.

>
>> But at the same time I checked the case (reset while direct-IO) where
>> there was no issue - I tried to check the logs the same way as you
>> mentioned - reset - log_print - check using xfs_repair and then mount -
>> and the behaviour was as per the logs: the number of committed
>> transactions and the files created were OK.
>> Logs are attached for the xfs_logprint output.
>>
>> Mount after taking the log prints:
>>
>> #mount /dev/sdb1 /media/b
>> XFS mounting filesystem sda1
>> Starting XFS recovery on filesystem: sda1 (logdev: internal)
>> Ending XFS recovery on filesystem: sda1 (logdev: internal)
>>
>> #> ls -li /media/b
>> 131 -rwxr-xr-x 1 root 0 2887184 Jan 1 00:00 test_code
>> 132 -rw-r--r-- 1 root 0 1044480 Jan 1 00:00 direct_io_file_0
>> 133 -rw-r--r-- 1 root 0  819200 Jan 1 00:00 direct_io_file_1
>
> Last entry in the log for inode 133 is this:
>
> Oper (480): tid: 738ef289 len: 56 clientid: TRANS flags: none
> INODE: #regs: 3 ino: 0x85 flags: 0x5 dsize: 16
> blkno: 64 len: 16 boff: 1280
> Oper (481): tid: 738ef289 len: 96 clientid: TRANS flags: none
> INODE CORE
> magic 0x494e mode 0100644 version 2 format 2
> nlink 1 uid 0 gid 0
> atime 0xc mtime 0xd ctime 0xd
> size 0xc9400 nblocks 0xcb extsize 0x0 nextents 0x1
> naextents 0x0 forkoff 0 dmevmask 0x0 dmstate 0x0
> flags 0x0 gen 0x0
> Oper (482): tid: 738ef289 len: 16 clientid: TRANS flags: none
> EXTENTS inode data
>
> The inode size is 0xc9400 = 824,320 bytes, and blocks = 0xcb = 203 =
> 831,488 bytes. So there's nothing wrong here. Indeed, it appears
> this transaction was not replayed because:
>
> ----------------------------------------------------------------------------
> Oper (477): tid: 44eff979 len: 0 clientid: TRANS flags: COMMIT
> Oper (478): tid: 738ef289 len: 0 clientid: TRANS flags: START
> Oper (479): tid: 738ef289 len: 16 clientid: TRANS flags: none
> Oper (480): tid: 738ef289 len: 56 clientid: TRANS flags: none
> Oper (481): tid: 738ef289 len: 96 clientid: TRANS flags: none
> Oper (482): tid: 738ef289 len: 16 clientid: TRANS flags: none
> Oper (483): tid: 738ef289 len: 24 clientid: TRANS flags: none
> Oper (484): tid: 738ef289 len: 128 clientid: TRANS flags: none
> Oper (485): tid: 738ef289 len: 28 clientid: TRANS flags: none
> Oper (486): tid: 738ef289 len: 128 clientid: TRANS flags: none
> Oper (487): tid: 738ef289 len: 20 clientid: TRANS flags: CONTINUE
> BUF: #regs: 2 Not printing rest of data
>
> ============================================================================
> xfs_logprint: skipped 3510 cleared blocks in range: 592 - 4101
> xfs_logprint: skipped 61434 zeroed blocks in range: 4102 - 65535
> xfs_logprint: physical end of log
> ============================================================================
> xfs_logprint: logical end of log
> ============================================================================
>
> The transaction was never committed completely in the journal. The
> last complete transaction in the journal matches up with the inode
> size in your ls -l output above.

Yes, this is what I wanted to show with the latest log prints. But this is
not the case when there is an issue.

>
> Your test code is obviously doing single block allocation (probably
> 4k DIO append writes).

The test case just writes to a file using direct I/O: the write buffer is
5K, and the write request is issued in a loop 204 times, giving a final
file size of 1020K.
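This is not the actual test program, but a roughly equivalent sketch of the
I/O pattern (assuming a dd that supports oflag=direct; the mount point and
file name are taken from the listing above):

  # 204 direct-I/O writes of 5K each -> 1020K file
  dd if=/dev/zero of=/media/b/direct_io_file_0 bs=5k count=204 oflag=direct

Each write is 5120 bytes, a multiple of the 512-byte sector size, so it
satisfies the usual O_DIRECT alignment requirement, and 204 x 5K = 1020K
matches the final file size reported above.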
> I can't see any inconsistencies with what is
> in the journal, with what you see from ls -l, nor any
> inconsistencies with the AGF space accounting. Hence I suspect that
> it's either a barrier problem, or perhaps yet another manifestation
> of vmap cache aliasing causing random corruption during journal
> replay. Can you verify that the transactions being replayed match
> the contents of the logprint output (i.e. the entire transactions
> including buffer data matches, and not just the transaction IDs).

Barriers are enabled on my target, and it does not have the cache aliasing
problem either - it is an ARM Cortex-A9, single core.

>
> Cheers,
>
> Dave.
> --
> Dave Chinner
> david@xxxxxxxxxxxxx

_______________________________________________
xfs mailing list
xfs@xxxxxxxxxxx
http://oss.sgi.com/mailman/listinfo/xfs