[Amit - please don't top post. If you are going to quote the previous
email, please reply after the quoted text.]

[Alex - I haven't seen either of the previous two emails in this thread
from Amit - has SGI made it onto a spam-blocking RBL again?]

On Fri, Jul 22, 2011 at 12:29:21PM +0530, Amit Sahrawat wrote:
> On Fri, Jul 22, 2011 at 10:53 AM, Amit Sahrawat
> <amit.sahrawat83@xxxxxxxxx> wrote:
> > On Fri, Jul 22, 2011 at 10:22 AM, Amit Sahrawat
> > <amit.sahrawat83@xxxxxxxxx> wrote:
> >> Dear All,
> >>
> >> Target : ARM
> >>
> >> Recently I encountered a corruption on XFS for RC-3. While the
> >> DIRECT-IO for a file was in operation (Write operation) there was a
> >> power reset - Only one file at a time is being written to the disk
> >> using DIO.. After reboot on mounting I just tried to remove the file
> >> and encountered the below mentioned corruption. The hard disk is not
> >> able to mount after this, only after clearing logs (xfs_repair –L) –

Lots of weird characters in your email...

> >> disk is able to mount
> >>
> >> XFS mounting filesystem sda1
> >> XFS internal error XFS_WANT_CORRUPTED_GOTO at line 1535 of file
> >> fs/xfs/xfs_alloc.c. Caller 0xc0152c04
> >> Backtrace:
> >> [<c0023000>] (dump_backtrace+0x0/0x110) from [<c02dd680>] (dump_stack+0x18/0x1c)
> >> r6:00000000 r5:c0152c04 r4:00000075 r3:e3ec1c88
> >> [<c02dd668>] (dump_stack+0x0/0x1c) from [<c0176bd0>]
> >> (xfs_error_report+0x4c/0x5c)
> >> [<c0176b84>] (xfs_error_report+0x0/0x5c) from [<c01510d4>]
> >> (xfs_free_ag_extent+0x400/0x600)
> >> [<c0150cd4>] (xfs_free_ag_extent+0x0/0x600) from [<c0152c04>]
> >> (xfs_free_extent+0x8c/0xa4)
> >> [<c0152b78>] (xfs_free_extent+0x0/0xa4) from [<c015ffa8>]
> >> (xfs_bmap_finish+0x108/0x194)
> >> r7:e3ec1e10 r6:00000000 r5:e3737870 r4:e373e000
> >> [<c015fea0>] (xfs_bmap_finish+0x0/0x194) from [<c017e840>]
> >> (xfs_itruncate_finish+0x1dc/0x30c)
> >> [<c017e664>] (xfs_itruncate_finish+0x0/0x30c) from [<c0197dc8>]
> >> (xfs_inactive+0x20c/0x40c)
> >> [<c0197bbc>] (xfs_inactive+0x0/0x40c) from [<c01a3da0>]
> >> (xfs_fs_clear_inode+0x50/0x60)
> >> r9:e3ec0000 r8:c001f128 r7:00000000 r6:e4671a80 r5:c0312454
> >> r4:e4667300
> >> [<c01a3d50>] (xfs_fs_clear_inode+0x0/0x60) from [<c00bdd84>]
> >> (clear_inode+0x8c/0xe8)
> >> r4:e4667420 r3:c01a3d50
> >> [<c00bdcf8>] (clear_inode+0x0/0xe8) from [<c00be584>]
> >> (generic_delete_inode+0xdc/0x178)
> >> r4:e4667420 r3:ffffffff
> >> [<c00be4a8>] (generic_delete_inode+0x0/0x178) from [<c00be640>]
> >> (generic_drop_inode+0x20/0x68)
> >> r5:00000000 r4:e4667420
> >> [<c00be620>] (generic_drop_inode+0x0/0x68) from [<c00bd368>] (iput+0x6c/0x7c)
> >> r4:e4667420 r3:c00be620
> >> [<c00bd2fc>] (iput+0x0/0x7c) from [<c00b4c40>] (do_unlinkat+0xfc/0x154)
> >> r4:e4667420 r3:00000000
> >> [<c00b4b44>] (do_unlinkat+0x0/0x154) from [<c00b4cb0>] (sys_unlink+0x18/0x1c)
> >> r7:0000000a r6:00000000 r5:00000000 r4:be90299b
> >> [<c00b4c98>] (sys_unlink+0x0/0x1c) from [<c001ef80>] (ret_fast_syscall+0x0/0x30)
> >> xfs_force_shutdown(sda1,0x8) called from line 4047 of file
> >> fs/xfs/xfs_bmap.c. Return address = 0xc015ffec
> >> Filesystem "sda1": Corruption of in-memory data detected. Shutting
> >> down filesystem: sda1
> >> Please umount the filesystem, and rectify the problem(s)

I've asked this before: please trim/paste your stack traces so they
don't line wrap and are human readable. i.e.:

[<c0023000>] (dump_backtrace+0x0/0x110)
[<c02dd668>] (dump_stack+0x0/0x1c)
[<c0176b84>] (xfs_error_report+0x0/0x5c)
[<c0150cd4>] (xfs_free_ag_extent+0x0/0x600)
[<c0152b78>] (xfs_free_extent+0x0/0xa4)
[<c015fea0>] (xfs_bmap_finish+0x0/0x194)
[<c017e664>] (xfs_itruncate_finish+0x0/0x30c)
[<c0197bbc>] (xfs_inactive+0x0/0x40c)
[<c01a3d50>] (xfs_fs_clear_inode+0x0/0x60)
[<c00bdcf8>] (clear_inode+0x0/0xe8)
[<c00be4a8>] (generic_delete_inode+0x0/0x178)
[<c00be620>] (generic_drop_inode+0x0/0x68)
[<c00bd2fc>] (iput+0x0/0x7c)
[<c00b4b44>] (do_unlinkat+0x0/0x154)
[<c00b4c98>] (sys_unlink+0x0/0x1c)

So, you powered off an active machine while writing to it, and after
it started back up it hit a free space btree corruption. And then you
couldn't mount it because log replay was trying to replay the last
committed transaction to the log. That transaction shows inode 132
being unlinked, added to the AGI unlinked list, and then being
inactivated. There is an EFI committed for 1 extent. There is no EFD
committed, so the shutdown occurred during that operation. Log replay
then hits the corruption repeatedly by trying to replay the EFI to
complete the extent free operation.

So, the log and the repair output are useless for determining what
caused the problem - you need the log from the mount *before* the
first shutdown occurred, and to have run repair *before* you tried to
unlink anything.

IOWs, if you are doing power fail testing, you need to test the
validity of your filesystems before you do anything else on them.
e.g. once powered back up, copy the log before mounting the
filesystem, then mount it to replay the log, then unmount it and run
xfs_repair -n to check it. That way you'll catch any problem caused by
the power loss and have some hope of determining what caused it
because you preserved the original log....
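Concretely, that check sequence might look like the sketch below. The
device name (/dev/sda1, taken from the report) and the /mnt mount
point are assumptions; xfs_logprint's -C option copies the on-disk log
out to a file verbatim, before replay can modify it:

  # 1. Preserve the log before anything mounts the filesystem:
  xfs_logprint -C /tmp/sda1-log.bin /dev/sda1

  # 2. Mount to replay the log, then unmount again:
  mount /dev/sda1 /mnt
  umount /mnt

  # 3. Check the filesystem without modifying it; a failure here means
  #    stop and investigate using the log saved in step 1:
  xfs_repair -n /dev/sda1

  # Also worth capturing: whether the drive's volatile write cache is
  # enabled, which matters for the barrier question below:
  hdparm -W /dev/sda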
However, seeing as this was due to power failure, I have to ask the
obvious question: does your hardware correctly support barriers/cache
flush/FUA operations and are they turned on?

Cheers,

Dave.
-- 
Dave Chinner
david@xxxxxxxxxxxxx