[Amit - please don't top post. If you are going to quote the previous
email, please reply after the quoted text.]

[Alex - I haven't seen either of the previous two emails in this thread
from Amit - has SGI made it onto a spam-blocking RBL again?]

On Fri, Jul 22, 2011 at 12:29:21PM +0530, Amit Sahrawat wrote:
> On Fri, Jul 22, 2011 at 10:53 AM, Amit Sahrawat
> <amit.sahrawat83@xxxxxxxxx> wrote:
> > On Fri, Jul 22, 2011 at 10:22 AM, Amit Sahrawat
> > <amit.sahrawat83@xxxxxxxxx> wrote:
> >> Dear All,
> >>
> >> Target : ARM
> >>
> >> Recently I encountered a corruption on XFS for RC-3. While the
> >> DIRECT-IO for a file was in operation (Write operation) there was a
> >> power reset - Only one file at a time is being written to the disk
> >> using DIO.. After reboot on mounting I just tried to remove the file
> >> and encountered the below mentioned corruption. The hard disk is not
> >> able to mount after this, only after clearing logs (xfs_repair –L) –

Lots of weird characters in your email...

> >> disk is able to mount
> >>
> >> XFS mounting filesystem sda1
> >> XFS internal error XFS_WANT_CORRUPTED_GOTO at line 1535 of file
> >> fs/xfs/xfs_alloc.c. Caller 0xc0152c04
> >> Backtrace:
> >> [<c0023000>] (dump_backtrace+0x0/0x110) from [<c02dd680>] (dump_stack+0x18/0x1c)
> >> r6:00000000 r5:c0152c04 r4:00000075 r3:e3ec1c88
> >> [<c02dd668>] (dump_stack+0x0/0x1c) from [<c0176bd0>]
> >> (xfs_error_report+0x4c/0x5c)
> >> [<c0176b84>] (xfs_error_report+0x0/0x5c) from [<c01510d4>]
> >> (xfs_free_ag_extent+0x400/0x600)
> >> [<c0150cd4>] (xfs_free_ag_extent+0x0/0x600) from [<c0152c04>]
> >> (xfs_free_extent+0x8c/0xa4)
> >> [<c0152b78>] (xfs_free_extent+0x0/0xa4) from [<c015ffa8>]
> >> (xfs_bmap_finish+0x108/0x194)
> >> r7:e3ec1e10 r6:00000000 r5:e3737870 r4:e373e000
> >> [<c015fea0>] (xfs_bmap_finish+0x0/0x194) from [<c017e840>]
> >> (xfs_itruncate_finish+0x1dc/0x30c)
> >> [<c017e664>] (xfs_itruncate_finish+0x0/0x30c) from [<c0197dc8>]
> >> (xfs_inactive+0x20c/0x40c)
> >> [<c0197bbc>] (xfs_inactive+0x0/0x40c) from [<c01a3da0>]
> >> (xfs_fs_clear_inode+0x50/0x60)
> >> r9:e3ec0000 r8:c001f128 r7:00000000 r6:e4671a80 r5:c0312454
> >> r4:e4667300
> >> [<c01a3d50>] (xfs_fs_clear_inode+0x0/0x60) from [<c00bdd84>]
> >> (clear_inode+0x8c/0xe8)
> >> r4:e4667420 r3:c01a3d50
> >> [<c00bdcf8>] (clear_inode+0x0/0xe8) from [<c00be584>]
> >> (generic_delete_inode+0xdc/0x178)
> >> r4:e4667420 r3:ffffffff
> >> [<c00be4a8>] (generic_delete_inode+0x0/0x178) from [<c00be640>]
> >> (generic_drop_inode+0x20/0x68)
> >> r5:00000000 r4:e4667420
> >> [<c00be620>] (generic_drop_inode+0x0/0x68) from [<c00bd368>] (iput+0x6c/0x7c)
> >> r4:e4667420 r3:c00be620
> >> [<c00bd2fc>] (iput+0x0/0x7c) from [<c00b4c40>] (do_unlinkat+0xfc/0x154)
> >> r4:e4667420 r3:00000000
> >> [<c00b4b44>] (do_unlinkat+0x0/0x154) from [<c00b4cb0>] (sys_unlink+0x18/0x1c)
> >> r7:0000000a r6:00000000 r5:00000000 r4:be90299b
> >> [<c00b4c98>] (sys_unlink+0x0/0x1c) from [<c001ef80>] (ret_fast_syscall+0x0/0x30)
> >> xfs_force_shutdown(sda1,0x8) called from line 4047 of file
> >> fs/xfs/xfs_bmap.c. Return address = 0xc015ffec
> >> Filesystem "sda1": Corruption of in-memory data detected. Shutting
> >> down filesystem: sda1
> >> Please umount the filesystem, and rectify the problem(s)

I've asked this before: please trim/paste your stack traces so they
don't line wrap and are human readable. i.e.:

[<c0023000>] (dump_backtrace+0x0/0x110)
[<c02dd668>] (dump_stack+0x0/0x1c)
[<c0176b84>] (xfs_error_report+0x0/0x5c)
[<c0150cd4>] (xfs_free_ag_extent+0x0/0x600)
[<c0152b78>] (xfs_free_extent+0x0/0xa4)
[<c015fea0>] (xfs_bmap_finish+0x0/0x194)
[<c017e664>] (xfs_itruncate_finish+0x0/0x30c)
[<c0197bbc>] (xfs_inactive+0x0/0x40c)
[<c01a3d50>] (xfs_fs_clear_inode+0x0/0x60)
[<c00bdcf8>] (clear_inode+0x0/0xe8)
[<c00be4a8>] (generic_delete_inode+0x0/0x178)
[<c00be620>] (generic_drop_inode+0x0/0x68)
[<c00bd2fc>] (iput+0x0/0x7c)
[<c00b4b44>] (do_unlinkat+0x0/0x154)
[<c00b4c98>] (sys_unlink+0x0/0x1c)

So, you powered off an active machine while writing to it, and after
it started back up it hit a free space btree corruption. And then you
couldn't mount it because log replay was trying to replay the last
committed transaction to the log. That transaction shows inode 132
being unlinked, added to the AGI unlinked list, and then being
inactivated. There is an EFI committed for 1 extent. There is no EFD
committed, so the shutdown occurred during that operation. Log replay
then hits the corruption repeatedly by trying to replay the EFI to
complete the extent free operation.

So, the log and the repair output are useless for determining what
caused the problem - you need the log from the mount *before* the
first shutdown occurred, and to have run repair *before* you tried to
unlink anything.

IOWs, if you are doing power fail testing, you need to test the
validity of your filesystems before you do anything else on them.
e.g. once powered back up, copy the log before mounting the
filesystem, then mount it to replay the log, then unmount it and run
xfs_repair -n to check it. That way you'll catch any problem caused by
the power loss and have some hope of determining what caused it
because you preserved the original log....
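Concretely, that check sequence might look like the sketch below. The
device name (/dev/sda1, taken from the report) and the /mnt mount
point are assumptions; xfs_logprint's -C option copies the on-disk log
out to a file verbatim, before replay can modify it:

  # 1. Preserve the log before anything mounts the filesystem:
  xfs_logprint -C /tmp/sda1-log.bin /dev/sda1

  # 2. Mount to replay the log, then unmount again:
  mount /dev/sda1 /mnt
  umount /mnt

  # 3. Check the filesystem without modifying it; a failure here means
  #    stop and investigate using the log saved in step 1:
  xfs_repair -n /dev/sda1

  # Also worth capturing: whether the drive's volatile write cache is
  # enabled, which matters for the barrier question below:
  hdparm -W /dev/sda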
However, seeing as this was due to power failure, I have to ask the
obvious question: does your hardware correctly support barriers/cache
flush/FUA operations and are they turned on?

Cheers,

Dave.
-- 
Dave Chinner
david@xxxxxxxxxxxxx