On Tue, Jul 05, 2011 at 09:03:19AM +0300, Török Edwin wrote:
> Hi,
>
> Yesterday when running 'shutdown -Pfh now', it hung using 99% CPU in sys [*]
> Looking at the console there was a message about XFS "Corruption of in-memory data detected", and about XFS_WANT_CORRUPTED_GOTO.

So you had a btree corruption.

> Had to shutdown the machine via SysRQ u + o.
>
> Today when I booted I got this message:
> [ 9.786494] XFS (md1p2): Mounting Filesystem
> [ 9.927590] XFS (md1p2): Starting recovery (logdev: /dev/disk/by-id/scsi-SATA_WDC_WD740ADFD-0_WD-WMARF1007797-part5)
> [ 10.385941] XFS: Internal error XFS_WANT_CORRUPTED_GOTO at line 1638 of file fs/xfs/xfs_alloc.c. Caller 0xffffffff8122b80e
> [ 10.385943]
> [ 10.386007] Pid: 1990, comm: mount Not tainted 3.0.0-rc5 #155
> [ 10.386009] Call Trace:
> [ 10.386014] [<ffffffff812551ca>] xfs_error_report+0x3a/0x40
> [ 10.386017] [<ffffffff8122b80e>] ? xfs_free_extent+0xce/0x120
> [ 10.386019] [<ffffffff81227e06>] ? xfs_alloc_lookup_eq+0x16/0x20
> [ 10.386021] [<ffffffff81228f4a>] xfs_free_ag_extent+0x6aa/0x780
> [ 10.386023] [<ffffffff8122b80e>] xfs_free_extent+0xce/0x120
> [ 10.386026] [<ffffffff8127b0ff>] ? kmem_zone_alloc+0x5f/0xe0
> [ 10.386029] [<ffffffff81268e9f>] xlog_recover_process_efi+0x15f/0x1a0
> [ 10.386031] [<ffffffff8126ab26>] xlog_recover_process_efis.isra.4+0x76/0xc0
> [ 10.386033] [<ffffffff8126de62>] xlog_recover_finish+0x22/0xc0
> [ 10.386035] [<ffffffff81265aa4>] xfs_log_mount_finish+0x24/0x30
> [ 10.386038] [<ffffffff81270aab>] xfs_mountfs+0x45b/0x720
> [ 10.386040] [<ffffffff81288741>] xfs_fs_fill_super+0x1f1/0x2e0
> [ 10.386042] [<ffffffff811573aa>] mount_bdev+0x1aa/0x1f0
> [ 10.386044] [<ffffffff81288550>] ? xfs_parseargs+0xb90/0xb90
> [ 10.386046] [<ffffffff812866b0>] xfs_fs_mount+0x10/0x20
> [ 10.386048] [<ffffffff81157c3e>] mount_fs+0x3e/0x1b0
> [ 10.386051] [<ffffffff81171807>] vfs_kern_mount+0x57/0xa0
> [ 10.386052] [<ffffffff81171c4f>] do_kern_mount+0x4f/0x100
> [ 10.386054] [<ffffffff811732dc>] do_mount+0x19c/0x840
> [ 10.386057] [<ffffffff8110fa12>] ? __get_free_pages+0x12/0x50
> [ 10.386059] [<ffffffff81172fc5>] ? copy_mount_options+0x35/0x170
> [ 10.386061] [<ffffffff81173d0b>] sys_mount+0x8b/0xe0
> [ 10.386064] [<ffffffff814c19fb>] system_call_fastpath+0x16/0x1b
> [ 10.386071] XFS (md1p2): Failed to recover EFIs
> [ 10.386097] XFS (md1p2): log mount finish failed
> [ 10.428562] XFS (md1p3): Mounting Filesystem
> [ 10.609949] XFS (md1p3): Ending clean mount
>
> FWIW I got a message about EFIs yesterday too, but everything else worked:
> Jul 4 09:42:54 debian kernel: [ 11.439861] XFS (md1p2): Mounting Filesystem
> Jul 4 09:42:54 debian kernel: [ 11.599815] XFS (md1p2): Starting recovery (logdev: /dev/disk/by-id/scsi-SATA_WDC_WD740ADFD-0_WD-WMARF1007797-part5)
> Jul 4 09:42:54 debian kernel: [ 11.787980] XFS (md1p2): I/O error occurred: meta-data dev md1p2 block 0x117925a8 ("xfs_trans_read_buf") error 5 buf count 4096
> Jul 4 09:42:54 debian kernel: [ 11.788044] XFS (md1p2): Failed to recover EFIs
> Jul 4 09:42:54 debian kernel: [ 11.788065] XFS (md1p2): log mount finish failed
> Jul 4 09:42:54 debian kernel: [ 11.831077] XFS (md1p3): Mounting Filesystem
> Jul 4 09:42:54 debian kernel: [ 12.009647] XFS (md1p3): Ending clean mount

Looks like you might have a dying disk. That's an IO error on read that has been reported back to XFS, and it warned that bad things happened. Maybe XFS should have shut down, though.
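[A rough sketch of the sort of checks that would confirm a failing drive before going any further - not part of the original reply. Device names are placeholders, and it assumes smartmontools and mdadm are installed:]

    # SMART health and error counters for each member of the md1 array
    # (replace sdX with the real member devices)
    smartctl -a /dev/sdX | grep -iE 'overall-health|reallocated|pending|uncorrect'

    # has md already logged read errors or kicked a member?
    cat /proc/mdstat
    mdadm --detail /dev/md1

    # recent low-level I/O errors reported by the kernel
    dmesg | grep -iE 'i/o error|ata[0-9]|sd[a-z]'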
> UUID=6f7c65b9-40b2-4b05-9157-522a67f65c4a /mnt/var_data xfs defaults,noatime,nodiratime,logbufs=8,logbsize=256k,logdev=/dev/disk/by-id/scsi-SATA_WDC_WD740ADFD-0_WD-WMARF1007797-part5 0 2
>
> I can't mount the FS anymore:
> mount: Structure needs cleaning

Obviously - you've got corrupted free space btrees thanks to the IO error during recovery and the later operations that were done on it. Now log recovery can't complete without hitting those corruptions.

> So I used xfs_repair /dev/md1p2 -l /dev/sdi5 -L, and then I could mount the log.

Did you replace the faulty disk? If not, this will just happen again...

> I did save the faulty log-file, let me know if you need it:
> -rw-r--r-- 1 edwin edwin 2.9M Jul 5 09:00 sdi5.xz
>
> This is on a 3.0-rc5 kernel, my .config is below:
>
> I've run perf top with the hung shutdown, and it showed me something like this:
> 1964.00 16.3% filemap_fdatawait_range          /lib/modules/3.0.0-rc5/build/vmlinux
> 1831.00 15.2% _raw_spin_lock                   /lib/modules/3.0.0-rc5/build/vmlinux
> 1316.00 10.9% iput                             /lib/modules/3.0.0-rc5/build/vmlinux
> 1265.00 10.5% _atomic_dec_and_lock             /lib/modules/3.0.0-rc5/build/vmlinux
>  998.00  8.3% _raw_spin_unlock                 /lib/modules/3.0.0-rc5/build/vmlinux
>  731.00  6.1% sync_inodes_sb                   /lib/modules/3.0.0-rc5/build/vmlinux
>  724.00  6.0% find_get_pages_tag               /lib/modules/3.0.0-rc5/build/vmlinux
>  580.00  4.8% radix_tree_gang_lookup_tag_slot  /lib/modules/3.0.0-rc5/build/vmlinux
>  525.00  4.3% __rcu_read_unlock                /lib/modules/3.0.0-rc5/build/vmlinux

Looks like it is running around trying to write back data, stuck somewhere in the code outside XFS. I haven't seen anything like this before. Still, the root cause is likely a bad disk or driver, so finding and fixing that is probably the first thing you should do...

Cheers,

Dave.
--
Dave Chinner
david@xxxxxxxxxxxxx

_______________________________________________
xfs mailing list
xfs@xxxxxxxxxxx
http://oss.sgi.com/mailman/listinfo/xfs
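[For reference, a hedged sketch of the repair sequence described earlier in the thread, using the device paths quoted above. It is not a recipe from the reply itself; -n does a read-only check first, and -L zeroes the external log only as a last resort, discarding any metadata updates still sitting in it:]

    # read-only check first (makes no changes), pointing xfs_repair at the external log
    xfs_repair -n -l /dev/sdi5 /dev/md1p2

    # only if log recovery cannot complete: zero the log and repair
    xfs_repair -L -l /dev/sdi5 /dev/md1p2

    # then try mounting again with the same logdev option used in fstab
    mount -o logdev=/dev/sdi5 /dev/md1p2 /mnt/var_data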