Re: bug in xfs: can't recovery metadata log

Christoph Hellwig <hch@xxxxxxxxxxxxx> · Tue, 7 Jun 2011 06:32:52 -0400

On Tue, Jun 07, 2011 at 01:20:23PM +0800, Drunkard Zhang wrote:
> The log recovery failure happened after a hard reboot, I did "mount
> /dev/lg/log /mnt/temp/" twice, but the similar dmesg error.
> 
> The xfs lives on LVM, with 4x2TB SATA II disk.
> 
> The first time:
> [ 1479.130446] XFS mounting filesystem dm-0
> [ 1479.226525] Starting XFS recovery on filesystem: dm-0 (logdev: internal)
> [ 1506.217842] BUG: unable to handle kernel NULL pointer dereference
> at 00000000000000f8

[...]

> [ 1506.220989] RIP: 0010:[<ffffffff81235f9c>]  [<ffffffff81235f9c>]
> xfs_cmn_err+0x6b/0x92

[...]

> [ 1506.226301]  [<ffffffff8122922b>] ? kmem_zone_zalloc+0x1f/0x30
> [ 1506.226549]  [<ffffffff812098b5>] xfs_error_report+0x39/0x40
> [ 1506.226805]  [<ffffffff811e8340>] ? xfs_free_extent+0x8e/0xae
> [ 1506.227056]  [<ffffffff811e75cf>] xfs_free_ag_extent+0x3e7/0x70b
> [ 1506.227306]  [<ffffffff811e8340>] xfs_free_extent+0x8e/0xae

It looks like you hit one of the XFS_WANT_CORRUPTED_GOTO checks in
xfs_free_ag_extent, which called xfs_error_report, and the reporting
path then touched something that isn't initialized that early during the
mount process.  My guess is that it's actually the mp->m_fsname
dereference in xfs_fs_vcmn_err.  It's fixed by the message rework in
2.6.39+, but that will only prevent the crash; you'll still get an error
and the log recovery will be aborted.  If you can get a more recent
kernel on the box I'd be curious what the output from it is.
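
To illustrate the kind of failure described above, here is a minimal
user-space sketch, not the kernel code.  struct demo_mount, its layout
and the demo_* helpers are hypothetical names made up for this example;
the real struct xfs_mount looks different, and which pointer and field
were actually involved in this oops is not established here.  The sketch
shows two things: reading a field at offset 0xf8 through a NULL struct
pointer faults at address 0xf8, and a message helper can avoid the crash
by checking the mount and its name before using them.

```c
#include <assert.h>
#include <stddef.h>
#include <stdio.h>

/* Hypothetical stand-in for a mount structure; NOT the real xfs_mount. */
struct demo_mount {
    char pad[0xf8];      /* placeholder for whatever fields come first */
    const char *fsname;  /* put at offset 0xf8 purely for illustration */
};

/* With mp == NULL, reading mp->fsname would access address 0 + 0xf8. */
static_assert(offsetof(struct demo_mount, fsname) == 0xf8,
              "illustrative layout places fsname at 0xf8");

/* Offset that would show up as the fault address for a NULL mp. */
static size_t demo_fault_offset(void)
{
    return offsetof(struct demo_mount, fsname);
}

/*
 * Write a message prefix into buf without dereferencing a missing
 * mount or a name that has not been set up yet.  Returns what
 * snprintf returns: the length of the prefix.
 */
static int demo_format_prefix(char *buf, size_t len,
                              const struct demo_mount *mp)
{
    if (mp && mp->fsname)
        return snprintf(buf, len, "XFS (%s): ", mp->fsname);
    return snprintf(buf, len, "XFS: ");
}
```

Calling demo_format_prefix(buf, sizeof(buf), NULL) falls back to the
plain "XFS: " prefix instead of faulting, which is the behaviour you
want from an error path that can run this early in mount.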

Did you run older kernels on this machine before?  Before 2.6.33, device
mapper support for barriers (aka cache flushes) was incomplete and
frequently led to free space corruption if people left the volatile
write caches on.  For MD underneath, it took even a bit longer.

If you just want to continue using the filesystem you can nuke the
log using xfs_repair -L.

_______________________________________________
xfs mailing list
xfs@xxxxxxxxxxx
http://oss.sgi.com/mailman/listinfo/xfs