Re: Kernel BUG when running xfs_fsr with 2.6.35.1

Dave Chinner <david@xxxxxxxxxxxxx> · Wed, 18 Aug 2010 09:03:57 +1000

On Tue, Aug 17, 2010 at 08:05:35PM +0300, Arto Jantunen wrote:
> Dave Chinner <david@xxxxxxxxxxxxx> writes:
> >> I had a kernel BUG yesterday when running xfs_fsr on my Debian Unstable
> >> laptop. The kernel is upstream 2.6.35.1. I'm attaching the backtrace
> >> below. I haven't tried reproducing the problem yet and don't know if it is
> >> reproducible. I can try that, and test patches etc. if it is useful. Let me
> >> know if there is any other information I can provide to help with debugging.
> >
> > It's not obvious what has gone wrong at all - I haven't seen
> > anything like this in all my recent testing, so it's something new.
> > The first oops implies the inode has not been joined to the
> > transaction, but from code inspection I cannot see how that can
> > happen.
> 
> I tried to reproduce the problem, and this time xfs_fsr finished without
> reporting errors, but the kernel output the following two lines (one of which
> is essentially empty):
> 
> [ 6372.878945] Filesystem "sda4": Access to block zero in inode 67203861
> start_block: 0 start_off: 0 blkcnt: 0 extent-state: 0 lastx: 2
> [ 6372.878950]

That's a corrupt extent record - it's all zeros, and judging by the
fact that it's only got 2 extents, it's probably inline in the inode
(i.e. the inode fork has been zeroed).
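
To illustrate why an all-zero record produces exactly the fields in that
message, here is a minimal sketch of unpacking a 128-bit extent record.
The bit layout used (1 flag bit, 54 bits of file offset, 52 bits of start
block, 21 bits of block count) is my reading of the on-disk format and
should be treated as an assumption; unpack_extent is a hypothetical helper,
not a kernel or xfsprogs function.

```python
def unpack_extent(l0, l1):
    """Unpack two 64-bit words of an extent record (assumed layout)."""
    state = l0 >> 63                                   # extent-state flag
    start_off = (l0 >> 9) & ((1 << 54) - 1)            # file offset, in blocks
    start_block = ((l0 & 0x1FF) << 43) | (l1 >> 21)    # 9 high bits + 43 low bits
    blkcnt = l1 & ((1 << 21) - 1)                      # length, in blocks
    return {"start_block": start_block, "start_off": start_off,
            "blkcnt": blkcnt, "extent-state": state}

# A zeroed record decodes to every field being zero, which matches the
# "start_block: 0 start_off: 0 blkcnt: 0 extent-state: 0" warning.
print(unpack_extent(0, 0))
```

A real extent can never have a block count of zero or start at block zero,
which is why the kernel flags this record.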

> 
> I decided to boot from a usb stick and try xfs_repair -n, I have attached the
> output of that. There were errors reported. Is this simply a case of random
> (possibly hardware related) fs corruption, or were the errors actually caused
> by the xfs_fsr run that crashed the system? Is there a way to tell from this
> data, is there anything else I can provide?
....

> Phase 1 - find and verify superblock...
> Phase 2 - using internal log
>         - scan filesystem freespace and inode maps...
>         - found root inode chunk
> Phase 3 - for each AG...
>         - scan (but don't clear) agi unlinked lists...
> error following ag 0 unlinked list
> error following ag 2 unlinked list
> error following ag 3 unlinked list

Ok, so a corrupt set of inode unlinked lists.

>         - process known inodes and perform inode discovery...
>         - agno = 0
> b766fb90: Badness in key lookup (length)
> bp=(bno 208, len 16384 bytes) key=(bno 208, len 8192 bytes)
> b766fb90: Badness in key lookup (length)
> bp=(bno 720, len 16384 bytes) key=(bno 720, len 8192 bytes)

[snip]

> Phase 6 - check inode connectivity...
>         - traversing filesystem ...
>         - traversal finished ...
>         - moving disconnected inodes to lost+found ...
> disconnected inode 475, would move to lost+found
> disconnected inode 1457, would move to lost+found

[snip]

> Phase 7 - verify link counts...
> would have reset inode 475 nlinks from 0 to 1
> would have reset inode 1457 nlinks from 0 to 1

Ok, so inode #475 is in the inode chunk at block 208, likewise
inode #1457 is in the chunk at bno 720. This all implies that
at some point there's been a problem with the second phase of
the unlink procedure and freeing the inode cluster. Judging by the
"Badness in key lookup" errors, it looks like the inode cluster has
been partially freed, as half of the chunk is free space and half
appears to be in use. The freespace btree is clearly confused about this.
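
The mapping from inode number to buffer can be checked with a little
arithmetic. This is only a sketch under assumed geometry that the thread
does not state: bno counted in 512-byte units, 256-byte inodes, and both
inodes living in AG 0 so that the inode number is a plain index from the
start of the AG. inodes_in_buffer is a hypothetical helper.

```python
SECTOR = 512       # assumed unit of "bno" in the xfs_repair output
INODE_SIZE = 256   # assumed on-disk inode size

def inodes_in_buffer(bno, length):
    """Return the range of AG-0 inode numbers stored in a buffer."""
    first = bno * SECTOR // INODE_SIZE
    return range(first, first + length // INODE_SIZE)

# (bno, bp len, key len, disconnected inode) taken from the repair output
for bno, bp_len, key_len, ino in ((208, 16384, 8192, 475),
                                  (720, 16384, 8192, 1457)):
    chunk = inodes_in_buffer(bno, bp_len)    # the cached 16384-byte buffer
    half = inodes_in_buffer(bno, key_len)    # the 8192-byte lookup
    print(f"bno {bno}: buffer holds inodes {chunk.start}-{chunk.stop - 1}, "
          f"lookup covers {half.start}-{half.stop - 1}, "
          f"inode {ino} in buffer: {ino in chunk}")
```

Under those assumptions the 16384-byte buffers span 64 inodes each and the
8192-byte lookups span only the first 32, which is where the length
mismatch in the "Badness in key lookup" messages comes from.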

Along with the inodes being removed from the directory structure and
the link counts being zero, this really does indicate that something
went wrong with an inode cluster freeing transaction at some point.

I can't see how normal execution would do this, so it leads me to
think that transaction recovery might be involved. It smells like
partial transaction recovery failures, so my next question is this:
what is your hardware, have you had any power loss events, and are
you using barriers?

Cheers,

Dave.
-- 
Dave Chinner
david@xxxxxxxxxxxxx
