On Tue, Aug 17, 2010 at 08:05:35PM +0300, Arto Jantunen wrote:
> Dave Chinner <david@xxxxxxxxxxxxx> writes:
> >> I had a kernel BUG yesterday when running xfs_fsr on my Debian Unstable
> >> laptop. The kernel is upstream 2.6.35.1. I'm attaching the backtrace
> >> below. I haven't tried reproducing the problem yet and don't know if it is
> >> reproducible. I can try that, and test patches etc. if it is useful. Let me
> >> know if there is any other information I can provide to help with debugging.
> >
> > It's not obvious what has gone wrong at all - I haven't seen
> > anything like this in all my recent testing, so it's something new.
> > The first oops implies the inode has not been joined to the
> > transaction, but from code inspection I cannot see how that can
> > happen.
>
> I tried to reproduce the problem, and this time xfs_fsr finished without
> reporting errors, but the kernel output the following two lines (one of which
> is essentially empty):
>
> [ 6372.878945] Filesystem "sda4": Access to block zero in inode 67203861
> start_block: 0 start_off: 0 blkcnt: 0 extent-state: 0 lastx: 2
> [ 6372.878950]

That's a corrupt extent record - it's all zeros, and judging by the
fact that it's only got 2 extents, it's probably inline in the inode
(i.e. the inode fork has been zeroed).

> I decided to boot from a usb stick and try xfs_repair -n, I have attached the
> output of that. There were errors reported. Is this simply a case of random
> (possibly hardware related) fs corruption, or were the errors actually caused
> by the xfs_fsr run that crashed the system? Is there a way to tell from this
> data, is there anything else I can provide?
....
> Phase 1 - find and verify superblock...
> Phase 2 - using internal log
>         - scan filesystem freespace and inode maps...
>         - found root inode chunk
> Phase 3 - for each AG...
>         - scan (but don't clear) agi unlinked lists...
> error following ag 0 unlinked list
> error following ag 2 unlinked list
> error following ag 3 unlinked list

Ok, so a corrupt set of inode unlinked lists.

>         - process known inodes and perform inode discovery...
>         - agno = 0
> b766fb90: Badness in key lookup (length)
> bp=(bno 208, len 16384 bytes) key=(bno 208, len 8192 bytes)
> b766fb90: Badness in key lookup (length)
> bp=(bno 720, len 16384 bytes) key=(bno 720, len 8192 bytes)
[snip]
> Phase 6 - check inode connectivity...
>         - traversing filesystem ...
>         - traversal finished ...
>         - moving disconnected inodes to lost+found ...
> disconnected inode 475, would move to lost+found
> disconnected inode 1457, would move to lost+found
[snip]
> Phase 7 - verify link counts...
> would have reset inode 475 nlinks from 0 to 1
> would have reset inode 1457 nlinks from 0 to 1

Ok, so inode #475 is in the inode chunk at block 208, likewise inode
#1457 is in the chunk at bno 720. This all implies that at some point
there's been a problem with the second phase of the unlink procedure
and freeing the inode cluster.

It looks like the inode cluster has been partially freed (going by
the "Badness in key lookup" errors), as half of the chunk is free
space and half appears to be in use. The freespace btree is clearly
confused about this. Along with the inodes being removed from the
directory structure and the link counts being zero, this really does
indicate that something went wrong with an inode cluster freeing
transaction at some point.

I can't see how normal execution would do this, so it leads me to
think that transaction recovery might be involved. It smells like
partial transaction recovery failures, so my next question is this:
what is your hardware, have you had any power loss events and are
you using barriers?

Cheers,

Dave.
-- 
Dave Chinner
david@xxxxxxxxxxxxx
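
The mapping between the disconnected inodes and the buffers named in the "Badness in key lookup" lines can be sanity-checked with a little arithmetic. The following is only a rough sketch, not code from xfsprogs. It assumes 256-byte on-disk inodes and that "bno" counts 512-byte basic blocks, neither of which is stated in the thread, and it relies on both inodes being in AG 0, where an inode number equals its byte offset divided by the inode size. The function name inodes_in_buffer is made up for illustration.

```python
# Rough check of which AG 0 inode numbers a buffer covers.
# ASSUMPTIONS (not stated in the thread): "bno" is in 512-byte basic
# blocks and on-disk inodes are 256 bytes. Only valid for AG 0.
BASIC_BLOCK = 512   # bytes per basic block (assumed)
INODE_SIZE = 256    # bytes per on-disk inode (assumed)


def inodes_in_buffer(bno, length):
    """Return the range of AG 0 inode numbers held in a buffer.

    bno    -- buffer start address in basic blocks
    length -- buffer length in bytes
    """
    first = bno * BASIC_BLOCK // INODE_SIZE
    count = length // INODE_SIZE
    return range(first, first + count)


if __name__ == "__main__":
    # (inode, bno) pairs taken from the xfs_repair -n output quoted above.
    for ino, bno in ((475, 208), (1457, 720)):
        chunk = inodes_in_buffer(bno, 16384)
        print(ino, bno, chunk.start, chunk.stop - 1, ino in chunk)
```

Under those assumptions the 16384-byte buffer at bno 208 covers inodes 416 to 479 and the one at bno 720 covers inodes 1440 to 1503, which is consistent with inodes 475 and 1457 living in those two chunks.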