Re: storage, libaio, or XFS problem? 3.4.26

On Tue, Aug 26, 2014 at 12:19:43PM -0500, Stan Hoeppner wrote:
> On Tue, 26 Aug 2014 17:53:45 +1000, Dave Chinner <david@xxxxxxxxxxxxx>
> wrote:
> > On Tue, Aug 26, 2014 at 01:18:50AM -0500, Stan Hoeppner wrote:
> >> Had some controller issues but believe we had those ironed out
> >> before this recent breakage.  I had reformatted both 48TB LUNs on
> >> this test box with -f and defaults, and fired up the test app
> >> again.  Throughput was fantastic with no dropped application
> >> buffers for ~30 minutes, and IO times were 1.x ms max, suggesting
> >> all the writes were acked by controller cache.  The app was running
> >> perfectly.  Then it hanged and spit out an internal error msg:
> >> 
> >> Failed to reopen file /mnt/VOL1/sg-04/str-0015/f-0000000005
> >> 
> >> I have many call traces in dmesg, most are XFS, pasted below.  I
> >> found no SCSI I/O errors in dmesg.  Application is submitting writes
> >> via libaio and using O_DIRECT, writing to preallocated files.  I
> >> don't see any errors in the storage controller log and everything
> >> seems to check out there.  Using noop elevator, mpt2sas, no
> >> multipathing.  The application process is hung in d state and
> >> kill -9 won't get rid of it.  I can't unmount the hosed up
> >> filesystem.  Any ideas?
> >> 
> >> # ls -la /mnt/VOL1
> >> ls: cannot access /mnt/VOL1: Input/output error
> >> 
> >> # dd if=/mnt/VOL1 of=/dev/null bs=1M count=16
> >> dd: opening `/mnt/VOL1': Input/output error
> >> 
> >> # dd if=/dev/sdd of=/dev/null bs=1M count=16
> >> 16+0 records in
> >> 16+0 records out
> >> 16777216 bytes (17 MB) copied, 1.10989 s, 15.1 MB/s
> >> 
> >> # ls -la /mnt/
> >> ls: cannot access /mnt/VOL1: Input/output error
> >> total 8
> >> drwxrwxrwt  6 root root  120 Aug 25 17:59 .
> >> drwxr-xr-x 21 root root 4096 Jul 15 09:39 ..
> >> drwxrwxrwt  3 root root   80 Aug 25 16:52 initramfs
> >> drwxr-xr-x  3 root root 4096 Apr 24 04:57 scratch
> >> drwxrwxrwx  5 root root   58 Aug 25 22:06 VOL0
> >> d?????????  ? ?    ?       ?            ? VOL1
> > 
> > That's an inode we failed to stat() - most likely because of the IO
> > error. Has the filesystem shut down? The IO error should have had
> > some kind of output in dmesg associated with it from XFS....
> 
> Sorry, I omitted the XFS logging preceding the first call trace:
> 
> Aug 25 18:02:50 Anguish-ssu-1 kernel: [ 4280.310304] SGI XFS with security attributes, large block/inode numbers, no debug enabled
> Aug 25 18:02:50 Anguish-ssu-1 kernel: [ 4280.311101] XFS (sdc): Mounting Filesystem
> Aug 25 18:02:51 Anguish-ssu-1 kernel: [ 4280.501405] XFS (sdc): Starting recovery (logdev: internal)
> Aug 25 18:02:53 Anguish-ssu-1 kernel: [ 4282.766547] XFS (sdc): Failed to recover EFIs
> Aug 25 18:02:53 Anguish-ssu-1 kernel: [ 4282.766549] XFS (sdc): log mount finish failed

That's indicative of corrupted free space btrees.

> Aug 25 18:04:50 Anguish-ssu-1 kernel: [ 4399.471829] XFS (sdc): Mounting Filesystem
> Aug 25 18:04:50 Anguish-ssu-1 kernel: [ 4399.620805] XFS (sdc): Ending clean mount
> Aug 25 18:04:56 Anguish-ssu-1 kernel: [ 4405.613778] XFS (sdd): Mounting Filesystem
> Aug 25 18:04:56 Anguish-ssu-1 kernel: [ 4405.835412] XFS (sdd): Ending clean mount
> Aug 25 23:05:39 Anguish-ssu-1 kernel: [22409.328839] XFS (sdd): xfs_do_force_shutdown(0x8) called from line 3732 of file fs/xfs/xfs_bmap.c.  Return address = 0xffffffffa01cc9a6

Yup, that's kinda important. That's from xfs_bmap_finish(): freeing an
extent has failed and triggered SHUTDOWN_CORRUPT_INCORE, which means
it found some kind of inconsistency in the free space btrees. So,
likely the same problem that caused EFI recovery to fail on the other
volume.

Are the tests being run on newly made filesystems? If not, have
these filesystems had xfs_repair run on them after a failure?  If
so, what is the error that is fixed? If not, does repairing the
filesystem make the problem go away?
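
Something like the following would do, once the filesystem can be
unmounted (a reboot may be needed given the hung unmount). This is
just a sketch, assuming the affected filesystem is the one on
/dev/sdd - adjust the device and mount point to suit:

# umount /mnt/VOL1
# xfs_repair -n /dev/sdd
# xfs_repair /dev/sdd

The -n run is read-only and only reports what it finds; capture that
output before letting xfs_repair actually modify anything.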

> Aug 25 23:05:49 Anguish-ssu-1 kernel: [22419.307128] XFS (sdd): failed to update timestamps for inode 0x2a000000a4
> Aug 25 23:05:49 Anguish-ssu-1 kernel: [22419.307484] XFS (sdd): failed to update timestamps for inode 0x29000000a4
> Aug 25 23:05:49 Anguish-ssu-1 kernel: [22419.307487] XFS (sdd): failed to update timestamps for inode 0x20000009f
> Aug 25 23:05:49 Anguish-ssu-1 kernel: [22419.307499] XFS (sdd): failed to update timestamps for inode 0x29000000a4
> Aug 25 23:05:49 Anguish-ssu-1 kernel: [22419.307693] XFS (sdd): failed to update timestamps for inode 0x20000009f
> Aug 25 23:05:49 Anguish-ssu-1 kernel: [22419.368106] XFS (sdd): failed to update timestamps for inode 0xf000000a4
> Aug 25 23:05:49 Anguish-ssu-1 kernel: [22419.369785] XFS (sdd): failed to update timestamps for inode 0xf000000a4
> Aug 25 23:05:49 Anguish-ssu-1 kernel: [22419.605835] XFS (sdd): failed to update timestamps for inode 0x2810f413c
> Aug 25 23:05:49 Anguish-ssu-1 kernel: [22419.606169] XFS (sdd): failed to update timestamps for inode 0x60000009f

And that is interesting. Makes me wonder if the inode is getting
unlocked on transaction commit failure, or whether there's some
other path in the shutdown code that is not unlocking the inode
correctly.

> Again, when the app locked up I was assuming we had more controller
> issues, and I expected to find SCSI I/O errors preceding the XFS errors. 
> But there are none.

XFS has had some kind of internal error in extent or freespace
management. The lack of output when the error triggered makes it
impossible to determine what might have gone wrong. Seeing what
xfs_repair -n says about the filesystem would be interesting.
Probably best to use an xfs_repair from the current 3.2.1 release,
because older 3.1.x versions (I think 3.1.10 and prior) didn't
validate the freespace btrees - they just got rebuilt.
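
To check what's already installed on the box, something like:

# xfs_repair -V
xfs_repair version 3.2.1

(the version string shown here is only illustrative - it will reflect
whatever xfsprogs is actually on the machine).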

> These aren't truly lab systems in all respects.  I don't have the
> permission/access to change the kernel config and recompile.  There is a
> "process" for that and a different group owns that process.  If this is
> necessary to troubleshoot this then I'll see if I can push it through.

Good luck, it sounds like you might have a few stories for the
Daily WTF if the "processes" there are so entrenched that it's
impossible to test fixes.

> > I haven't checked if this still works on a 3.4 kernel, but it will
> > change the rwsem implementation to the generic, spinlock based
> > implementation rather than the super-special, highly optimised
> > x86 specific implementation. If that makes the problem go away,
> > then we've got another rwsem bug on our hands. If it doesn't, then I
> > can probably get you the mrlock debug code Lachlan wrote and we can
> > see if XFS is doing something wrong...
> 
> Thanks for the detailed explanation.  Given the new information I've
> provided, what is my next step?

xfs_repair output, turn /proc/sys/fs/xfs/error_level up to 11,
maybe run a test with a CONFIG_XFS_DEBUG=y kernel, or even a
current 3.15/3.16 kernel to see if the problem still exists...
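
For the error level, something along these lines:

# echo 11 > /proc/sys/fs/xfs/error_level

(or the sysctl equivalent, fs.xfs.error_level=11), so that XFS reports
more detail the next time it trips over the corruption.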

But if you can't get new kernels or xfsprogs binaries onto the
machine, then there is very little that we can do.

Cheers,

Dave.
-- 
Dave Chinner
david@xxxxxxxxxxxxx

_______________________________________________
xfs mailing list
xfs@xxxxxxxxxxx
http://oss.sgi.com/mailman/listinfo/xfs



