On Tue, 26 Aug 2014 17:53:45 +1000, Dave Chinner <david@xxxxxxxxxxxxx> wrote:

> On Tue, Aug 26, 2014 at 01:18:50AM -0500, Stan Hoeppner wrote:
>> Had some controller issues but I believe we had those ironed out
>> before this recent breakage. I had reformatted both 48TB LUNs on this
>> test box with -f and defaults, and fired up the test app again.
>> Throughput was fantastic with no dropped application buffers for ~30
>> minutes, and IO times were 1.x ms max, suggesting all the writes were
>> acked by controller cache. The app was running perfectly. Then it
>> hung and spit out an internal error msg:
>>
>> Failed to reopen file /mnt/VOL1/sg-04/str-0015/f-0000000005
>>
>> I have many call traces in dmesg, most of them XFS, pasted below. I
>> found no SCSI I/O errors in dmesg. The application is submitting
>> writes via libaio and using O_DIRECT, writing to preallocated files.
>> I don't see any errors in the storage controller log and everything
>> seems to check out there. Using the noop elevator, mpt2sas, no
>> multipathing. The application process is hung in D state and kill -9
>> won't get rid of it. I can't unmount the hosed up filesystem. Any
>> ideas?
>>
>> # ls -la /mnt/VOL1
>> ls: cannot access /mnt/VOL1: Input/output error
>>
>> # dd if=/mnt/VOL1 of=/dev/null bs=1M count=16
>> dd: opening `/mnt/VOL1': Input/output error
>>
>> # dd if=/dev/sdd of=/dev/null bs=1M count=16
>> 16+0 records in
>> 16+0 records out
>> 16777216 bytes (17 MB) copied, 1.10989 s, 15.1 MB/s
>>
>> # ls -la /mnt/
>> ls: cannot access /mnt/VOL1: Input/output error
>> total 8
>> drwxrwxrwt  6 root root  120 Aug 25 17:59 .
>> drwxr-xr-x 21 root root 4096 Jul 15 09:39 ..
>> drwxrwxrwt  3 root root   80 Aug 25 16:52 initramfs
>> drwxr-xr-x  3 root root 4096 Apr 24 04:57 scratch
>> drwxrwxrwx  5 root root   58 Aug 25 22:06 VOL0
>> d?????????  ? ?    ?       ?            ? VOL1
>
> That's an inode we failed to stat() - most likely because of the IO
> error. Has the filesystem shut down? The IO error should have had
> some kind of output in dmesg associated with it from XFS....

Sorry, I omitted the XFS logging preceding the first call trace:

Aug 25 18:02:50 Anguish-ssu-1 kernel: [ 4280.310304] SGI XFS with security attributes, large block/inode numbers, no debug enabled
Aug 25 18:02:50 Anguish-ssu-1 kernel: [ 4280.311101] XFS (sdc): Mounting Filesystem
Aug 25 18:02:51 Anguish-ssu-1 kernel: [ 4280.501405] XFS (sdc): Starting recovery (logdev: internal)
Aug 25 18:02:53 Anguish-ssu-1 kernel: [ 4282.766547] XFS (sdc): Failed to recover EFIs
Aug 25 18:02:53 Anguish-ssu-1 kernel: [ 4282.766549] XFS (sdc): log mount finish failed
Aug 25 18:04:50 Anguish-ssu-1 kernel: [ 4399.471829] XFS (sdc): Mounting Filesystem
Aug 25 18:04:50 Anguish-ssu-1 kernel: [ 4399.620805] XFS (sdc): Ending clean mount
Aug 25 18:04:56 Anguish-ssu-1 kernel: [ 4405.613778] XFS (sdd): Mounting Filesystem
Aug 25 18:04:56 Anguish-ssu-1 kernel: [ 4405.835412] XFS (sdd): Ending clean mount
Aug 25 23:05:39 Anguish-ssu-1 kernel: [22409.328839] XFS (sdd): xfs_do_force_shutdown(0x8) called from line 3732 of file fs/xfs/xfs_bmap.c.
Return address = 0xffffffffa01cc9a6
Aug 25 23:05:49 Anguish-ssu-1 kernel: [22419.307128] XFS (sdd): failed to update timestamps for inode 0x2a000000a4
Aug 25 23:05:49 Anguish-ssu-1 kernel: [22419.307484] XFS (sdd): failed to update timestamps for inode 0x29000000a4
Aug 25 23:05:49 Anguish-ssu-1 kernel: [22419.307487] XFS (sdd): failed to update timestamps for inode 0x20000009f
Aug 25 23:05:49 Anguish-ssu-1 kernel: [22419.307499] XFS (sdd): failed to update timestamps for inode 0x29000000a4
Aug 25 23:05:49 Anguish-ssu-1 kernel: [22419.307693] XFS (sdd): failed to update timestamps for inode 0x20000009f
Aug 25 23:05:49 Anguish-ssu-1 kernel: [22419.368106] XFS (sdd): failed to update timestamps for inode 0xf000000a4
Aug 25 23:05:49 Anguish-ssu-1 kernel: [22419.369785] XFS (sdd): failed to update timestamps for inode 0xf000000a4
Aug 25 23:05:49 Anguish-ssu-1 kernel: [22419.605835] XFS (sdd): failed to update timestamps for inode 0x2810f413c
Aug 25 23:05:49 Anguish-ssu-1 kernel: [22419.606169] XFS (sdd): failed to update timestamps for inode 0x60000009f

Again, when the app locked up I assumed we had more controller issues,
and I expected to find SCSI I/O errors preceding the XFS errors. But
there are none. The controllers check out, no errors are logged on
them, diags pass, etc.

When NOT using AIO for submission, just parallel O_DIRECT, the big rig
sustains ~2.5GB/s of write throughput across 14 arrays at the
application level, with ~4200 write threads submitting IOs
concurrently, with no problems and no errors. Roughly a third of these
write threads are slow writers, each submitting a 32KB IO to a file
every few seconds. Fast and medium rate streams fill the cache, so
these slow writer streams generate RMWs. This is why the throughput is
relatively low for 168 effective spindles, which are capable of
streaming writes in the ~16GB/s neighborhood.

My little test rig is submitting from 602 threads in parallel, 301
threads to each of two filesystems, each filesystem on a 12+1 RAID5
LUN. Both boxes have the same problem described here, but only when
submitting with AIO. I don't have logs for the big box, but I did see
some of the same collateral damage on it, specifically the garbage when
doing an 'ls -la' on the mount points. That is obviously due to the XFS
forced shutdown.
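For reference, each writer's submission path boils down to roughly the
sketch below. This is only an illustration of the libaio + O_DIRECT
pattern, not our actual code: the 32KB IO size and the file path come
from the details above, while the queue depth, buffer alignment, open
flags, and single in-flight IO are made-up placeholders, and I'm
assuming one io_context_t per writer thread, which may not match the
real app exactly.

#define _GNU_SOURCE             /* for O_DIRECT */
#include <fcntl.h>
#include <libaio.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>

#define QDEPTH  32              /* per-context queue depth -- assumed */
#define IOSIZE  (32 * 1024)     /* 32KB writes, like the slow streams */

int main(void)
{
        io_context_t ctx = 0;
        struct iocb cb;
        struct iocb *cbs[1] = { &cb };
        struct io_event ev;
        void *buf = NULL;
        int fd;

        /* Preallocated target file, opened for direct IO. */
        fd = open("/mnt/VOL1/sg-04/str-0015/f-0000000005",
                  O_WRONLY | O_DIRECT);
        if (fd < 0)
                return 1;

        /* O_DIRECT needs aligned buffers; 4096 covers typical sectors. */
        if (posix_memalign(&buf, 4096, IOSIZE))
                return 1;
        memset(buf, 0xab, IOSIZE);

        if (io_setup(QDEPTH, &ctx))
                return 1;

        /* Queue a single async write at offset 0; a real writer would
         * keep many IOs in flight across offsets and files. */
        io_prep_pwrite(&cb, fd, buf, IOSIZE, 0);
        if (io_submit(ctx, 1, cbs) != 1)
                return 1;

        /* Reap the completion. */
        if (io_getevents(ctx, 1, 1, &ev, NULL) != 1)
                return 1;

        io_destroy(ctx);
        close(fd);
        free(buf);
        return 0;
}

(Builds against libaio with -laio.)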
>> [22635.102013] INFO: task kworker/7:0:45 blocked for more than 120 seconds.
>> [22635.102016] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
>> [22635.102018] kworker/7:0     D ffff8840666c0b08     0    45      2 0x00000000
>> [22635.102021]  ffff8840666e7bd0 0000000000000046 ffff883f7c02d000 ffff8840666f5180
>> [22635.102026]  ffff8840666e7b80 0000000000000206 00000000000122c0 00000000000122c0
>> [22635.102030]  ffff8840666e7fd8 ffff8840666e6000 00000000000122c0 ffff8840666e6000
>> [22635.102033] Call Trace:
>> [22635.102041]  [<ffffffff814f5fd7>] schedule+0x64/0x66
>> [22635.102044]  [<ffffffff814f66ec>] rwsem_down_failed_common+0xdb/0x10d
>> [22635.102047]  [<ffffffff814f6731>] rwsem_down_write_failed+0x13/0x15
>> [22635.102051]  [<ffffffff81261913>] call_rwsem_down_write_failed+0x13/0x20
>> [22635.102055]  [<ffffffff814f6a92>] ? _raw_spin_unlock_irqrestore+0x30/0x3d
>> [22635.102058]  [<ffffffff814f5458>] ? down_write+0x25/0x27
>> [22635.102070]  [<ffffffffa01b35e4>] xfs_ilock+0x4f/0xb4 [xfs]
>> [22635.102077]  [<ffffffffa01b5ab0>] xfs_iomap_write_unwritten+0x1b3/0x32e [xfs]
>> [22635.102080]  [<ffffffff814f6a92>] ? _raw_spin_unlock_irqrestore+0x30/0x3d
>> [22635.102084]  [<ffffffffa01ab3d2>] ? xfs_setfilesize+0x128/0x128 [xfs]
>> [22635.102088]  [<ffffffff810bc602>] ? mempool_free+0x73/0x78
>> [22635.102093]  [<ffffffffa01ab3d2>] ? xfs_setfilesize+0x128/0x128 [xfs]
>> [22635.102098]  [<ffffffffa01ab45b>] xfs_end_io+0x89/0xb4 [xfs]
>> [22635.102102]  [<ffffffff81047321>] process_one_work+0x204/0x327
>> [22635.102105]  [<ffffffff8104757f>] worker_thread+0x13b/0x25a
>> [22635.102108]  [<ffffffff81047444>] ? process_one_work+0x327/0x327
>> [22635.102111]  [<ffffffff8104af6f>] kthread+0x89/0x91
>> [22635.102115]  [<ffffffff814fdbd4>] kernel_thread_helper+0x4/0x10
>> [22635.102118]  [<ffffffff8104aee6>] ? __init_kthread_worker+0x3c/0x3c
>> [22635.102120]  [<ffffffff814fdbd0>] ? gs_change+0xb/0xb
>
> That's stuck on an inode lock. That reminds me of a problem we had in
> RHEL 6 with highly concurrent direct IO. The rwsems were buggy, and
> XFS was tripping over the bug and hanging just like this. If the
> filesystem is not shutting down and causing that IO error due to a
> shutdown state, then it's entirely possible we've got another rwsem
> issue. Now I've just got to go find the RHEL 6 problem again. Ah:

Definitely highly concurrent direct IO here. The filesystem *is*
shutting down. Again, this only happens when submitting via AIO.

> 91af708 rwsem: Test for no active locks in __rwsem_do_wake undo code
>
> But that was fixed in 2.6.34 (RHEL 6 was based on 2.6.32), so I
> doubt that is your problem. However, it smells almost exactly the
> same - it took about an hour of highly concurrent direct IO to SSDs
> to trigger, and Lachlan ended up finding the right incantation of
> lock debug code inside the XFS mrlock implementation to prove that
> it was an rwsem bug and not an XFS locking issue.

FWIW, I'm seeing it after ~30 minutes on the small box. I don't know
how quickly it occurred on the big rig.

> Are you able to take crash dumps from this machine so you can dig
> around inside the XFS inode and rwsem states when the system locks
> up like this?

The kernel isn't completely crashing. I am able to CTRL+Z out of the
hung application back to a prompt, and to open additional SSH shells.
But when these errors occur, something in the kernel does become goofy:
the load average spikes to the stratosphere even though there is no CPU
burn nor IO wait. Once the app crashes, the box is 100% idle.

Do keep in mind as we troubleshoot this that the boxen are in a machine
room 520 miles from my location, so I can't push magic key sequences on
the console. I don't have KVM access, though I might be able to get it
if needed; I haven't asked.

> One thing in the RH bug that was suggested as a simple test to
> determine if there's a problem with the rwsems is to edit the kernel
> config in arch/x86/Kconfig:
>
> Change:
>
> config RWSEM_GENERIC_SPINLOCK
>         def_bool !X86_XADD
>
> config RWSEM_XCHGADD_ALGORITHM
>         def_bool X86_XADD
>
> to:
>
> config RWSEM_GENERIC_SPINLOCK
>         def_bool y
>
> config RWSEM_XCHGADD_ALGORITHM
>         def_bool n

These aren't truly lab systems in all respects. I don't have the
permission/access to change the kernel config and recompile; there is a
"process" for that, and a different group owns that process. If this is
necessary to troubleshoot the problem, I'll see if I can push it
through.

> I haven't checked if this still works on a 3.4 kernel, but it will
> change the rwsem implementation to the generic, spinlock based
> implementation rather than the super-special, highly optimised
> x86 specific implementation. If that makes the problem go away,
> then we've got another rwsem bug on our hands.
> If it doesn't, then I can probably get you the mrlock debug code
> Lachlan wrote and we can see if XFS is doing something wrong...

Thanks for the detailed explanation. Given the new information I've
provided, what is my next step?

Thanks,
Stan

_______________________________________________
xfs mailing list
xfs@xxxxxxxxxxx
http://oss.sgi.com/mailman/listinfo/xfs