Hello all,

I'm running into an issue with ceph OSDs crashing over the last 4 days. I'm running Jewel (10.2.1) on CentOS 7.2.1511.

A little setup information:
26 hosts
2x 400GB Intel DC P3700 SSDs
12x 6TB spinning disks
4x 4TB spinning disks

The SSDs are used both for journals and as OSDs (for the cephfs metadata pool). We had been running Ceph in this configuration with some success for the past 8-10 months (upgrading from Hammer to Infernalis to Jewel along the way).

Up through Friday, we were healthy. On Saturday, the OSDs on the SSDs started flapping and then finally dying off, hitting their suicide timeout due to missed heartbeats. At the time we were running Infernalis, getting ready to upgrade to Jewel. I spent the weekend and Monday attempting to stabilize those OSDs, unfortunately without success.

As part of the stabilization attempts, I checked iostat -x; the SSDs were seeing ~1000 IOPS each. I checked wear levels and overall SMART health of the SSDs; everything looked normal. I checked to make sure the time was in sync between all hosts. I also tried to move the metadata pool to the spinning disks (to remove some dependence on the SSDs, just in case). The suicide timeout issues followed the pool migration: the spinning disks started timing out. This was at a time when *all* of the client IOPS to the ceph cluster were in the low 100s, as reported by ceph -s. I was restarting failed OSDs as fast as they were dying, and I couldn't keep up.

I checked the switches and NICs for errors and drops. No change in their frequency; we're talking an error every 20-25 minutes. I would also expect network issues to affect other OSDs (and pools) in the cluster.

On Tuesday, my coworker and I tried together to stabilize the cluster. We finally went into emergency maintenance mode, as we could not get the metadata pool healthy. We stopped the MDS and tried again to let things stabilize, with no client IO to the pool. Again, more suicide timeouts. Then we rebooted the ceph nodes, figuring there *might* be something stuck in a hardware IO queue or cache somewhere. Again, more crashes when the machines came back up.

We figured at this point there was nothing to lose by performing the upgrade to Jewel, and, who knows, maybe we were hitting a bug that had since been fixed. Reboots were involved again (kernel updates, too). More crashes. I finally decided there *could* be an unlikely chance that jumbo frames had suddenly become an issue (after years of using them with these switches). I turned the MTUs on the ceph nodes down to the standard 1500. More crashes. We decided to try and let things settle out overnight, with no IO.

That brings us to today: we have 51 Intel P3700 SSDs driving this pool, and 26 of them have now crashed due to the suicide timeout. I've tried starting them one at a time; they're still dying off with suicide timeouts.

I've gathered the logs I could think of:
A crashing OSD: http://people.cs.ksu.edu/~mozes/osd.16.log
CRUSH Tree: http://people.cs.ksu.edu/~mozes/crushtree.txt
OSD Tree: http://people.cs.ksu.edu/~mozes/osdtree.txt
Pool Definitions: http://people.cs.ksu.edu/~mozes/pools.txt

At the moment, we're dead in the water. I would appreciate any pointers toward getting this fixed.

--
Adam Tygart
Beocat Sysadmin
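
P.S. For completeness, the checks and changes described above were roughly along these lines. Device names, the interface name, the pool name, and the ruleset number below are illustrative placeholders, not necessarily our exact values:

    # Per-device utilization/IOPS while the OSDs were flapping
    iostat -x 1

    # SMART health / wear level (NVMe drives need a recent smartmontools or the vendor tool)
    smartctl -a /dev/nvme0

    # Clock sync across hosts
    ntpq -p

    # Point the metadata pool at a CRUSH ruleset that targets the spinners
    ceph osd pool set cephfs_metadata crush_ruleset 1

    # Drop jumbo frames back to the standard MTU
    ip link set em1 mtu 1500

    # The heartbeat/suicide timeout settings in effect, from a running OSD's admin socket
    ceph daemon osd.16 config show | grep -E 'suicide|heartbeat_grace'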