Re: Crashing OSDs (suicide timeout, following a single pool)

On Wed, Jun 1, 2016 at 9:13 AM, Adam Tygart <mozes@xxxxxxx> wrote:
> Hello all,
>
> I'm running into an issue with ceph osds crashing over the last 4
> days. I'm running Jewel (10.2.1) on CentOS 7.2.1511.
>
> A little setup information:
> 26 hosts, each with:
> 2x 400GB Intel DC P3700 SSDs
> 12x 6TB spinning disks
> 4x 4TB spinning disks.
>
> The SSDs are used both for journals and as OSDs (for the cephfs
> metadata pool).
>
> We were running Ceph with some success in this configuration
> (upgrading ceph from hammer to infernalis to jewel) for the past 8-10
> months.
>
> Up through Friday, we were healthy.
>
> That lasted until Saturday, when the OSDs on the SSDs started
> flapping and then finally dying off, hitting their suicide timeout
> due to missing heartbeats. At the time, we were running Infernalis,
> getting ready to upgrade to Jewel.
>
> I spent the weekend and Monday attempting to stabilize those OSDs,
> unfortunately failing. As part of the stabilization attempts, I
> checked iostat -x; the SSDs were seeing ~1000 IOPS each. I checked
> wear levels and the overall SMART health of the SSDs; everything
> looked normal. I checked to make sure the time was in sync between
> all hosts.
>
> I also tried to move the metadata pool to the spinning disks (to
> remove some dependence on the SSDs, just in case). The suicide timeout
> issues followed the pool migration. The spinning disks started timing
> out. This was at a time when *all* of the client IOPS to the ceph
> cluster were in the low 100s, as reported by ceph -s. I was
> restarting failed OSDs as fast as they were dying and I couldn't keep
> up. I checked the switches and NICs for errors and drops; there was
> no change in their frequency. We're talking an error every 20-25
> minutes.
> I would expect network issues to affect other OSDs (and pools) in the
> system, too.
>
> On Tuesday, I got together with my coworker and we tried to
> stabilize the cluster together. We finally went into emergency
> maintenance mode, as we could not get the metadata pool healthy. We
> stopped the MDS and tried again to let things stabilize, with no
> client IO to the pool. Again, more suicide timeouts.
>
> Then, we rebooted the ceph nodes, figuring there *might* be something
> stuck in a hardware IO queue or cache somewhere. Again more crashes
> when the machines came back up.
>
> We figured at this point, there was nothing to lose by performing the
> update to Jewel, and, who knows, maybe we were hitting a bug that had
> been fixed. Reboots were involved again (kernel updates, too).
>
> More crashes.
>
> I finally decided that there *could* be an unlikely chance that jumbo
> frames might suddenly be an issue (after years of using them with
> these switches). I turned down the MTUs on the ceph nodes to the
> standard 1500.
>
> More crashes.
>
> We decided to try and let things settle out overnight, with no IO.
> That brings us to today:
>
> We have 51 Intel P3700 SSDs driving this pool, and now 26 of them have
> crashed due to the suicide timeout. I've tried starting them one at a
> time, they're still dying off with suicide timeouts.
>
> I've gathered the logs I could think of:
> A crashing OSD: http://people.cs.ksu.edu/~mozes/osd.16.log
> CRUSH Tree: http://people.cs.ksu.edu/~mozes/crushtree.txt
> OSD Tree: http://people.cs.ksu.edu/~mozes/osdtree.txt
> Pool Definitions: http://people.cs.ksu.edu/~mozes/pools.txt
>
> At the moment, we're dead in the water. I would appreciate any
> pointers to getting this fixed.
>
> --
> Adam Tygart
> Beocat Sysadmin

I spent some time trying to figure out what is happening from your
osd.16 log, but I've run out of time I can spend on this right now.
Here is what I think is happening.

If you grep for heartbeat_map in the log, you see that there are
heartbeat timeouts starting with failures for 'OSD::recovery_tp thread
0x7f34c5e41700', which ultimately lead to the OSD committing suicide
after a period of time, specifically the
osd_recovery_thread_suicide_timeout, which defaults to 300 seconds.
Searching for the thread id 7f34c5e41700 in the log shows attempts to
do recovery on PG 32.10c. This is the last log line that seems to
indicate any kind of progress on that recovery:

2016-06-01 09:26:54.683922 7f34c5e41700  7 osd.16 pg_epoch: 497663
pg[32.10c( v 477010'1607561 (459778'1604561,477010'1607561]
local-les=493771 n=3917 ec=44014 les/c/f 493771/486667/0
497332/497662/497662) [214,143,448]/[16] r=0 lpr=497662
pi=483321-497661/190 rops=1 bft=143,214,448 crt=0'0 lcod 0'0 mlcod 0'0
undersized+degraded+remapped+backfilling+peered] send_push_op
32:30966cd6:::100042c76a0.00000000:head v 250315'1040233 size 0
recovery_info: ObjectRecoveryInfo(32:30966cd6:::100042c76a0.00000000:head@250315'1040233,
size: 0, copy_subset: [], clone_subset: {})
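
For reference, the greps I used to pull these out looked roughly like
this (assuming the log is saved locally as osd.16.log; adjust the path
as needed):

  # timed-out threads and the eventual suicide
  grep heartbeat_map osd.16.log

  # everything the stuck recovery_tp thread logged
  grep 7f34c5e41700 osd.16.log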

The other OSDs in this PG are 143, 214, and 448. Looking for osd.143
in the log shows that operations for another PG seem to be completing
until around the 09:26:55.537902 mark, where osd.16 says that it and
osd.143 have different versions of the osd map. After that there is
nothing but osd_pings with osd.143, and osd.143 ultimately responds
to osd_ping requests from osd.16 with replies saying that osd.16 is
down in a newer version of the osd map, such as these lines:

2016-06-01 09:27:47.758414 7f34d205b700  1 -- 10.5.38.1:0/127465 <==
osd.143 10.5.38.16:6872/54267 14 ==== osd_ping(ping_reply e497678
stamp 2016-06-01 09:27:47.756205) v2 ==== 47+0+0 (2585196378 0 0)
0x7f353819fe00 con 0x7f3536239500
2016-06-01 09:27:47.758422 7f34d205b700 10 -- 10.5.38.1:6834/127465
get_connection osd.143 10.5.38.16:6821/54267 existing 0x7f353bb7c800
2016-06-01 09:27:47.759628 7f34d205b700  1 -- 10.5.38.1:0/127465 <==
osd.143 10.5.38.16:6823/54267 15 ==== osd_ping(you_died e497678 stamp
2016-06-01 09:27:47.756205) v2 ==== 47+0+0 (3602993411 0 0)
0x7f353ae38800 con 0x7f3536238c00
2016-06-01 09:27:47.759636 7f34d205b700 10 osd.16 497664
handle_osd_ping osd.143 10.5.38.16:6823/54267 says i am down in 497678
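
If you want to see the whole exchange with that peer, something like
this should pull it out (again assuming the log is saved locally as
osd.16.log; 10.5.38.16 is osd.143's host in the lines above):

  # all ping traffic exchanged with osd.143's host
  grep osd_ping osd.16.log | grep 10.5.38.16

  # how many 'you_died' replies came back in total
  grep -c you_died osd.16.log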

Going through the lines related to the osd_ping requests then shows a
lot of these:

2016-06-01 09:27:25.091417 7f34cf856700 10 osd.16 497664 internal
heartbeat not healthy, dropping ping request

From looking through things, it appears something is happening to an
internal heartbeat which is causing osd.16 to start ignoring osd_ping
requests from the other OSDs in the cluster. The other OSDs then no
longer get ping responses and assume the OSD is down. The osd map gets
updated to mark osd.16 down, resulting in the ping replies saying
'you_died'.
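
One thing that *might* buy you time while the root cause is tracked
down (a stopgap, not a fix, and assuming the Jewel option name
osd_recovery_thread_suicide_timeout) is raising that timeout so the
OSDs stop killing themselves mid-recovery, e.g. in ceph.conf:

  [osd]
  # default is 300 seconds; only a temporary workaround
  osd recovery thread suicide timeout = 3000

You can check what a running OSD currently has with something like
'ceph daemon osd.16 config show | grep suicide' on that OSD's host.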

The things I'm not sure about are what on osd.16 is unhealthy such
that it starts dropping ping requests, and why recovery of 32.10c just
stopped. It also looks like osd.16 never updates its osd map: it keeps
using version 497664 until it commits suicide, even though the
'you_died' replies say there are newer versions of the map.
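
It might be worth comparing the map epoch osd.16 thinks it has against
the cluster's current epoch while it is up, assuming you can reach the
admin socket on that host:

  # cluster's current osdmap epoch
  ceph osd dump | head -1

  # this osd's own view (state, oldest_map, newest_map)
  ceph daemon osd.16 status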

This probably wasn't that useful, but those are the things that stood
out to me. Sorry I'm not of much more use at the moment.

Mike
_______________________________________________
ceph-users mailing list
ceph-users@xxxxxxxxxxxxxx
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


