Re: Crashing OSDs (suicide timeout, following a single pool)

"Brandon Morris, PMP" <brandon.morris.pmp@xxxxxxxxx> · Wed, 1 Jun 2016 11:00:47 -0600

Adam,

     We ran into similar issues when we got too many objects in a bucket (around 300 million).  The .rgw.buckets.index pool became unable to complete backfill operations.  The only way we were able to get past it was to export the offending placement group with ceph-objectstore-tool and re-import it into another OSD to complete the backfill.  For us, the export operation seemed to hang and took 8 hours to finish, so if you do choose to go down this route, be patient.

From your logs, it appears that pg 32.10c is the offending PG on osd.16.  If you are running into the same issue we did, then when you go to export it there will be one file on which the export hangs.  For whatever reason the leveldb metadata for that file hangs, and that causes the backfill operation to suicide the OSD.
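
The export and re-import described above could be sketched roughly as follows. This is a minimal Python sketch that only builds the ceph-objectstore-tool command lines and does not run them. The data and journal paths assume the default FileStore layout under /var/lib/ceph/osd, and the export file path and the target OSD id 99 are made up for illustration. The OSD being operated on has to be stopped before the tool is run against it.

```python
import shlex


def objectstore_tool_cmd(op, osd_id, export_file, pgid=None,
                         osd_root="/var/lib/ceph/osd"):
    """Build (but do not run) a ceph-objectstore-tool command line.

    Assumes a stopped FileStore OSD whose data lives in
    <osd_root>/ceph-<osd_id> with its journal at <data path>/journal.
    """
    if op not in ("export", "import"):
        raise ValueError("op must be 'export' or 'import'")
    data_path = f"{osd_root}/ceph-{osd_id}"
    cmd = [
        "ceph-objectstore-tool",
        "--data-path", data_path,
        "--journal-path", f"{data_path}/journal",
        "--op", op,
        "--file", export_file,
    ]
    if op == "export":
        # The export needs to be told which PG to dump; the import
        # reads the PG id from the export file itself.
        if pgid is None:
            raise ValueError("export needs a pgid")
        cmd += ["--pgid", pgid]
    return cmd


if __name__ == "__main__":
    # Hypothetical scratch path and hypothetical target OSD id (99).
    dump = "/mnt/scratch/pg32.10c.export"
    print(shlex.join(objectstore_tool_cmd("export", 16, dump, pgid="32.10c")))
    print(shlex.join(objectstore_tool_cmd("import", 99, dump)))
```

Printing the commands first, rather than running them directly, leaves room to double-check the paths and the PG id before touching a stopped OSD.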

If anyone from the community has an explanation for why this happens I would love to know.  We have run into this twice now on the Infernalis codebase.  We are in the process of rebuilding our cluster on Jewel, so I can't say whether or not it happens there as well.

---------
Here are the pertinent lines from your log.

2016-06-01 09:26:54.683922 7f34c5e41700  7 osd.16 pg_epoch: 497663 pg[32.10c( v 477010'1607561 (459778'1604561,477010'1607561] local-les=493771 n=3917 ec=44014 les/c/f 493771/486667/0 497332/497662/497662) [214,143,448]/[16] r=0 lpr=497662 pi=483321-497661/190 rops=1 bft=143,214,448 crt=0'0 lcod 0'0 mlcod 0'0 undersized+degraded+remapped+backfilling+peered] send_push_op 32:30966cd6:::100042c76a0.00000000:head v 250315'1040233 size 0 recovery_info: ObjectRecoveryInfo(32:30966cd6:::100042c76a0.00000000:head@250315'1040233, size: 0, copy_subset: [], clone_subset: {})
[...]
2016-06-01 09:27:25.091411 7f34cf856700  1 heartbeat_map is_healthy 'OSD::recovery_tp thread 0x7f34c5e41700' had timed out after 30
[...]
2016-06-01 09:31:57.201645 7f3510669700  1 heartbeat_map is_healthy 'OSD::recovery_tp thread 0x7f34c5e41700' had timed out after 30
2016-06-01 09:31:57.201671 7f3510669700  1 heartbeat_map is_healthy 'OSD::recovery_tp thread 0x7f34c5e41700' had suicide timed out after 300
common/HeartbeatMap.cc: In function 'bool ceph::HeartbeatMap::_check(const ceph::heartbeat_handle_d*, const char*, time_t)' thread 7f3510669700 time 2016-06-01 09:31:57.201687
common/HeartbeatMap.cc: 86: FAILED assert(0 == "hit suicide timeout")
 ceph version 10.2.1 (3a66dd4f30852819c1bdaa8ec23c795d4ad77269)
 1: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x85) [0x7f35167bb5b5]
 2: (ceph::HeartbeatMap::_check(ceph::heartbeat_handle_d const*, char const*, long)+0x2e1) [0x7f35166f7bf1]
 3: (ceph::HeartbeatMap::is_healthy()+0xde) [0x7f35166f844e]
 4: (ceph::HeartbeatMap::check_touch_file()+0x2c) [0x7f35166f8c2c]
 5: (CephContextServiceThread::entry()+0x15b) [0x7f35167d331b]
 6: (()+0x7dc5) [0x7f35146ecdc5]
 7: (clone()+0x6d) [0x7f3512d77ced]
 NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to interpret this.
2016-06-01 09:31:57.205990 7f3510669700 -1 common/HeartbeatMap.cc: In function 'bool ceph::HeartbeatMap::_check(const ceph::heartbeat_handle_d*, const char*, time_t)' thread 7f3510669700 time 2016-06-01 09:31:57.201687
common/HeartbeatMap.cc: 86: FAILED assert(0 == "hit suicide timeout")
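
The way the PG was picked out of the log above (match the thread id in the heartbeat_map message against the thread id column of earlier lines) can be sketched as a small script. This is only an illustration under assumptions: the regular expressions are written against the line shapes shown in this excerpt, and the function name find_stuck_pgs is made up.

```python
import re

# 'OSD::recovery_tp thread 0x7f34c5e41700' had [suicide ]timed out after N
HEARTBEAT = re.compile(
    r"'(?P<pool>[\w:]+) thread 0x(?P<tid>[0-9a-f]+)' "
    r"had (?P<kind>suicide timed out|timed out) after (?P<secs>\d+)"
)
# pg[32.10c( ... as printed at the start of a PG's log prefix
PG = re.compile(r"pg\[(?P<pgid>\d+\.[0-9a-f]+)\(")


def find_stuck_pgs(lines):
    """Map each thread that hit a heartbeat timeout to the last PG it logged.

    Returns {thread_id: {"pgid": str or None, "suicide": bool}}.
    """
    last_pg = {}  # thread id -> last pgid seen on a line written by that thread
    stuck = {}
    for line in lines:
        fields = line.split(None, 3)  # date, time, thread id, rest
        if len(fields) >= 3:
            m = PG.search(line)
            if m:
                last_pg[fields[2]] = m.group("pgid")
        hb = HEARTBEAT.search(line)
        if hb:
            tid = hb.group("tid")
            entry = stuck.setdefault(
                tid, {"pgid": last_pg.get(tid), "suicide": False}
            )
            if hb.group("kind").startswith("suicide"):
                entry["suicide"] = True
    return stuck
```

Passing an open log file (for example the osd.16.log linked in the original message) to find_stuck_pgs should report thread 7f34c5e41700 together with pg 32.10c, assuming the rest of the log has the same shape as the lines quoted here.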

Brandon

On Wed, Jun 1, 2016 at 9:13 AM, Adam Tygart <mozes@xxxxxxx> wrote:
Hello all,

I'm running into an issue with ceph osds crashing over the last 4
days. I'm running Jewel (10.2.1) on CentOS 7.2.1511.

A little setup information:
26 hosts
2x 400GB Intel DC P3700 SSDs
12x6TB spinning disks
4x4TB spinning disks.

The SSDs are used for both journals and as an OSD (for the cephfs
metadata pool).

We were running Ceph with some success in this configuration
(upgrading ceph from hammer to infernalis to jewel) for the past 8-10
months.

Up through Friday, we were healthy.

Until Saturday. On Saturday, the OSDs on the SSDs started flapping and
then finally dying off, hitting their suicide timeout due to missing
heartbeats. At the time, we were running Infernalis, getting ready to
upgrade to Jewel.

I spent the weekend and Monday attempting to stabilize those OSDs,
unfortunately failing. As part of the stabilization attempts, I checked
iostat -x; the SSDs were seeing 1000 iops each. I checked wear levels
and overall SMART health of the SSDs; everything looked normal. I
checked to make sure the time was in sync between all hosts.

I also tried to move the metadata pool to the spinning disks (to
remove some dependence on the SSDs, just in case). The suicide timeout
issues followed the pool migration. The spinning disks started timing
out. This was at a time when *all* of the client IOPs to the ceph
cluster were in the low 100's as reported by ceph -s. I was
restarting failed OSDs as fast as they were dying and I couldn't keep
up. I checked the switches and NICs for errors and drops. No changes
in the frequency of them. We're talking an error every 20-25 minutes.
I would expect network issues to affect other OSDs (and pools) in the
system, too.

On Tuesday, I got together with my coworker, and we tried together to
stabilize the cluster. We finally went into emergency maintenance
mode, as we could not get the metadata pool healthy. We stopped the
MDS, we tried again to let things stabilize, with no client IO to the
pool. Again more suicide timeouts.

Then, we rebooted the ceph nodes, figuring there *might* be something
stuck in a hardware IO queue or cache somewhere. Again more crashes
when the machines came back up.

We figured at this point, there was nothing to lose by performing the
update to Jewel, and, who knows, maybe we were hitting a bug that had
been fixed. Reboots were involved again (kernel updates, too).

More crashes.

I finally decided that there *could* be an unlikely chance that jumbo
frames might suddenly be an issue (after years of using them with
these switches). I turned down the MTUs on the ceph nodes to the
standard 1500.

More crashes.

We decided to try and let things settle out overnight, with no IO.

That brings us to today:

We have 51 Intel P3700 SSDs driving this pool, and now 26 of them have
crashed due to the suicide timeout. I've tried starting them one at a
time; they're still dying off with suicide timeouts.

I've gathered the logs I could think of:

A crashing OSD: http://people.cs.ksu.edu/~mozes/osd.16.log
CRUSH Tree: http://people.cs.ksu.edu/~mozes/crushtree.txt
OSD Tree: http://people.cs.ksu.edu/~mozes/osdtree.txt
Pool Definitions: http://people.cs.ksu.edu/~mozes/pools.txt

At the moment, we're dead in the water. I would appreciate any
pointers to getting this fixed.

--
Adam Tygart
Beocat Sysadmin

_______________________________________________
ceph-users mailing list
ceph-users@xxxxxxxxxxxxxx
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
