I've been attempting to work through this: finding the PGs that are causing hangs, determining whether they are "safe" to remove, and removing them with ceph-objectstore-tool on osd.16. I'm now getting hangs (followed by suicide timeouts) referencing PGs that I've just removed, so this doesn't seem to be all there is to the issue.

--
Adam

On Wed, Jun 1, 2016 at 12:00 PM, Brandon Morris, PMP <brandon.morris.pmp@xxxxxxxxx> wrote:
> Adam,
>
> We ran into similar issues when we got too many objects in a bucket (around 300 million). The .rgw.buckets.index pool became unable to complete backfill operations. The only way we were able to get past it was to export the offending placement group with ceph-objectstore-tool and re-import it into another OSD to complete the backfill (a rough outline of the commands is below, after the log excerpt). For us, the export operation seemed to hang and took 8 hours to complete, so if you do choose to go down this route, be patient.
>
> From your logs, it appears that pg 32.10c is the offending PG on osd.16. If you are running into the same issue we did, there will be a file that hangs when you go to export it. For whatever reason the leveldb metadata for that file hangs and causes the backfill operation to suicide the OSD.
>
> If anyone from the community has an explanation for why this happens, I would love to know. We have run into this twice now on the Infernalis codebase. We are in the process of rebuilding our cluster on Jewel, so I can't say whether or not it happens there as well.
>
> ---------
> Here are the pertinent lines from your log:
>
> 2016-06-01 09:26:54.683922 7f34c5e41700  7 osd.16 pg_epoch: 497663 pg[32.10c( v 477010'1607561 (459778'1604561,477010'1607561] local-les=493771 n=3917 ec=44014 les/c/f 493771/486667/0 497332/497662/497662) [214,143,448]/[16] r=0 lpr=497662 pi=483321-497661/190 rops=1 bft=143,214,448 crt=0'0 lcod 0'0 mlcod 0'0 undersized+degraded+remapped+backfilling+peered] send_push_op 32:30966cd6:::100042c76a0.00000000:head v 250315'1040233 size 0 recovery_info: ObjectRecoveryInfo(32:30966cd6:::100042c76a0.00000000:head@250315'1040233, size: 0, copy_subset: [], clone_subset: {})
> [...]
> 2016-06-01 09:27:25.091411 7f34cf856700  1 heartbeat_map is_healthy 'OSD::recovery_tp thread 0x7f34c5e41700' had timed out after 30
> [...]
> 2016-06-01 09:31:57.201645 7f3510669700  1 heartbeat_map is_healthy 'OSD::recovery_tp thread 0x7f34c5e41700' had timed out after 30
> 2016-06-01 09:31:57.201671 7f3510669700  1 heartbeat_map is_healthy 'OSD::recovery_tp thread 0x7f34c5e41700' had suicide timed out after 300
> common/HeartbeatMap.cc: In function 'bool ceph::HeartbeatMap::_check(const ceph::heartbeat_handle_d*, const char*, time_t)' thread 7f3510669700 time 2016-06-01 09:31:57.201687
> common/HeartbeatMap.cc: 86: FAILED assert(0 == "hit suicide timeout")
>  ceph version 10.2.1 (3a66dd4f30852819c1bdaa8ec23c795d4ad77269)
>  1: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x85) [0x7f35167bb5b5]
>  2: (ceph::HeartbeatMap::_check(ceph::heartbeat_handle_d const*, char const*, long)+0x2e1) [0x7f35166f7bf1]
>  3: (ceph::HeartbeatMap::is_healthy()+0xde) [0x7f35166f844e]
>  4: (ceph::HeartbeatMap::check_touch_file()+0x2c) [0x7f35166f8c2c]
>  5: (CephContextServiceThread::entry()+0x15b) [0x7f35167d331b]
>  6: (()+0x7dc5) [0x7f35146ecdc5]
>  7: (clone()+0x6d) [0x7f3512d77ced]
>  NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to interpret this.
> 2016-06-01 09:31:57.205990 7f3510669700 -1 common/HeartbeatMap.cc: In function 'bool ceph::HeartbeatMap::_check(const ceph::heartbeat_handle_d*, const char*, time_t)' thread 7f3510669700 time 2016-06-01 09:31:57.201687
> common/HeartbeatMap.cc: 86: FAILED assert(0 == "hit suicide timeout")
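>
> In case you do go the export/import route, the sequence we used was roughly the one below. Treat it as a sketch rather than exact commands: the OSD ids, paths, and export file name here are only examples, and the OSD you run ceph-objectstore-tool against has to be stopped first.
>
>     # On the host holding the stuck copy (osd.16 here), with the OSD stopped:
>     ceph-objectstore-tool --data-path /var/lib/ceph/osd/ceph-16 \
>         --journal-path /var/lib/ceph/osd/ceph-16/journal \
>         --pgid 32.10c --op export --file /root/pg.32.10c.export
>
>     # Once the export finally completes, remove that copy from the source OSD:
>     ceph-objectstore-tool --data-path /var/lib/ceph/osd/ceph-16 \
>         --journal-path /var/lib/ceph/osd/ceph-16/journal \
>         --pgid 32.10c --op remove
>
>     # On another OSD (also stopped; osd.NN is a placeholder), import the PG,
>     # then start the OSD and let peering/backfill finish:
>     ceph-objectstore-tool --data-path /var/lib/ceph/osd/ceph-NN \
>         --journal-path /var/lib/ceph/osd/ceph-NN/journal \
>         --op import --file /root/pg.32.10c.export
>
> Keep a copy of the export file somewhere safe, and make sure you know where the surviving replicas of that PG live before removing anything.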
>
> Brandon
>
> On Wed, Jun 1, 2016 at 9:13 AM, Adam Tygart <mozes@xxxxxxx> wrote:
>>
>> Hello all,
>>
>> I'm running into an issue with ceph OSDs crashing over the last 4 days. I'm running Jewel (10.2.1) on CentOS 7.2.1511.
>>
>> A little setup information:
>> 26 hosts
>> 2x 400GB Intel DC P3700 SSDs
>> 12x 6TB spinning disks
>> 4x 4TB spinning disks
>>
>> The SSDs are used both for journals and as OSDs (for the cephfs metadata pool).
>>
>> We had been running Ceph with some success in this configuration (upgrading from Hammer to Infernalis to Jewel) for the past 8-10 months.
>>
>> Up through Friday, we were healthy.
>>
>> Until Saturday. On Saturday, the OSDs on the SSDs started flapping and then finally dying off, hitting their suicide timeouts due to missed heartbeats. At the time, we were running Infernalis, getting ready to upgrade to Jewel.
>>
>> I spent the weekend and Monday attempting to stabilize those OSDs, unfortunately failing. As part of the stabilization attempts, I checked iostat -x; the SSDs were seeing 1000 IOPS each. I checked wear levels and overall SMART health of the SSDs; everything looks normal. I checked to make sure the time was in sync between all hosts.
>>
>> I also tried to move the metadata pool to the spinning disks (to remove some dependence on the SSDs, just in case). The suicide-timeout issues followed the pool migration: the spinning disks started timing out. This was at a time when *all* of the client IOPS to the ceph cluster were in the low 100s, as reported by ceph -s. I was restarting failed OSDs as fast as they were dying and I couldn't keep up. I checked the switches and NICs for errors and drops. No change in their frequency (we're talking an error every 20-25 minutes), and I would expect network issues to affect other OSDs (and pools) in the system, too.
>>
>> On Tuesday, I got together with my coworker, and we tried to stabilize the cluster together. We finally went into emergency maintenance mode, as we could not get the metadata pool healthy. We stopped the MDS and tried again to let things stabilize, with no client IO to the pool. Again, more suicide timeouts.
>>
>> Then we rebooted the ceph nodes, figuring there *might* be something stuck in a hardware IO queue or cache somewhere. Again, more crashes when the machines came back up.
>>
>> We figured at this point there was nothing to lose by performing the update to Jewel, and, who knows, maybe we were hitting a bug that had been fixed. Reboots were involved again (kernel updates, too).
>>
>> More crashes.
>>
>> I finally decided that there *could* be an unlikely chance that jumbo frames might suddenly be an issue (after years of using them with these switches). I turned the MTUs on the ceph nodes down to the standard 1500.
>>
>> More crashes.
>>
>> We decided to try to let things settle out overnight, with no IO. That brings us to today:
>>
>> We have 51 Intel P3700 SSDs driving this pool, and now 26 of them have crashed due to the suicide timeout. I've tried starting them one at a time; they're still dying off with suicide timeouts.
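>>
>> For reference, the timeouts in the OSD logs look like the recovery-thread defaults ("had timed out after 30", "had suicide timed out after 300"). If I'm reading the config options right, those correspond to osd_recovery_thread_timeout and osd_recovery_thread_suicide_timeout. If anyone knows whether it's sane to raise them temporarily, just to keep OSDs alive long enough for backfill to make progress, I'd be interested. I'm picturing something like the following in ceph.conf on the OSD nodes (values are only illustrative):
>>
>>     [osd]
>>         osd recovery thread timeout = 120            # default 30 seconds
>>         osd recovery thread suicide timeout = 1800   # default 300 seconds
>>
>> I realize that would only paper over whatever is making individual recovery ops take minutes, so I'd still like to understand the root cause.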
>>
>> I've gathered the logs I could think of:
>> A crashing OSD: http://people.cs.ksu.edu/~mozes/osd.16.log
>> CRUSH tree: http://people.cs.ksu.edu/~mozes/crushtree.txt
>> OSD tree: http://people.cs.ksu.edu/~mozes/osdtree.txt
>> Pool definitions: http://people.cs.ksu.edu/~mozes/pools.txt
>>
>> At the moment, we're dead in the water. I would appreciate any pointers to getting this fixed.
>>
>> --
>> Adam Tygart
>> Beocat Sysadmin
>> _______________________________________________
>> ceph-users mailing list
>> ceph-users@xxxxxxxxxxxxxx
>> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>
>
_______________________________________________
ceph-users mailing list
ceph-users@xxxxxxxxxxxxxx
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com