Re: Crashing OSDs (suicide timeout, following a single pool)

I tried to compact the leveldb on osd.16 and the OSD is still hitting
the suicide timeout. I know I've got some users with more than 1
million files in a single directory.
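
A rough way to gauge how big a given directory's omap actually is (the
pool name below is a placeholder for your CephFS metadata pool; the
object name is the directory's inode in hex plus the fragment id, like
the one in the osd.16 log further down the thread):

    # count the dentries held in one directory object's omap
    rados -p cephfs_metadata listomapkeys 100042c76a0.00000000 | wc -l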

Now that I'm in this situation, can I get some pointers on how I can
use either of your options?
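
My best guess so far, pieced together from the docs (the option and
command names here are assumptions I have not verified, not confirmed
syntax):

    # 1) compact the omap leveldb whenever the OSD starts: in ceph.conf
    #    (check the exact option name with
    #     `ceph daemon osd.16 config show | grep compact`)
    [osd]
        leveldb_compact_on_mount = true

    # 2) enable directory fragmentation on the filesystem
    #    (experimental in Jewel; may also need --yes-i-really-mean-it)
    ceph fs set <fsname> allow_dirfrags true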

Thanks,
Adam

On Wed, Jun 1, 2016 at 4:33 PM, Gregory Farnum <gfarnum@xxxxxxxxxx> wrote:
> If that pool is your metadata pool, it looks at a quick glance like
> it's timing out somewhere while reading and building up the omap
> contents (i.e., the contents of a directory), which might make sense
> if, say, you have very fragmented leveldb stores combined with very
> large CephFS directories. Trying to make the leveldbs happier (I think
> there are some options to compact on startup, etc.) might help;
> otherwise you might be running into the same "too-large omap
> collections" thing that Brandon referred to. In CephFS that can be
> fixed either by having smaller folders or (if you're very nervy, and
> ready to turn on something we think works but don't test enough) by
> enabling directory fragmentation.
> -Greg
>
> On Wed, Jun 1, 2016 at 2:14 PM, Adam Tygart <mozes@xxxxxxx> wrote:
>> I've been attempting to work through this, finding the pgs that are
>> causing hangs, determining if they are "safe" to remove, and removing
>> them with ceph-objectstore-tool on osd 16.
>>
>> I'm now getting hangs (followed by suicide timeouts) referencing pgs
>> that I've just removed, so this doesn't seem to be all there is to the
>> issue.
>>
>> --
>> Adam
>>
>> On Wed, Jun 1, 2016 at 12:00 PM, Brandon Morris, PMP
>> <brandon.morris.pmp@xxxxxxxxx> wrote:
>>> Adam,
>>>
>>>      We ran into similar issues when we got too many objects in a bucket
>>> (around 300 million). The .rgw.buckets.index pool became unable to complete
>>> backfill operations. The only way we were able to get past it was to
>>> export the offending placement group with the ceph-objectstore-tool and
>>> re-import it into another OSD to complete the backfill. For us, the export
>>> operation seemed to hang, and took 8 hours to complete, so if you do choose
>>> to go down this route, be patient.
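
(The export/import Brandon describes looks roughly like the following.
The OSD and PG ids are the ones from this thread, both OSDs have to be
stopped first, and the exact flags may differ by release, so treat it
as a sketch rather than a recipe.)

    # export the hanging PG from the stopped source OSD
    ceph-objectstore-tool --data-path /var/lib/ceph/osd/ceph-16 \
        --journal-path /var/lib/ceph/osd/ceph-16/journal \
        --pgid 32.10c --op export --file /root/pg.32.10c.export

    # remove it from that OSD once the export has completed
    ceph-objectstore-tool --data-path /var/lib/ceph/osd/ceph-16 \
        --journal-path /var/lib/ceph/osd/ceph-16/journal \
        --pgid 32.10c --op remove

    # import it into another stopped OSD (osd.20 is just an example)
    ceph-objectstore-tool --data-path /var/lib/ceph/osd/ceph-20 \
        --journal-path /var/lib/ceph/osd/ceph-20/journal \
        --op import --file /root/pg.32.10c.export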
>>>
>>> From your logs, it appears that pg 32.10c is the offending PG on osd.16.
>>> If you are running into the same issue we did, the export will stall on
>>> one particular file: for whatever reason the leveldb metadata for that
>>> file hangs, and during backfill that hang drives the OSD into its
>>> suicide timeout.
>>>
>>> If anyone from the community has an explanation for why this happens I would
>>> love to know.  We have run into this twice now on the Infernalis codebase.
>>> We are in the process of rebuilding our cluster to Jewel, so can't say
>>> whether or not it happens there as well.
>>>
>>> ---------
>>> Here are the pertinent lines from your log.
>>>
>>> 2016-06-01 09:26:54.683922 7f34c5e41700  7 osd.16 pg_epoch: 497663
>>> pg[32.10c( v 477010'1607561 (459778'1604561,477010'1607561] local-les=493771
>>> n=3917 ec=44014 les/c/f 493771/486667/0 497332/497662/497662)
>>> [214,143,448]/[16] r=0 lpr=497662 pi=483321-497661/190 rops=1
>>> bft=143,214,448 crt=0'0 lcod 0'0 mlcod 0'0
>>> undersized+degraded+remapped+backfilling+peered] send_push_op
>>> 32:30966cd6:::100042c76a0.00000000:head v 250315'1040233 size 0
>>> recovery_info:
>>> ObjectRecoveryInfo(32:30966cd6:::100042c76a0.00000000:head@250315'1040233,
>>> size: 0, copy_subset: [], clone_subset: {})
>>> [...]
>>> 2016-06-01 09:27:25.091411 7f34cf856700  1 heartbeat_map is_healthy
>>> 'OSD::recovery_tp thread 0x7f34c5e41700' had timed out after 30
>>> [...]
>>> 2016-06-01 09:31:57.201645 7f3510669700  1 heartbeat_map is_healthy
>>> 'OSD::recovery_tp thread 0x7f34c5e41700' had timed out after 30
>>> 2016-06-01 09:31:57.201671 7f3510669700  1 heartbeat_map is_healthy
>>> 'OSD::recovery_tp thread 0x7f34c5e41700' had suicide timed out after 300
>>> common/HeartbeatMap.cc: In function 'bool ceph::HeartbeatMap::_check(const
>>> ceph::heartbeat_handle_d*, const char*, time_t)' thread 7f3510669700 time
>>> 2016-06-01 09:31:57.201687
>>> common/HeartbeatMap.cc: 86: FAILED assert(0 == "hit suicide timeout")
>>>  ceph version 10.2.1 (3a66dd4f30852819c1bdaa8ec23c795d4ad77269)
>>>  1: (ceph::__ceph_assert_fail(char const*, char const*, int, char
>>> const*)+0x85) [0x7f35167bb5b5]
>>>  2: (ceph::HeartbeatMap::_check(ceph::heartbeat_handle_d const*, char
>>> const*, long)+0x2e1) [0x7f35166f7bf1]
>>>  3: (ceph::HeartbeatMap::is_healthy()+0xde) [0x7f35166f844e]
>>>  4: (ceph::HeartbeatMap::check_touch_file()+0x2c) [0x7f35166f8c2c]
>>>  5: (CephContextServiceThread::entry()+0x15b) [0x7f35167d331b]
>>>  6: (()+0x7dc5) [0x7f35146ecdc5]
>>>  7: (clone()+0x6d) [0x7f3512d77ced]
>>>  NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to
>>> interpret this.
>>> 2016-06-01 09:31:57.205990 7f3510669700 -1 common/HeartbeatMap.cc: In
>>> function 'bool ceph::HeartbeatMap::_check(const ceph::heartbeat_handle_d*,
>>> const char*, time_t)' thread 7f3510669700 time 2016-06-01 09:31:57.201687
>>> common/HeartbeatMap.cc: 86: FAILED assert(0 == "hit suicide timeout")
>>>
>>> Brandon
>>>
>>>
>>> On Wed, Jun 1, 2016 at 9:13 AM, Adam Tygart <mozes@xxxxxxx> wrote:
>>>>
>>>> Hello all,
>>>>
>>>> I'm running into an issue with ceph osds crashing over the last 4
>>>> days. I'm running Jewel (10.2.1) on CentOS 7.2.1511.
>>>>
>>>> A little setup information:
>>>> 26 hosts, each with:
>>>> 2x 400GB Intel DC P3700 SSDs
>>>> 12x 6TB spinning disks
>>>> 4x 4TB spinning disks
>>>>
>>>> The SSDs are used both for journals and as OSDs (for the cephfs
>>>> metadata pool).
>>>>
>>>> We were running Ceph with some success in this configuration
>>>> (upgrading ceph from hammer to infernalis to jewel) for the past 8-10
>>>> months.
>>>>
>>>> Up through Friday, we were healthy.
>>>>
>>>> Then, on Saturday, the OSDs on the SSDs started flapping and finally
>>>> dying off, hitting their suicide timeouts due to missed heartbeats. At
>>>> the time, we were running Infernalis, getting ready to upgrade to
>>>> Jewel.
>>>>
>>>> I spent the weekend and Monday attempting to stabilize those OSDs,
>>>> unfortunately failing. As part of the stabilization attempts, I checked
>>>> iostat -x; the SSDs were seeing 1000 IOPS each. I checked wear levels
>>>> and overall SMART health of the SSDs; everything looked normal. I
>>>> checked that the time was in sync across all hosts.
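
(For reference, checks like these look roughly as follows; device names
are placeholders, and NVMe SMART output needs a reasonably recent
smartmontools or nvme-cli:)

    iostat -x 1 5              # per-device utilization and IOPS
    smartctl -a /dev/nvme0     # wear level / SMART health
    nvme smart-log /dev/nvme0  # alternative via nvme-cli
    ntpq -p                    # confirm clocks are in sync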
>>>>
>>>> I also tried to move the metadata pool to the spinning disks (to
>>>> remove some dependence on the SSDs, just in case). The suicide timeout
>>>> issues followed the pool migration: the spinning disks started timing
>>>> out. This was at a time when *all* of the client IOPS to the ceph
>>>> cluster were in the low 100's as reported by ceph -s. I was restarting
>>>> failed OSDs as fast as they were dying and I couldn't keep up. I
>>>> checked the switches and NICs for errors and drops; no change in their
>>>> frequency, and we're talking an error every 20-25 minutes. I would
>>>> expect network issues to affect other OSDs (and pools) in the system,
>>>> too.
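
(For reference, moving a pool like that is just a matter of pointing it
at a different CRUSH rule, roughly as below; the rule id and pool name
are placeholders:)

    ceph osd crush rule dump                       # find the rule id
    ceph osd pool set cephfs_metadata crush_ruleset <rule-id>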
>>>>
>>>> On Tuesday, I got together with my coworker and we tried to stabilize
>>>> the cluster. We finally went into emergency maintenance mode, as we
>>>> could not get the metadata pool healthy. We stopped the MDS and tried
>>>> again to let things stabilize, with no client IO to the pool. Again,
>>>> more suicide timeouts.
>>>>
>>>> Then we rebooted the ceph nodes, figuring there *might* be something
>>>> stuck in a hardware IO queue or cache somewhere. Again, more crashes
>>>> when the machines came back up.
>>>>
>>>> We figured at that point there was nothing to lose by performing the
>>>> update to Jewel, and, who knows, maybe we were hitting a bug that had
>>>> already been fixed. Reboots were involved again (kernel updates, too).
>>>>
>>>> More crashes.
>>>>
>>>> I finally decided that there *could* be an unlikely chance that jumbo
>>>> frames had suddenly become an issue (after years of using them with
>>>> these switches), so I turned the MTU on the ceph nodes back down to
>>>> the standard 1500.
>>>>
>>>> More crashes.
>>>>
>>>> We decided to try and let things settle out overnight, with no IO.
>>>> That brings us to today:
>>>>
>>>> We have 51 Intel P3700 SSDs driving this pool, and now 26 of them have
>>>> crashed due to the suicide timeout. I've tried starting them one at a
>>>> time, but they're still dying off with suicide timeouts.
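
(The timeouts being hit are the thread-pool suicide options; the 300 s
one in the osd.16 log is the recovery thread. They can be raised
temporarily in ceph.conf to give slow omap reads a chance to finish.
The option names and defaults below are from memory, so verify them
with `ceph daemon osd.N config show | grep suicide` first:)

    [osd]
        osd_op_thread_suicide_timeout = 600          # default 150
        osd_recovery_thread_suicide_timeout = 1200   # default 300
        filestore_op_thread_suicide_timeout = 600    # default 180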
>>>>
>>>> I've gathered the logs I could think of:
>>>> A crashing OSD: http://people.cs.ksu.edu/~mozes/osd.16.log
>>>> CRUSH Tree: http://people.cs.ksu.edu/~mozes/crushtree.txt
>>>> OSD Tree: http://people.cs.ksu.edu/~mozes/osdtree.txt
>>>> Pool Definitions: http://people.cs.ksu.edu/~mozes/pools.txt
>>>>
>>>> At the moment, we're dead in the water. I would appreciate any
>>>> pointers to getting this fixed.
>>>>
>>>> --
>>>> Adam Tygart
>>>> Beocat Sysadmin
>>>
>>>
_______________________________________________
ceph-users mailing list
ceph-users@xxxxxxxxxxxxxx
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


