Re: Crashing OSDs (suicide timeout, following a single pool)

Okay,

Exporting, removing and importing the pgs seems to be working
(slowly). The question now becomes: why does an export/import work?
That makes me think there is a bug somewhere in the pg loading code.
Or does it have to do with re-creating the leveldb databases? The same
number of objects are still in each pg, along with the same number of
omap keys... Something doesn't seem quite right.
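
In case it helps anyone reproduce the comparison: the per-pg object
count can be pulled with ceph-objectstore-tool (OSD stopped) and the
omap key counts with plain rados. The pool name below is a placeholder
for the actual metadata pool, and the object name is just one lifted
from the log further down:

  # objects in the pg (with the OSD stopped)
  ceph-objectstore-tool --data-path /var/lib/ceph/osd/ceph-16 \
    --journal-path /var/lib/ceph/osd/ceph-16/journal \
    --pgid 32.10c --op list | wc -l
  # omap keys on a single metadata object
  rados -p cephfs_metadata listomapkeys 100042c76a0.00000000 | wc -l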

If it is too many files in a single directory, what would be the upper
limit to target? I'd like to know when I should be yelling and kicking
and screaming at my users to fix their code.
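
(To find the oversized directories in the first place, I'm assuming the
CephFS virtual xattrs are the easiest route; the mount point and path
below are placeholders:

  # immediate and recursive entry counts for a directory
  getfattr -n ceph.dir.entries /mnt/cephfs/some/dir
  getfattr -n ceph.dir.rentries /mnt/cephfs/some/dir
)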

On Wed, Jun 1, 2016 at 6:07 PM, Brandon Morris, PMP
<brandon.morris.pmp@xxxxxxxxx> wrote:
> I concur with Greg.
>
> The only way that I was able to get back to Health_OK was to export/import.
> ***** Please note: any time you use ceph-objectstore-tool you risk data
> loss if not done carefully. Never remove a PG until you have a known good
> export. *****
>
> Here are the steps I used:
>
> 1. Set the noout and nobackfill flags (example commands for the steps that
> don't have one are sketched after this list)
> 2. Stop the OSDs that have the erroring PG
> 3. Flush the journal and export the primary version of the PG.  This took 1
> minute on a well-behaved PG and 4 hours on the misbehaving PG
>   i.e.   ceph-objectstore-tool --data-path /var/lib/ceph/osd/ceph-16
> --journal-path /var/lib/ceph/osd/ceph-16/journal --pgid 32.10c --op export
> --file /root/32.10c.b.export
>
> 4. Import the PG into a new / temporary OSD that is also offline,
>   i.e.   ceph-objectstore-tool --data-path /var/lib/ceph/osd/ceph-100
> --journal-path /var/lib/ceph/osd/ceph-100/journal --op import
> --file /root/32.10c.b.export
>
> 5. Remove the PG from all other OSDs (16, 143, 214, and 448 in your case,
> it looks like)
> 6. Start the cluster OSDs
> 7. Start the temporary OSD(s) and ensure 32.10c backfills correctly to the
> 3 OSDs it is supposed to be on.
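>
> Roughly, the commands behind the steps that don't have one spelled out
> above would look like the following (OSD ids and paths are placeholders
> taken from this thread, adjust to your layout, and treat --op remove with
> the same caution as the warning above):
>
>   # step 1: set the flags
>   ceph osd set noout
>   ceph osd set nobackfill
>
>   # step 3: flush the journal once the OSD is stopped
>   ceph-osd -i 16 --flush-journal
>
>   # step 5: remove the stale copy of the PG from each other (stopped) OSD
>   ceph-objectstore-tool --data-path /var/lib/ceph/osd/ceph-143 \
>     --journal-path /var/lib/ceph/osd/ceph-143/journal \
>     --pgid 32.10c --op remove
>
>   # step 7: watch the backfill, then clear the flags
>   ceph pg 32.10c query
>   ceph osd unset nobackfill
>   ceph osd unset noout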
>
> This is similar to the recovery process described in this post from
> 04/09/2015:
> http://ceph-users.ceph.narkive.com/lwDkR2fZ/recovering-incomplete-pgs-with-ceph-objectstore-tool
> Hopefully it works in your case too and you can get the cluster back to a
> state where you can make the CephFS directories smaller.
>
> - Brandon
>
> On Wed, Jun 1, 2016 at 4:22 PM, Gregory Farnum <gfarnum@xxxxxxxxxx> wrote:
>>
>> On Wed, Jun 1, 2016 at 2:47 PM, Adam Tygart <mozes@xxxxxxx> wrote:
>> > I tried to compact the leveldb on osd 16 and the osd is still hitting
>> > the suicide timeout. I know I've got some users with more than 1
>> > million files in single directories.
>> >
>> > Now that I'm in this situation, can I get some pointers on how can I
>> > use either of your options?
>>
>> In a literal sense, you either make the CephFS directories smaller by
>> moving files out of them, or you enable directory fragmentation with
>> http://docs.ceph.com/docs/master/cephfs/experimental-features/#directory-fragmentation,
>> but if you have users I *really* wouldn't recommend the latter just yet.
>> (Notice the text about these experimental features being "not fully
>> stabilized or qualified for users to turn on in real deployments".)
>>
>> Since you're doing recovery, you should be able to do the
>> ceph-objectstore-tool export/import thing to get the PG to its new
>> locations, but just deleting it certainly won't help!
>> -Greg
>>
>> >
>> > Thanks,
>> > Adam
>> >
>> > On Wed, Jun 1, 2016 at 4:33 PM, Gregory Farnum <gfarnum@xxxxxxxxxx>
>> > wrote:
>> >> If that pool is your metadata pool, it looks at a quick glance like
>> >> it's timing out somewhere while reading and building up the omap
>> >> contents (ie, the contents of a directory). Which might make sense if,
>> >> say, you have very fragmented leveldb stores combined with very large
>> >> CephFS directories. Trying to make the leveldbs happier (I think there
>> >> are some options to compact on startup, etc?) might help; otherwise
>> >> you might be running into the same "too-large omap collections" thing
>> >> that Brandon referred to. Which in CephFS can be fixed by either
>> >> having smaller folders or (if you're very nervy, and ready to turn on
>> >> something we think works but don't test enough) enabling directory
>> >> fragmentation.
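>> >>
>> >> (I *think* the compact-on-startup knob is leveldb_compact_on_mount;
>> >> treat the exact option name as an assumption and verify it on your
>> >> version before relying on it. Something like this in ceph.conf, then
>> >> restart the OSD:
>> >>
>> >>   [osd]
>> >>   # assumed option name: compacts the omap leveldb when the OSD mounts
>> >>   # it; verify with "ceph daemon osd.16 config show | grep compact"
>> >>   leveldb_compact_on_mount = true
>> >> )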
>> >> -Greg
>> >>
>> >> On Wed, Jun 1, 2016 at 2:14 PM, Adam Tygart <mozes@xxxxxxx> wrote:
>> >>> I've been attempting to work through this, finding the pgs that are
>> >>> causing hangs, determining if they are "safe" to remove, and removing
>> >>> them with ceph-objectstore-tool on osd 16.
>> >>>
>> >>> I'm now getting hangs (followed by suicide timeouts) referencing pgs
>> >>> that I've just removed, so this doesn't seem to be all there is to the
>> >>> issue.
>> >>>
>> >>> --
>> >>> Adam
>> >>>
>> >>> On Wed, Jun 1, 2016 at 12:00 PM, Brandon Morris, PMP
>> >>> <brandon.morris.pmp@xxxxxxxxx> wrote:
>> >>>> Adam,
>> >>>>
>> >>>>      We ran into similar issues when we got too many objects in a
>> >>>> bucket (around 300 million). The .rgw.buckets.index pool became unable
>> >>>> to complete backfill operations. The only way we were able to get past
>> >>>> it was to export the offending placement group with the
>> >>>> ceph-objectstore-tool and re-import it into another OSD to complete
>> >>>> the backfill. For us, the export operation seemed to hang and took 8
>> >>>> hours, so if you do choose to go down this route, be patient.
>> >>>>
>> >>>> From your logs, it appears that pg 32.10c is the offending PG on
>> >>>> OSD.16. If you are running into the same issue we did, when you go to
>> >>>> export it there will be one file that the export hangs on. For
>> >>>> whatever reason, reading the leveldb metadata for that file hangs and
>> >>>> causes the backfill operation to hit the OSD's suicide timeout.
>> >>>>
>> >>>> If anyone from the community has an explanation for why this happens
>> >>>> I would
>> >>>> love to know.  We have run into this twice now on the Infernalis
>> >>>> codebase.
>> >>>> We are in the process of rebuilding our cluster to Jewel, so can't
>> >>>> say
>> >>>> whether or not it happens there as well.
>> >>>>
>> >>>> ---------
>> >>>> Here are the pertinent lines from your log.
>> >>>>
>> >>>> 2016-06-01 09:26:54.683922 7f34c5e41700  7 osd.16 pg_epoch: 497663
>> >>>> pg[32.10c( v 477010'1607561 (459778'1604561,477010'1607561]
>> >>>> local-les=493771
>> >>>> n=3917 ec=44014 les/c/f 493771/486667/0 497332/497662/497662)
>> >>>> [214,143,448]/[16] r=0 lpr=497662 pi=483321-497661/190 rops=1
>> >>>> bft=143,214,448 crt=0'0 lcod 0'0 mlcod 0'0
>> >>>> undersized+degraded+remapped+backfilling+peered] send_push_op
>> >>>> 32:30966cd6:::100042c76a0.00000000:head v 250315'1040233 size 0
>> >>>> recovery_info:
>> >>>>
>> >>>> ObjectRecoveryInfo(32:30966cd6:::100042c76a0.00000000:head@250315'1040233,
>> >>>> size: 0, copy_subset: [], clone_subset: {})
>> >>>> [...]
>> >>>> 2016-06-01 09:27:25.091411 7f34cf856700  1 heartbeat_map is_healthy
>> >>>> 'OSD::recovery_tp thread 0x7f34c5e41700' had timed out after 30
>> >>>> [...]
>> >>>> 2016-06-01 09:31:57.201645 7f3510669700  1 heartbeat_map is_healthy
>> >>>> 'OSD::recovery_tp thread 0x7f34c5e41700' had timed out after 30
>> >>>> 2016-06-01 09:31:57.201671 7f3510669700  1 heartbeat_map is_healthy
>> >>>> 'OSD::recovery_tp thread 0x7f34c5e41700' had suicide timed out after
>> >>>> 300
>> >>>> common/HeartbeatMap.cc: In function 'bool
>> >>>> ceph::HeartbeatMap::_check(const
>> >>>> ceph::heartbeat_handle_d*, const char*, time_t)' thread 7f3510669700
>> >>>> time
>> >>>> 2016-06-01 09:31:57.201687
>> >>>> common/HeartbeatMap.cc: 86: FAILED assert(0 == "hit suicide timeout")
>> >>>>  ceph version 10.2.1 (3a66dd4f30852819c1bdaa8ec23c795d4ad77269)
>> >>>>  1: (ceph::__ceph_assert_fail(char const*, char const*, int, char
>> >>>> const*)+0x85) [0x7f35167bb5b5]
>> >>>>  2: (ceph::HeartbeatMap::_check(ceph::heartbeat_handle_d const*, char
>> >>>> const*, long)+0x2e1) [0x7f35166f7bf1]
>> >>>>  3: (ceph::HeartbeatMap::is_healthy()+0xde) [0x7f35166f844e]
>> >>>>  4: (ceph::HeartbeatMap::check_touch_file()+0x2c) [0x7f35166f8c2c]
>> >>>>  5: (CephContextServiceThread::entry()+0x15b) [0x7f35167d331b]
>> >>>>  6: (()+0x7dc5) [0x7f35146ecdc5]
>> >>>>  7: (clone()+0x6d) [0x7f3512d77ced]
>> >>>>  NOTE: a copy of the executable, or `objdump -rdS <executable>` is
>> >>>> needed to
>> >>>> interpret this.
>> >>>> 2016-06-01 09:31:57.205990 7f3510669700 -1 common/HeartbeatMap.cc: In
>> >>>> function 'bool ceph::HeartbeatMap::_check(const
>> >>>> ceph::heartbeat_handle_d*,
>> >>>> const char*, time_t)' thread 7f3510669700 time 2016-06-01
>> >>>> 09:31:57.201687
>> >>>> common/HeartbeatMap.cc: 86: FAILED assert(0 == "hit suicide timeout")
>> >>>>
>> >>>> Brandon
>> >>>>
>> >>>>
>> >>>> On Wed, Jun 1, 2016 at 9:13 AM, Adam Tygart <mozes@xxxxxxx> wrote:
>> >>>>>
>> >>>>> Hello all,
>> >>>>>
>> >>>>> I'm running into an issue with ceph osds crashing over the last 4
>> >>>>> days. I'm running Jewel (10.2.1) on CentOS 7.2.1511.
>> >>>>>
>> >>>>> A little setup information:
>> >>>>> 26 hosts
>> >>>>> 2x 400GB Intel DC P3700 SSDs
>> >>>>> 12x6TB spinning disks
>> >>>>> 4x4TB spinning disks.
>> >>>>>
>> >>>>> The SSDs are used both for journals and as OSDs (for the cephfs
>> >>>>> metadata pool).
>> >>>>>
>> >>>>> We were running Ceph with some success in this configuration
>> >>>>> (upgrading ceph from hammer to infernalis to jewel) for the past
>> >>>>> 8-10
>> >>>>> months.
>> >>>>>
>> >>>>> Up through Friday, we were healthy.
>> >>>>>
>> >>>>> Until Saturday. On Saturday, the OSDs on the SSDs started flapping
>> >>>>> and
>> >>>>> then finally dying off, hitting their suicide timeout due to missing
>> >>>>> heartbeats. At the time, we were running Infernalis, getting ready
>> >>>>> to
>> >>>>> upgrade to Jewel.
>> >>>>>
>> >>>>> I spent the weekend and Monday attempting to stabilize those OSDs,
>> >>>>> unfortunately failing. As part of the stabilization attempts, I
>> >>>>> checked iostat -x; the SSDs were seeing 1000 iops each. I checked
>> >>>>> wear levels and overall SMART health of the SSDs; everything looked
>> >>>>> normal. I checked to make sure the time was in sync between all
>> >>>>> hosts.
>> >>>>>
>> >>>>> I also tried to move the metadata pool to the spinning disks (to
>> >>>>> remove some dependence on the SSDs, just in case). The suicide
>> >>>>> timeout issues followed the pool migration, and the spinning disks
>> >>>>> started timing out. This was at a time when *all* of the client IOPs
>> >>>>> to the ceph cluster were in the low 100's as reported by ceph -s. I
>> >>>>> was restarting failed OSDs as fast as they were dying and I couldn't
>> >>>>> keep up. I checked the switches and NICs for errors and drops. No
>> >>>>> change in their frequency; we're talking an error every 20-25
>> >>>>> minutes. I would expect network issues to affect other OSDs (and
>> >>>>> pools) in the system, too.
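>> >>>>>
>> >>>>> (For anyone following along, moving a pool like this normally just
>> >>>>> means retargeting its CRUSH ruleset; the pool name and ruleset id
>> >>>>> below are placeholders, not necessarily what we used:
>> >>>>>
>> >>>>>   # list rulesets, then point the metadata pool at a spinning-disk one
>> >>>>>   ceph osd crush rule dump | grep -E 'rule_name|ruleset'
>> >>>>>   ceph osd pool set cephfs_metadata crush_ruleset 1
>> >>>>> )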
>> >>>>>
>> >>>>> On Tuesday, I got together with my coworker and we tried to
>> >>>>> stabilize the cluster together. We finally went into emergency
>> >>>>> maintenance mode, as we could not get the metadata pool healthy. We
>> >>>>> stopped the MDS and tried again to let things stabilize, with no
>> >>>>> client IO to the pool. Again, more suicide timeouts.
>> >>>>>
>> >>>>> Then, we rebooted the ceph nodes, figuring there *might* be
>> >>>>> something
>> >>>>> stuck in a hardware IO queue or cache somewhere. Again more crashes
>> >>>>> when the machines came back up.
>> >>>>>
>> >>>>> We figured at this point, there was nothing to lose by performing
>> >>>>> the
>> >>>>> update to Jewel, and, who knows, maybe we were hitting a bug that
>> >>>>> had
>> >>>>> been fixed. Reboots were involved again (kernel updates, too).
>> >>>>>
>> >>>>> More crashes.
>> >>>>>
>> >>>>> I finally decided that there *could* be an unlikely chance that
>> >>>>> jumbo frames had suddenly become an issue (after years of using
>> >>>>> them with these switches). I turned the MTUs on the ceph nodes back
>> >>>>> down to the standard 1500.
>> >>>>>
>> >>>>> More crashes.
>> >>>>>
>> >>>>> We decided to try and let things settle out overnight, with no IO.
>> >>>>> That brings us to today:
>> >>>>>
>> >>>>> We have 51 Intel P3700 SSDs driving this pool, and now 26 of them
>> >>>>> have crashed due to the suicide timeout. I've tried starting them
>> >>>>> one at a time, but they're still dying off with suicide timeouts.
>> >>>>>
>> >>>>> I've gathered the logs I could think to:
>> >>>>> A crashing OSD: http://people.cs.ksu.edu/~mozes/osd.16.log
>> >>>>> CRUSH Tree: http://people.cs.ksu.edu/~mozes/crushtree.txt
>> >>>>> OSD Tree: http://people.cs.ksu.edu/~mozes/osdtree.txt
>> >>>>> Pool Definitions: http://people.cs.ksu.edu/~mozes/pools.txt
>> >>>>>
>> >>>>> At the moment, we're dead in the water. I would appreciate any
>> >>>>> pointers to getting this fixed.
>> >>>>>
>> >>>>> --
>> >>>>> Adam Tygart
>> >>>>> Beocat Sysadmin
>> >>>>
>> >>>>
>
>
_______________________________________________
ceph-users mailing list
ceph-users@xxxxxxxxxxxxxx
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


