I tried to compact the leveldb on osd 16 and the osd is still hitting
the suicide timeout. I know I've got some users with more than 1
million files in single directories. Now that I'm in this situation,
can I get some pointers on how I can use either of your options?

Thanks,
Adam

On Wed, Jun 1, 2016 at 4:33 PM, Gregory Farnum <gfarnum@xxxxxxxxxx> wrote:
> If that pool is your metadata pool, it looks at a quick glance like
> it's timing out somewhere while reading and building up the omap
> contents (i.e., the contents of a directory). Which might make sense
> if, say, you have very fragmented leveldb stores combined with very
> large CephFS directories. Trying to make the leveldbs happier (I
> think there are some options to compact on startup, etc.?) might
> help; otherwise you might be running into the same "too-large omap
> collections" thing that Brandon referred to. Which in CephFS can be
> fixed by either having smaller folders or (if you're very nervy, and
> ready to turn on something we think works but don't test enough)
> enabling directory fragmentation.
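>
> Off the top of my head, the knobs I'm thinking of look roughly like
> the sketch below. Treat the option names as from-memory guesses and
> verify them against what "ceph daemon osd.16 config show" reports on
> your build before relying on them:
>
>     # ceph.conf on the node hosting osd.16 -- ask leveldb to compact
>     # the omap store every time the OSD starts, then restart that OSD
>     [osd]
>         leveldb_compact_on_mount = true
>
>     systemctl restart ceph-osd@16
>
>     # directory fragmentation has historically been gated behind this
>     # MDS option; newer releases may also want an fs-level
>     # "allow_dirfrags" flag, so check the docs for your exact 10.2.x
>     [mds]
>         mds_bal_frag = true
>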
> -Greg
>
> On Wed, Jun 1, 2016 at 2:14 PM, Adam Tygart <mozes@xxxxxxx> wrote:
>> I've been attempting to work through this, finding the pgs that are
>> causing hangs, determining if they are "safe" to remove, and
>> removing them with ceph-objectstore-tool on osd 16.
>>
>> I'm now getting hangs (followed by suicide timeouts) referencing pgs
>> that I've just removed, so this doesn't seem to be all there is to
>> the issue.
>>
>> --
>> Adam
>>
>> On Wed, Jun 1, 2016 at 12:00 PM, Brandon Morris, PMP
>> <brandon.morris.pmp@xxxxxxxxx> wrote:
>>> Adam,
>>>
>>> We ran into similar issues when we got too many objects in a bucket
>>> (around 300 million). The .rgw.buckets.index pool became unable to
>>> complete backfill operations. The only way we were able to get past
>>> it was to export the offending placement group with
>>> ceph-objectstore-tool and re-import it into another OSD to complete
>>> the backfill. For us, the export operation seemed to hang and took
>>> 8 hours to complete, so if you do choose to go down this route, be
>>> patient.
>>>
>>> From your logs, it appears that pg 32.10c is the offending PG on
>>> osd.16. If you are running into the same issue we did, when you go
>>> to export it there will be a file that hangs. For whatever reason
>>> the leveldb metadata for that file stalls and causes the backfill
>>> operation to suicide the OSD.
>>>
>>> If anyone from the community has an explanation for why this
>>> happens, I would love to know. We have run into this twice now on
>>> the Infernalis codebase. We are in the process of rebuilding our
>>> cluster to Jewel, so I can't say whether or not it happens there as
>>> well.
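>>>
>>> For what it's worth, the rough shape of what we ran is below. The
>>> paths, the target OSD (osd.214, picked only because it is in the up
>>> set for 32.10c in your log), and the export file name are
>>> illustrative, so adjust them for your layout, and make sure the
>>> OSDs involved are stopped while the tool runs:
>>>
>>>     # on the node holding the stuck copy, with osd.16 stopped
>>>     ceph-objectstore-tool --data-path /var/lib/ceph/osd/ceph-16 \
>>>         --journal-path /var/lib/ceph/osd/ceph-16/journal \
>>>         --pgid 32.10c --op export --file /root/pg.32.10c.export
>>>
>>>     # import into another (stopped) OSD that should hold the PG,
>>>     # then start both OSDs and let backfill finish from there
>>>     ceph-objectstore-tool --data-path /var/lib/ceph/osd/ceph-214 \
>>>         --journal-path /var/lib/ceph/osd/ceph-214/journal \
>>>         --op import --file /root/pg.32.10c.export
>>>
>>> The export is the step that hung for hours in our case, so again,
>>> be patient with it.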
>>>
>>> ---------
>>> Here are the pertinent lines from your log:
>>>
>>> 2016-06-01 09:26:54.683922 7f34c5e41700 7 osd.16 pg_epoch: 497663 pg[32.10c( v 477010'1607561 (459778'1604561,477010'1607561] local-les=493771 n=3917 ec=44014 les/c/f 493771/486667/0 497332/497662/497662) [214,143,448]/[16] r=0 lpr=497662 pi=483321-497661/190 rops=1 bft=143,214,448 crt=0'0 lcod 0'0 mlcod 0'0 undersized+degraded+remapped+backfilling+peered] send_push_op 32:30966cd6:::100042c76a0.00000000:head v 250315'1040233 size 0 recovery_info: ObjectRecoveryInfo(32:30966cd6:::100042c76a0.00000000:head@250315'1040233, size: 0, copy_subset: [], clone_subset: {})
>>> [...]
>>> 2016-06-01 09:27:25.091411 7f34cf856700 1 heartbeat_map is_healthy 'OSD::recovery_tp thread 0x7f34c5e41700' had timed out after 30
>>> [...]
>>> 2016-06-01 09:31:57.201645 7f3510669700 1 heartbeat_map is_healthy 'OSD::recovery_tp thread 0x7f34c5e41700' had timed out after 30
>>> 2016-06-01 09:31:57.201671 7f3510669700 1 heartbeat_map is_healthy 'OSD::recovery_tp thread 0x7f34c5e41700' had suicide timed out after 300
>>> common/HeartbeatMap.cc: In function 'bool ceph::HeartbeatMap::_check(const ceph::heartbeat_handle_d*, const char*, time_t)' thread 7f3510669700 time 2016-06-01 09:31:57.201687
>>> common/HeartbeatMap.cc: 86: FAILED assert(0 == "hit suicide timeout")
>>> ceph version 10.2.1 (3a66dd4f30852819c1bdaa8ec23c795d4ad77269)
>>> 1: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x85) [0x7f35167bb5b5]
>>> 2: (ceph::HeartbeatMap::_check(ceph::heartbeat_handle_d const*, char const*, long)+0x2e1) [0x7f35166f7bf1]
>>> 3: (ceph::HeartbeatMap::is_healthy()+0xde) [0x7f35166f844e]
>>> 4: (ceph::HeartbeatMap::check_touch_file()+0x2c) [0x7f35166f8c2c]
>>> 5: (CephContextServiceThread::entry()+0x15b) [0x7f35167d331b]
>>> 6: (()+0x7dc5) [0x7f35146ecdc5]
>>> 7: (clone()+0x6d) [0x7f3512d77ced]
>>> NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to interpret this.
>>> 2016-06-01 09:31:57.205990 7f3510669700 -1 common/HeartbeatMap.cc: In function 'bool ceph::HeartbeatMap::_check(const ceph::heartbeat_handle_d*, const char*, time_t)' thread 7f3510669700 time 2016-06-01 09:31:57.201687
>>> common/HeartbeatMap.cc: 86: FAILED assert(0 == "hit suicide timeout")
>>>
>>> Brandon
>>>
>>> On Wed, Jun 1, 2016 at 9:13 AM, Adam Tygart <mozes@xxxxxxx> wrote:
>>>>
>>>> Hello all,
>>>>
>>>> I'm running into an issue with ceph osds crashing over the last 4
>>>> days. I'm running Jewel (10.2.1) on CentOS 7.2.1511.
>>>>
>>>> A little setup information:
>>>> 26 hosts, each with:
>>>> 2x 400GB Intel DC P3700 SSDs
>>>> 12x 6TB spinning disks
>>>> 4x 4TB spinning disks
>>>>
>>>> The SSDs are used both for journals and as OSDs (for the cephfs
>>>> metadata pool).
>>>>
>>>> We had been running Ceph with some success in this configuration
>>>> (upgrading from Hammer to Infernalis to Jewel) for the past 8-10
>>>> months.
>>>>
>>>> Up through Friday, we were healthy. Until Saturday, when the OSDs
>>>> on the SSDs started flapping and then finally dying off, hitting
>>>> their suicide timeouts due to missed heartbeats. At the time, we
>>>> were running Infernalis, getting ready to upgrade to Jewel.
>>>>
>>>> I spent the weekend and Monday attempting to stabilize those OSDs,
>>>> unfortunately without success. As part of the stabilization
>>>> attempts, I checked iostat -x; the SSDs were seeing about 1000
>>>> IOPS each. I checked wear levels and overall SMART health of the
>>>> SSDs; everything looked normal. I checked to make sure the time
>>>> was in sync between all hosts.
>>>>
>>>> I also tried to move the metadata pool to the spinning disks (to
>>>> remove some dependence on the SSDs, just in case). The suicide
>>>> timeout issues followed the pool migration: the spinning disks
>>>> started timing out. This was at a time when *all* of the client
>>>> IOPS to the ceph cluster were in the low 100s, as reported by
>>>> ceph -s. I was restarting failed OSDs as fast as they were dying
>>>> and I couldn't keep up. I checked the switches and NICs for errors
>>>> and drops; no change in their frequency, and we're talking an
>>>> error every 20-25 minutes. I would expect network issues to affect
>>>> other OSDs (and pools) in the system, too.
>>>>
>>>> On Tuesday, I got together with my coworker and we tried again to
>>>> stabilize the cluster. We finally went into emergency maintenance
>>>> mode, as we could not get the metadata pool healthy. We stopped
>>>> the MDS and tried again to let things stabilize, with no client IO
>>>> to the pool. Again, more suicide timeouts.
>>>>
>>>> Then we rebooted the ceph nodes, figuring there *might* be
>>>> something stuck in a hardware IO queue or cache somewhere. Again,
>>>> more crashes when the machines came back up.
>>>>
>>>> We figured at this point there was nothing to lose by performing
>>>> the update to Jewel, and, who knows, maybe we were hitting a bug
>>>> that had been fixed. Reboots were involved again (kernel updates,
>>>> too).
>>>>
>>>> More crashes.
>>>>
>>>> I finally decided there *could* be an unlikely chance that jumbo
>>>> frames might suddenly be an issue (after years of using them with
>>>> these switches). I turned down the MTUs on the ceph nodes to the
>>>> standard 1500.
>>>>
>>>> More crashes.
>>>>
>>>> We decided to try to let things settle out overnight, with no IO.
>>>> That brings us to today:
>>>>
>>>> We have 51 Intel P3700 SSDs driving this pool, and now 26 of them
>>>> have crashed due to the suicide timeout. I've tried starting them
>>>> one at a time, and they're still dying off with suicide timeouts.
>>>>
>>>> I've gathered the logs I could think of:
>>>> A crashing OSD: http://people.cs.ksu.edu/~mozes/osd.16.log
>>>> CRUSH Tree: http://people.cs.ksu.edu/~mozes/crushtree.txt
>>>> OSD Tree: http://people.cs.ksu.edu/~mozes/osdtree.txt
>>>> Pool Definitions: http://people.cs.ksu.edu/~mozes/pools.txt
>>>>
>>>> At the moment, we're dead in the water. I would appreciate any
>>>> pointers to getting this fixed.
>>>>
>>>> --
>>>> Adam Tygart
>>>> Beocat Sysadmin

_______________________________________________
ceph-users mailing list
ceph-users@xxxxxxxxxxxxxx
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com