I've been attempting to work through this: finding the PGs that are causing hangs, determining whether they are "safe" to remove, and removing them with ceph-objectstore-tool on osd.16. I'm now getting hangs (followed by suicide timeouts) referencing PGs that I've just removed, so this doesn't seem to be all there is to the issue.

--
Adam

On Wed, Jun 1, 2016 at 12:00 PM, Brandon Morris, PMP <brandon.morris.pmp@xxxxxxxxx> wrote:
> Adam,
>
> We ran into similar issues when we got too many objects in a bucket (around 300 million). The .rgw.buckets.index pool became unable to complete backfill operations. The only way we were able to get past it was to export the offending placement group with ceph-objectstore-tool and re-import it into another OSD to complete the backfill (a rough outline of the commands is below, after the log excerpt). For us, the export operation seemed to hang and took 8 hours to complete, so if you do choose to go down this route, be patient.
>
> From your logs, it appears that pg 32.10c is the offending PG on osd.16. If you are running into the same issue we did, there will be a file that hangs when you go to export it. For whatever reason the leveldb metadata for that file hangs and causes the backfill operation to suicide the OSD.
>
> If anyone from the community has an explanation for why this happens, I would love to know. We have run into this twice now on the Infernalis codebase. We are in the process of rebuilding our cluster on Jewel, so I can't say whether or not it happens there as well.
>
> ---------
> Here are the pertinent lines from your log:
>
> 2016-06-01 09:26:54.683922 7f34c5e41700  7 osd.16 pg_epoch: 497663 pg[32.10c( v 477010'1607561 (459778'1604561,477010'1607561] local-les=493771 n=3917 ec=44014 les/c/f 493771/486667/0 497332/497662/497662) [214,143,448]/[16] r=0 lpr=497662 pi=483321-497661/190 rops=1 bft=143,214,448 crt=0'0 lcod 0'0 mlcod 0'0 undersized+degraded+remapped+backfilling+peered] send_push_op 32:30966cd6:::100042c76a0.00000000:head v 250315'1040233 size 0 recovery_info: ObjectRecoveryInfo(32:30966cd6:::100042c76a0.00000000:head@250315'1040233, size: 0, copy_subset: [], clone_subset: {})
> [...]
> 2016-06-01 09:27:25.091411 7f34cf856700  1 heartbeat_map is_healthy 'OSD::recovery_tp thread 0x7f34c5e41700' had timed out after 30
> [...]
> 2016-06-01 09:31:57.201645 7f3510669700  1 heartbeat_map is_healthy 'OSD::recovery_tp thread 0x7f34c5e41700' had timed out after 30
> 2016-06-01 09:31:57.201671 7f3510669700  1 heartbeat_map is_healthy 'OSD::recovery_tp thread 0x7f34c5e41700' had suicide timed out after 300
> common/HeartbeatMap.cc: In function 'bool ceph::HeartbeatMap::_check(const ceph::heartbeat_handle_d*, const char*, time_t)' thread 7f3510669700 time 2016-06-01 09:31:57.201687
> common/HeartbeatMap.cc: 86: FAILED assert(0 == "hit suicide timeout")
>  ceph version 10.2.1 (3a66dd4f30852819c1bdaa8ec23c795d4ad77269)
>  1: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x85) [0x7f35167bb5b5]
>  2: (ceph::HeartbeatMap::_check(ceph::heartbeat_handle_d const*, char const*, long)+0x2e1) [0x7f35166f7bf1]
>  3: (ceph::HeartbeatMap::is_healthy()+0xde) [0x7f35166f844e]
>  4: (ceph::HeartbeatMap::check_touch_file()+0x2c) [0x7f35166f8c2c]
>  5: (CephContextServiceThread::entry()+0x15b) [0x7f35167d331b]
>  6: (()+0x7dc5) [0x7f35146ecdc5]
>  7: (clone()+0x6d) [0x7f3512d77ced]
>  NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to interpret this.
> 2016-06-01 09:31:57.205990 7f3510669700 -1 common/HeartbeatMap.cc: In function 'bool ceph::HeartbeatMap::_check(const ceph::heartbeat_handle_d*, const char*, time_t)' thread 7f3510669700 time 2016-06-01 09:31:57.201687
> common/HeartbeatMap.cc: 86: FAILED assert(0 == "hit suicide timeout")
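>
> In case you do go the export/import route, the sequence we used was roughly the one below. Treat it as a sketch rather than exact commands: the OSD ids, paths, and export file name here are only examples, and the OSD you run ceph-objectstore-tool against has to be stopped first.
>
>     # On the host holding the stuck copy (osd.16 here), with the OSD stopped:
>     ceph-objectstore-tool --data-path /var/lib/ceph/osd/ceph-16 \
>         --journal-path /var/lib/ceph/osd/ceph-16/journal \
>         --pgid 32.10c --op export --file /root/pg.32.10c.export
>
>     # Once the export finally completes, remove that copy from the source OSD:
>     ceph-objectstore-tool --data-path /var/lib/ceph/osd/ceph-16 \
>         --journal-path /var/lib/ceph/osd/ceph-16/journal \
>         --pgid 32.10c --op remove
>
>     # On another OSD (also stopped; osd.NN is a placeholder), import the PG,
>     # then start the OSD and let peering/backfill finish:
>     ceph-objectstore-tool --data-path /var/lib/ceph/osd/ceph-NN \
>         --journal-path /var/lib/ceph/osd/ceph-NN/journal \
>         --op import --file /root/pg.32.10c.export
>
> Keep a copy of the export file somewhere safe, and make sure you know where the surviving replicas of that PG live before removing anything.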
>
> Brandon
>
> On Wed, Jun 1, 2016 at 9:13 AM, Adam Tygart <mozes@xxxxxxx> wrote:
>>
>> Hello all,
>>
>> I'm running into an issue with ceph OSDs crashing over the last 4 days. I'm running Jewel (10.2.1) on CentOS 7.2.1511.
>>
>> A little setup information:
>> 26 hosts
>> 2x 400GB Intel DC P3700 SSDs
>> 12x 6TB spinning disks
>> 4x 4TB spinning disks
>>
>> The SSDs are used both for journals and as OSDs (for the cephfs metadata pool).
>>
>> We had been running Ceph with some success in this configuration (upgrading from Hammer to Infernalis to Jewel) for the past 8-10 months.
>>
>> Up through Friday, we were healthy.
>>
>> Until Saturday. On Saturday, the OSDs on the SSDs started flapping and then finally dying off, hitting their suicide timeouts due to missed heartbeats. At the time, we were running Infernalis, getting ready to upgrade to Jewel.
>>
>> I spent the weekend and Monday attempting to stabilize those OSDs, unfortunately failing. As part of the stabilization attempts, I checked iostat -x; the SSDs were seeing 1000 IOPS each. I checked wear levels and overall SMART health of the SSDs; everything looks normal. I checked to make sure the time was in sync between all hosts.
>>
>> I also tried to move the metadata pool to the spinning disks (to remove some dependence on the SSDs, just in case). The suicide-timeout issues followed the pool migration: the spinning disks started timing out. This was at a time when *all* of the client IOPS to the ceph cluster were in the low 100s, as reported by ceph -s. I was restarting failed OSDs as fast as they were dying and I couldn't keep up. I checked the switches and NICs for errors and drops. No change in their frequency (we're talking an error every 20-25 minutes), and I would expect network issues to affect other OSDs (and pools) in the system, too.
>>
>> On Tuesday, I got together with my coworker, and we tried to stabilize the cluster together. We finally went into emergency maintenance mode, as we could not get the metadata pool healthy. We stopped the MDS and tried again to let things stabilize, with no client IO to the pool. Again, more suicide timeouts.
>>
>> Then we rebooted the ceph nodes, figuring there *might* be something stuck in a hardware IO queue or cache somewhere. Again, more crashes when the machines came back up.
>>
>> We figured at this point there was nothing to lose by performing the update to Jewel, and, who knows, maybe we were hitting a bug that had been fixed. Reboots were involved again (kernel updates, too).
>>
>> More crashes.
>>
>> I finally decided that there *could* be an unlikely chance that jumbo frames might suddenly be an issue (after years of using them with these switches). I turned the MTUs on the ceph nodes down to the standard 1500.
>>
>> More crashes.
>>
>> We decided to try to let things settle out overnight, with no IO. That brings us to today:
>>
>> We have 51 Intel P3700 SSDs driving this pool, and now 26 of them have crashed due to the suicide timeout. I've tried starting them one at a time; they're still dying off with suicide timeouts.
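>>
>> For reference, the timeouts in the OSD logs look like the recovery-thread defaults ("had timed out after 30", "had suicide timed out after 300"). If I'm reading the config options right, those correspond to osd_recovery_thread_timeout and osd_recovery_thread_suicide_timeout. If anyone knows whether it's sane to raise them temporarily, just to keep OSDs alive long enough for backfill to make progress, I'd be interested. I'm picturing something like the following in ceph.conf on the OSD nodes (values are only illustrative):
>>
>>     [osd]
>>         osd recovery thread timeout = 120            # default 30 seconds
>>         osd recovery thread suicide timeout = 1800   # default 300 seconds
>>
>> I realize that would only paper over whatever is making individual recovery ops take minutes, so I'd still like to understand the root cause.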
>>
>> I've gathered the logs I could think of:
>> A crashing OSD: http://people.cs.ksu.edu/~mozes/osd.16.log
>> CRUSH tree: http://people.cs.ksu.edu/~mozes/crushtree.txt
>> OSD tree: http://people.cs.ksu.edu/~mozes/osdtree.txt
>> Pool definitions: http://people.cs.ksu.edu/~mozes/pools.txt
>>
>> At the moment, we're dead in the water. I would appreciate any pointers to getting this fixed.
>>
>> --
>> Adam Tygart
>> Beocat Sysadmin
>> _______________________________________________
>> ceph-users mailing list
>> ceph-users@xxxxxxxxxxxxxx
>> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>
>
_______________________________________________
ceph-users mailing list
ceph-users@xxxxxxxxxxxxxx
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com