Hi,

I hit this a few weeks ago; here is the related tracker. You might want to update it to reflect your case and upload logs.

http://tracker.ceph.com/issues/17916

Nick

> -----Original Message-----
> From: ceph-users [mailto:ceph-users-bounces@xxxxxxxxxxxxxx] On Behalf Of Trygve Vea
> Sent: 21 December 2016 20:18
> To: ceph-users <ceph-users@xxxxxxxx>
> Subject: OSD will not start after heartbeat suicide timeout, assert error from PGLog
>
> Hi,
>
> One of our OSDs has gone into a mode where it throws an assert and dies shortly after it has been started.
>
> The following assert is being thrown:
> https://github.com/ceph/ceph/blob/v10.2.5/src/osd/PGLog.cc#L1036-L1047
>
> --- begin dump of recent events ---
>      0> 2016-12-21 17:05:57.975799 7f1d91d59800 -1 *** Caught signal (Aborted) ** in thread 7f1d91d59800 thread_name:ceph-osd
>
>  ceph version 10.2.3 (ecc23778eb545d8dd55e2e4735b53cc93f92e65b)
>  1: (()+0x91875a) [0x7f1d9268975a]
>  2: (()+0xf100) [0x7f1d906ba100]
>  3: (gsignal()+0x37) [0x7f1d8ec7c5f7]
>  4: (abort()+0x148) [0x7f1d8ec7dce8]
>  5: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x267) [0x7f1d927866c7]
>  6: (PGLog::read_log(ObjectStore*, coll_t, coll_t, ghobject_t, pg_info_t const&, std::map<eversion_t, hobject_t, std::less<eversion_t>, std::allocator<std::pair<eversion_t const, hobject_t> > >&, PGLog::IndexedLog&, pg_missing_t&, std::basic_ostringstream<char, std::char_traits<char>, std::allocator<char> >&, DoutPrefixProvider const*, std::set<std::string, std::less<std::string>, std::allocator<std::string> >*)+0xdc7) [0x7f1d92371ae7]
>  7: (PG::read_state(ObjectStore*, ceph::buffer::list&)+0x490) [0x7f1d921cf440]
>  8: (OSD::load_pgs()+0x9b6) [0x7f1d92105056]
>  9: (OSD::init()+0x2086) [0x7f1d92117846]
>  10: (main()+0x2c55) [0x7f1d9207b595]
>  11: (__libc_start_main()+0xf5) [0x7f1d8ec68b15]
>  12: (()+0x3549b9) [0x7f1d920c59b9]
>  NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to interpret this.
>
>
> It looks to me like, prior to this, the OSD died after hitting a suicide timeout:
>
> 7fafac213700 time 2016-12-21 16:50:13.038341
> common/HeartbeatMap.cc: 86: FAILED assert(0 == "hit suicide timeout")
>
>  ceph version 10.2.3 (ecc23778eb545d8dd55e2e4735b53cc93f92e65b)
>  1: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x85) [0x7fb001b3c4e5]
>  2: (ceph::HeartbeatMap::_check(ceph::heartbeat_handle_d const*, char const*, long)+0x2e1) [0x7fb001a7bf21]
>  3: (ceph::HeartbeatMap::is_healthy()+0xde) [0x7fb001a7c77e]
>  4: (OSD::handle_osd_ping(MOSDPing*)+0x93f) [0x7fb0014b289f]
>  5: (OSD::heartbeat_dispatch(Message*)+0x3cb) [0x7fb0014b3acb]
>  6: (DispatchQueue::entry()+0x78a) [0x7fb001bfe45a]
>  7: (DispatchQueue::DispatchThread::entry()+0xd) [0x7fb001b17cdd]
>  8: (()+0x7dc5) [0x7fafffa68dc5]
>  9: (clone()+0x6d) [0x7faffe0f3ced]
>  NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to interpret this.
>
>
> These timeouts started to occur occasionally after we upgraded to Jewel. I have saved a dump of the recent events prior to the suicide timeout here: http://employee.tv.situla.bitbit.net/heartbeat_suicide.log
>
>
> If the Ceph project is interested in doing forensics on this, I still have the OSD available in its current state.
>
> My hypothesis is that some kind of inconsistency has occurred as a result of the first assert error.
>
> Is this a bug?
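Since you still have the OSD in its broken state, it may be worth grabbing a copy of the affected PG's log and a full export of the PG with ceph-objectstore-tool before anything else touches it, so there is material for forensics even if you end up rebuilding the OSD. Roughly along these lines, with the OSD stopped; osd.12, PG 3.1f and the output paths below are only placeholder examples, substitute your OSD id and the PG named in the OSD log just before the assert:

    # make sure the OSD is not running (yours already refuses to start)
    systemctl stop ceph-osd@12

    # dump the on-disk PG log that read_log is choking on
    ceph-objectstore-tool --data-path /var/lib/ceph/osd/ceph-12 \
        --journal-path /var/lib/ceph/osd/ceph-12/journal \
        --pgid 3.1f --op log > /tmp/pg-3.1f-log.json

    # take a full export of the PG for safekeeping
    ceph-objectstore-tool --data-path /var/lib/ceph/osd/ceph-12 \
        --journal-path /var/lib/ceph/osd/ceph-12/journal \
        --pgid 3.1f --op export --file /tmp/pg-3.1f.export

I can't say whether that will point at the root cause, but the log dump gives a readable view of the PG log the assert is complaining about, and both files would be useful attachments for the tracker.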
>
> Regards
> --
> Trygve Vea
> Redpill Linpro AS

_______________________________________________
ceph-users mailing list
ceph-users@xxxxxxxxxxxxxx
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com