OSD will not start after heartbeat suicide timeout, assert error from PGLog

Trygve Vea <trygve.vea@xxxxxxxxxxxxxxxxxx> · Wed, 21 Dec 2016 21:18:11 +0100 (CET)

Hi,

One of our OSDs has gone into a state where it throws an assert and dies shortly after being started.

The following assert is being thrown:
https://github.com/ceph/ceph/blob/v10.2.5/src/osd/PGLog.cc#L1036-L1047

--- begin dump of recent events ---
     0> 2016-12-21 17:05:57.975799 7f1d91d59800 -1 *** Caught signal (Aborted) **
 in thread 7f1d91d59800 thread_name:ceph-osd

 ceph version 10.2.3 (ecc23778eb545d8dd55e2e4735b53cc93f92e65b)
 1: (()+0x91875a) [0x7f1d9268975a]
 2: (()+0xf100) [0x7f1d906ba100]
 3: (gsignal()+0x37) [0x7f1d8ec7c5f7]
 4: (abort()+0x148) [0x7f1d8ec7dce8]
 5: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x267) [0x7f1d927866c7]
 6: (PGLog::read_log(ObjectStore*, coll_t, coll_t, ghobject_t, pg_info_t const&, std::map<eversion_t, hobject_t, std::less<eversion_t>, std::allocator<std::pair<eversion_t const, hobject_t> > >&, PGLog::IndexedLog&, pg_missing_t&, std::basic_ostringstream<char, std::char_traits<char>, std::allocator<char> >&, DoutPrefixProvider const*, std::set<std::string, std::less<std::string>, std::allocator<std::string> >*)+0xdc7) [0x7f1d92371ae7]
 7: (PG::read_state(ObjectStore*, ceph::buffer::list&)+0x490) [0x7f1d921cf440]
 8: (OSD::load_pgs()+0x9b6) [0x7f1d92105056]
 9: (OSD::init()+0x2086) [0x7f1d92117846]
 10: (main()+0x2c55) [0x7f1d9207b595]
 11: (__libc_start_main()+0xf5) [0x7f1d8ec68b15]
 12: (()+0x3549b9) [0x7f1d920c59b9]
 NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to interpret this.

It looks to me like, prior to this, the OSD died after hitting a suicide timeout:

7fafac213700 time 2016-12-21 16:50:13.038341
common/HeartbeatMap.cc: 86: FAILED assert(0 == "hit suicide timeout")

 ceph version 10.2.3 (ecc23778eb545d8dd55e2e4735b53cc93f92e65b)
 1: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x85) [0x7fb001b3c4e5]
 2: (ceph::HeartbeatMap::_check(ceph::heartbeat_handle_d const*, char const*, long)+0x2e1) [0x7fb001a7bf21]
 3: (ceph::HeartbeatMap::is_healthy()+0xde) [0x7fb001a7c77e]
 4: (OSD::handle_osd_ping(MOSDPing*)+0x93f) [0x7fb0014b289f]
 5: (OSD::heartbeat_dispatch(Message*)+0x3cb) [0x7fb0014b3acb]
 6: (DispatchQueue::entry()+0x78a) [0x7fb001bfe45a]
 7: (DispatchQueue::DispatchThread::entry()+0xd) [0x7fb001b17cdd]
 8: (()+0x7dc5) [0x7fafffa68dc5]
 9: (clone()+0x6d) [0x7faffe0f3ced]
 NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to interpret this.
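
For readers unfamiliar with the mechanism behind the trace above, here is a minimal sketch of how a heartbeat check with a grace period and a suicide deadline can behave. This is a simplified Python model and not Ceph's code: the names HeartbeatHandle, reset and check, and the grace values used in the usage note, are assumptions made up for illustration.

```python
class HeartbeatHandle:
    """Hypothetical stand-in for one worker thread's heartbeat record."""

    def __init__(self, name, grace, suicide_grace):
        self.name = name
        self.grace = grace                  # seconds before the worker counts as unhealthy
        self.suicide_grace = suicide_grace  # seconds before the process aborts
        self.timeout = 0.0                  # 0 means "not armed"
        self.suicide_timeout = 0.0

    def reset(self, now):
        """Called by the worker whenever it makes progress."""
        self.timeout = now + self.grace
        self.suicide_timeout = now + self.suicide_grace


def check(handle, now):
    """Return True if the worker is healthy, False if it is merely late.

    Raises AssertionError once the suicide deadline has passed, which
    models the daemon aborting itself.
    """
    healthy = True
    if handle.timeout and now > handle.timeout:
        healthy = False  # late, but still tolerated
    if handle.suicide_timeout and now > handle.suicide_timeout:
        raise AssertionError("hit suicide timeout")
    return healthy
```

With illustrative values grace=15 and suicide_grace=150, a worker that last reported at t=0 is healthy at t=10, unhealthy at t=20, and causes an abort at t=200. The point of the sketch is only that the abort comes from a watchdog check on a stalled thread, so the process can die at an arbitrary point in whatever that thread was doing.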

These timeouts began occurring occasionally after we upgraded to Jewel.  I have saved a dump of the recent events leading up to the suicide timeout here: http://employee.tv.situla.bitbit.net/heartbeat_suicide.log

If the Ceph project is interested in doing forensics on this, I still have the OSD available in its current state.

My hypothesis is that the first assert error (the suicide timeout) left some kind of inconsistency behind, and that this is what now triggers the PGLog assert on startup.

Is this a bug?

Regards
-- 
Trygve Vea
Redpill Linpro AS