Hi,

I hit this a few weeks ago; here is the related tracker. You might want to update it to reflect your case and upload logs.

http://tracker.ceph.com/issues/17916

Nick

> -----Original Message-----
> From: ceph-users [mailto:ceph-users-bounces@xxxxxxxxxxxxxx] On Behalf Of Trygve Vea
> Sent: 21 December 2016 20:18
> To: ceph-users <ceph-users@xxxxxxxx>
> Subject: OSD will not start after heartbeat suicide timeout, assert error from PGLog
>
> Hi,
>
> One of our OSDs has gone into a mode where it throws an assert and dies shortly after it has been started.
>
> The following assert is being thrown:
> https://github.com/ceph/ceph/blob/v10.2.5/src/osd/PGLog.cc#L1036-L1047
>
> --- begin dump of recent events ---
>      0> 2016-12-21 17:05:57.975799 7f1d91d59800 -1 *** Caught signal (Aborted) ** in thread 7f1d91d59800 thread_name:ceph-osd
>
>  ceph version 10.2.3 (ecc23778eb545d8dd55e2e4735b53cc93f92e65b)
>  1: (()+0x91875a) [0x7f1d9268975a]
>  2: (()+0xf100) [0x7f1d906ba100]
>  3: (gsignal()+0x37) [0x7f1d8ec7c5f7]
>  4: (abort()+0x148) [0x7f1d8ec7dce8]
>  5: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x267) [0x7f1d927866c7]
>  6: (PGLog::read_log(ObjectStore*, coll_t, coll_t, ghobject_t, pg_info_t const&, std::map<eversion_t, hobject_t, std::less<eversion_t>, std::allocator<std::pair<eversion_t const, hobject_t> > >&, PGLog::IndexedLog&, pg_missing_t&, std::basic_ostringstream<char, std::char_traits<char>, std::allocator<char> >&, DoutPrefixProvider const*, std::set<std::string, std::less<std::string>, std::allocator<std::string> >*)+0xdc7) [0x7f1d92371ae7]
>  7: (PG::read_state(ObjectStore*, ceph::buffer::list&)+0x490) [0x7f1d921cf440]
>  8: (OSD::load_pgs()+0x9b6) [0x7f1d92105056]
>  9: (OSD::init()+0x2086) [0x7f1d92117846]
>  10: (main()+0x2c55) [0x7f1d9207b595]
>  11: (__libc_start_main()+0xf5) [0x7f1d8ec68b15]
>  12: (()+0x3549b9) [0x7f1d920c59b9]
>  NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to interpret this.
>
>
> It looks to me like, prior to this, the OSD died after hitting a suicide timeout:
>
> 7fafac213700 time 2016-12-21 16:50:13.038341
> common/HeartbeatMap.cc: 86: FAILED assert(0 == "hit suicide timeout")
>
>  ceph version 10.2.3 (ecc23778eb545d8dd55e2e4735b53cc93f92e65b)
>  1: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x85) [0x7fb001b3c4e5]
>  2: (ceph::HeartbeatMap::_check(ceph::heartbeat_handle_d const*, char const*, long)+0x2e1) [0x7fb001a7bf21]
>  3: (ceph::HeartbeatMap::is_healthy()+0xde) [0x7fb001a7c77e]
>  4: (OSD::handle_osd_ping(MOSDPing*)+0x93f) [0x7fb0014b289f]
>  5: (OSD::heartbeat_dispatch(Message*)+0x3cb) [0x7fb0014b3acb]
>  6: (DispatchQueue::entry()+0x78a) [0x7fb001bfe45a]
>  7: (DispatchQueue::DispatchThread::entry()+0xd) [0x7fb001b17cdd]
>  8: (()+0x7dc5) [0x7fafffa68dc5]
>  9: (clone()+0x6d) [0x7faffe0f3ced]
>  NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to interpret this.
>
>
> These timeouts started to occur occasionally after we upgraded to Jewel. I have saved a dump of the recent events prior to the suicide timeout here: http://employee.tv.situla.bitbit.net/heartbeat_suicide.log
>
>
> If the Ceph project is interested in doing forensics on this, I still have the OSD available in its current state.
>
> My hypothesis is that some kind of inconsistency has occurred as a result of the first assert error.
>
> Is this a bug?
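Since you still have the OSD in its broken state, it may be worth grabbing a copy of the affected PG's log and a full export of the PG with ceph-objectstore-tool before anything else touches it, so there is material for forensics even if you end up rebuilding the OSD. Roughly along these lines, with the OSD stopped; osd.12, PG 3.1f and the output paths below are only placeholder examples, substitute your OSD id and the PG named in the OSD log just before the assert:

    # make sure the OSD is not running (yours already refuses to start)
    systemctl stop ceph-osd@12

    # dump the on-disk PG log that read_log is choking on
    ceph-objectstore-tool --data-path /var/lib/ceph/osd/ceph-12 \
        --journal-path /var/lib/ceph/osd/ceph-12/journal \
        --pgid 3.1f --op log > /tmp/pg-3.1f-log.json

    # take a full export of the PG for safekeeping
    ceph-objectstore-tool --data-path /var/lib/ceph/osd/ceph-12 \
        --journal-path /var/lib/ceph/osd/ceph-12/journal \
        --pgid 3.1f --op export --file /tmp/pg-3.1f.export

I can't say whether that will point at the root cause, but the log dump gives a readable view of the PG log the assert is complaining about, and both files would be useful attachments for the tracker.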
>
> Regards
> --
> Trygve Vea
> Redpill Linpro AS

_______________________________________________
ceph-users mailing list
ceph-users@xxxxxxxxxxxxxx
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com