Hi Goncalo,

Thank you for your response. I had already found that issue but it does
not apply to my situation. The timezones are correct and I'm running a
pure hammer cluster.

- HP

On Sat, 2016-08-13 at 12:23 +0000, Goncalo Borges wrote:
> Hi HP.
>
> I am just a site admin so my opinion should be validated by proper
> support staff
>
> Seems really similar to
> http://tracker.ceph.com/issues/14399
>
> The ticket speaks about timezone difference between osds. Maybe it is
> something worthwhile to check?
>
> Cheers
> Goncalo
>
> ________________________________________
> From: ceph-users [ceph-users-bounces@xxxxxxxxxxxxxx] on behalf of
> Hein-Pieter van Braam [hp@xxxxxx]
> Sent: 13 August 2016 21:48
> To: ceph-users
> Subject: Cascading failure on a placement group
>
> Hello all,
>
> My cluster started to lose OSDs without any warning, whenever an OSD
> becomes the primary for a particular PG it crashes with the following
> stacktrace:
>
>  ceph version 0.94.7 (d56bdf93ced6b80b07397d57e3fa68fe68304432)
>  1: /usr/bin/ceph-osd() [0xada722]
>  2: (()+0xf100) [0x7fc28bca5100]
>  3: (gsignal()+0x37) [0x7fc28a6bd5f7]
>  4: (abort()+0x148) [0x7fc28a6bece8]
>  5: (__gnu_cxx::__verbose_terminate_handler()+0x165) [0x7fc28afc29d5]
>  6: (()+0x5e946) [0x7fc28afc0946]
>  7: (()+0x5e973) [0x7fc28afc0973]
>  8: (()+0x5eb93) [0x7fc28afc0b93]
>  9: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x27a) [0xbddcba]
>  10: (ReplicatedPG::hit_set_trim(ReplicatedPG::RepGather*, unsigned int)+0x75f) [0x87e48f]
>  11: (ReplicatedPG::hit_set_persist()+0xedb) [0x87f4ab]
>  12: (ReplicatedPG::do_op(std::tr1::shared_ptr<OpRequest>&)+0xe3a) [0x8a0d1a]
>  13: (ReplicatedPG::do_request(std::tr1::shared_ptr<OpRequest>&, ThreadPool::TPHandle&)+0x68a) [0x83be4a]
>  14: (OSD::dequeue_op(boost::intrusive_ptr<PG>, std::tr1::shared_ptr<OpRequest>, ThreadPool::TPHandle&)+0x405) [0x69a5c5]
>  15: (OSD::ShardedOpWQ::_process(unsigned int, ceph::heartbeat_handle_d*)+0x333) [0x69ab33]
>  16: (ShardedThreadPool::shardedthreadpool_worker(unsigned int)+0x86f) [0xbcd1cf]
>  17: (ShardedThreadPool::WorkThreadSharded::entry()+0x10) [0xbcf300]
>  18: (()+0x7dc5) [0x7fc28bc9ddc5]
>  19: (clone()+0x6d) [0x7fc28a77eced]
>  NOTE: a copy of the executable, or `objdump -rdS <executable>` is
>  needed to interpret this.
>
> Has anyone ever seen this? Is there a way to fix this? My cluster is in
> rather large disarray at the moment. I have one of the OSDs now in a
> restart loop and that is at least preventing other OSDs from going
> down, but obviously not all other PGs can peer now.
>
> I'm not sure what else to do at the moment.
>
> Thank you so much,
>
> - HP
> _______________________________________________
> ceph-users mailing list
> ceph-users@xxxxxxxxxxxxxx
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com