Hi HP,

I am just a site admin, so my opinion should be validated by proper support staff, but this seems really similar to http://tracker.ceph.com/issues/14399. That ticket speaks about a timezone difference between OSDs. Maybe it is something worthwhile to check?

Cheers,
Goncalo

________________________________________
From: ceph-users [ceph-users-bounces@xxxxxxxxxxxxxx] on behalf of Hein-Pieter van Braam [hp@xxxxxx]
Sent: 13 August 2016 21:48
To: ceph-users
Subject: Cascading failure on a placement group

Hello all,

My cluster started to lose OSDs without any warning; whenever an OSD becomes the primary for a particular PG, it crashes with the following stack trace:

ceph version 0.94.7 (d56bdf93ced6b80b07397d57e3fa68fe68304432)
 1: /usr/bin/ceph-osd() [0xada722]
 2: (()+0xf100) [0x7fc28bca5100]
 3: (gsignal()+0x37) [0x7fc28a6bd5f7]
 4: (abort()+0x148) [0x7fc28a6bece8]
 5: (__gnu_cxx::__verbose_terminate_handler()+0x165) [0x7fc28afc29d5]
 6: (()+0x5e946) [0x7fc28afc0946]
 7: (()+0x5e973) [0x7fc28afc0973]
 8: (()+0x5eb93) [0x7fc28afc0b93]
 9: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x27a) [0xbddcba]
 10: (ReplicatedPG::hit_set_trim(ReplicatedPG::RepGather*, unsigned int)+0x75f) [0x87e48f]
 11: (ReplicatedPG::hit_set_persist()+0xedb) [0x87f4ab]
 12: (ReplicatedPG::do_op(std::tr1::shared_ptr<OpRequest>&)+0xe3a) [0x8a0d1a]
 13: (ReplicatedPG::do_request(std::tr1::shared_ptr<OpRequest>&, ThreadPool::TPHandle&)+0x68a) [0x83be4a]
 14: (OSD::dequeue_op(boost::intrusive_ptr<PG>, std::tr1::shared_ptr<OpRequest>, ThreadPool::TPHandle&)+0x405) [0x69a5c5]
 15: (OSD::ShardedOpWQ::_process(unsigned int, ceph::heartbeat_handle_d*)+0x333) [0x69ab33]
 16: (ShardedThreadPool::shardedthreadpool_worker(unsigned int)+0x86f) [0xbcd1cf]
 17: (ShardedThreadPool::WorkThreadSharded::entry()+0x10) [0xbcf300]
 18: (()+0x7dc5) [0x7fc28bc9ddc5]
 19: (clone()+0x6d) [0x7fc28a77eced]
NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to interpret this.

Has anyone ever seen this? Is there a way to fix this? My cluster is in rather large disarray at the moment. I have one of the OSDs in a restart loop now, and that is at least preventing other OSDs from going down, but obviously not all other PGs can peer. I'm not sure what else to do at the moment.

Thank you so much,

- HP
_______________________________________________
ceph-users mailing list
ceph-users@xxxxxxxxxxxxxx
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
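P.S. If you want to check the timezone theory from that ticket, a quick comparison across the OSD hosts might look something like this (a rough sketch only; the host names are placeholders and it assumes ssh access to each node):

    for h in osd-host-1 osd-host-2 osd-host-3; do
        # print each host's short name, timezone abbreviation, and UTC offset
        ssh "$h" 'printf "%s: " "$(hostname -s)"; date +"%Z %z"'
    done

If the timezone or UTC offset differs between hosts, that would fit the assert in ReplicatedPG::hit_set_trim() in the trace above; aligning all hosts (e.g. everything on UTC, synced with NTP) would be the first thing to try. The cache tier's hit set settings can also be inspected with `ceph osd pool get <cache-pool> hit_set_period` and `hit_set_count`, where <cache-pool> is your cache tier pool's name.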