Hi,

We are running Ceph v0.72.2 (emperor) from the ceph emperor repo. In the last week we had 2 random OSD crashes (one during cluster recovery and one while the cluster was healthy) with the same symptom: the OSD process crashes, writes the following trace to its log, and is marked down and out.

We are preparing to upgrade our cluster to firefly, but we would like to know whether this is a known bug that has been fixed in more recent versions, and how to troubleshoot this specific failure. For which subsystems should we raise the debug level to provide more information?

2015-03-16 20:44:18.768488 7f516d4c9700 -1 osd/ReplicatedPG.cc: In function 'void ReplicatedPG::sub_op_modify(OpRequestRef)' thread 7f516d4c9700 time 2015-03-16 20:44:18.764353
osd/ReplicatedPG.cc: 5570: FAILED assert(!pg_log.get_missing().is_missing(soid))

 ceph version 0.72.2 (a913ded2ff138aefb8cb84d347d72164099cfd60)
 1: (ReplicatedPG::sub_op_modify(std::tr1::shared_ptr<OpRequest>)+0xae0) [0x9182c0]
 2: (ReplicatedPG::do_sub_op(std::tr1::shared_ptr<OpRequest>)+0x117) [0x9184f7]
 3: (ReplicatedPG::do_request(std::tr1::shared_ptr<OpRequest>, ThreadPool::TPHandle&)+0x381) [0x8f12a1]
 4: (OSD::dequeue_op(boost::intrusive_ptr<PG>, std::tr1::shared_ptr<OpRequest>, ThreadPool::TPHandle&)+0x316) [0x6f7096]
 5: (OSD::OpWQ::_process(boost::intrusive_ptr<PG>, ThreadPool::TPHandle&)+0x198) [0x70e048]
 6: (ThreadPool::WorkQueueVal<std::pair<boost::intrusive_ptr<PG>, std::tr1::shared_ptr<OpRequest> >, boost::intrusive_ptr<PG> >::_void_process(void*, ThreadPool::TPHandle&)+0xae) [0x7494ce]
 7: (ThreadPool::worker(ThreadPool::WorkThread*)+0x68a) [0xa517fa]
 8: (ThreadPool::WorkThread::entry()+0x10) [0xa52a50]
 9: (()+0x6b50) [0x7f5199f52b50]
 10: (clone()+0x6d) [0x7f519871e70d]
 NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to interpret this.
--- begin dump of recent events ---

Regards,
Kostis
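P.S. To check that two crashes really hit the same assertion, the failed assert can be pulled out of each OSD log and compared. This is only a minimal sketch: the function name `find_failed_asserts` and the regular expression are assumptions based on the trace format quoted in this message, not part of any Ceph tool, and it expects the log with its original line breaks.

```python
import re

# Pattern assumed from the quoted trace: "<file>: <line>: FAILED assert(<expr>)".
# The greedy ".*" stops at the last ")" on the same line, so nested
# parentheses inside the expression are kept intact.
ASSERT_RE = re.compile(
    r"(?P<file>\S+): (?P<line>\d+): FAILED assert\((?P<expr>.*)\)"
)


def find_failed_asserts(log_text):
    """Return a (file, line, expression) tuple for each FAILED assert found."""
    return [
        (m.group("file"), int(m.group("line")), m.group("expr"))
        for m in ASSERT_RE.finditer(log_text)
    ]


if __name__ == "__main__":
    sample = (
        "osd/ReplicatedPG.cc: In function 'void ReplicatedPG::sub_op_modify(OpRequestRef)'\n"
        "osd/ReplicatedPG.cc: 5570: FAILED assert(!pg_log.get_missing().is_missing(soid))\n"
    )
    for source_file, line_no, expr in find_failed_asserts(sample):
        print(source_file, line_no, expr)
```

Running it over the logs of both crashed OSDs and comparing the tuples would show whether the file, line and expression match.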