Also found this error on some of the OSDs crashing:

2017-07-19 12:50:57.587194 7f19348f1700 -1 /build/ceph-12.1.1/src/osd/PrimaryLogPG.cc: In function 'virtual void C_CopyFrom_AsyncReadCb::finish(int)' thread 7f19348f1700 time 2017-07-19 12:50:57.583192
/build/ceph-12.1.1/src/osd/PrimaryLogPG.cc: 7585: FAILED assert(len <= reply_obj.data.length())

 ceph version 12.1.1 (f3e663a190bf2ed12c7e3cda288b9a159572c800) luminous (rc)
 1: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x102) [0x55f1c67bfe32]
 2: (C_CopyFrom_AsyncReadCb::finish(int)+0x131) [0x55f1c63ec9e1]
 3: (Context::complete(int)+0x9) [0x55f1c626b8b9]
 4: (()+0x79bc70) [0x55f1c650fc70]
 5: (ECBackend::kick_reads()+0x48) [0x55f1c651f908]
 6: (CallClientContexts::finish(std::pair<RecoveryMessages*, ECBackend::read_result_t&>&)+0x562) [0x55f1c652e162]
 7: (ECBackend::complete_read_op(ECBackend::ReadOp&, RecoveryMessages*)+0x7f) [0x55f1c650495f]
 8: (ECBackend::handle_sub_read_reply(pg_shard_t, ECSubReadReply&, RecoveryMessages*, ZTracer::Trace const&)+0x1077) [0x55f1c6519da7]
 9: (ECBackend::handle_message(boost::intrusive_ptr<OpRequest>)+0x2a6) [0x55f1c651a946]
 10: (PrimaryLogPG::do_request(boost::intrusive_ptr<OpRequest>&, ThreadPool::TPHandle&)+0x5e7) [0x55f1c638f667]
 11: (OSD::dequeue_op(boost::intrusive_ptr<PG>, boost::intrusive_ptr<OpRequest>, ThreadPool::TPHandle&)+0x3f7) [0x55f1c622fb07]
 12: (PGQueueable::RunVis::operator()(boost::intrusive_ptr<OpRequest> const&)+0x57) [0x55f1c648a0a7]
 13: (OSD::ShardedOpWQ::_process(unsigned int, ceph::heartbeat_handle_d*)+0x108c) [0x55f1c625b34c]
 14: (ShardedThreadPool::shardedthreadpool_worker(unsigned int)+0x93d) [0x55f1c67c5add]
 15: (ShardedThreadPool::WorkThreadSharded::entry()+0x10) [0x55f1c67c7d00]
 16: (()+0x8064) [0x7f194cf89064]
 17: (clone()+0x6d) [0x7f194c07d62d]
 NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to interpret this.

--- begin dump of recent events ---
 -10000> 2017-07-19 12:50:46.691617 7f194a0ec700 1 -- 172.16.3.3:6806/3482 <== osd.28 172.16.3.4:6800/27027 18606 ==== MOSDECSubOpRead(6.71s2 102354/102344 ECSubRead(tid=605721, to_read={6:8e0c91b4:::rbd_data.61c662238e1f29.000000000000$
  -9999> 2017-07-19 12:50:46.692100 7f19330ee700 1 -- 172.16.3.3:6806/3482 --> 172.16.3.4:6800/27027 -- MOSDECSubOpReadReply(6.71s0 102354/102344 ECSubReadReply(tid=605720, attrs_read=0)) v2 -- 0x55f1d5083180 con 0
  -9998> 2017-07-19 12:50:46.692388 7f19330ee700 1 -- 172.16.3.3:6806/3482 --> 172.16.3.4:6800/27027 -- MOSDECSubOpReadReply(6.71s0 102354/102344 ECSubReadReply(tid=605721, attrs_read=0)) v2 -- 0x55f2412c1700 con 0

,Ashley
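For anyone skimming the trace: as far as I can tell, the assert means the copy-from read callback got back less data from the EC backend read than the `len` it asked for. A rough standalone illustration of that invariant follows; only `len` and the data buffer come from the assert text itself, everything else is made up for the example and is not the Ceph source:

    #include <cassert>
    #include <cstdint>
    #include <string>

    // Stand-in for the reply object the copy-from callback fills in:
    // the bytes actually returned by the (EC) async read.
    struct reply_obj_t {
      std::string data;
    };

    void finish_copy_read(uint64_t len, const reply_obj_t &reply_obj) {
      // The callback assumes the backend returned at least the `len` bytes
      // it requested before trimming and encoding the copy-get reply.
      assert(len <= reply_obj.data.length());   // the check failing in the trace
    }

    int main() {
      reply_obj_t r{std::string(4096, 'x')};
      finish_copy_read(4096, r);    // ok: read returned everything requested
      // finish_copy_read(8192, r); // would hit the same assert: short read
      return 0;
    }

In other words, the EC sub-read replies visible in the recent-events dump above appear to be carrying less data than the copy-from operation expected.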
From: Ashley Merrick

I have just found:
http://tracker.ceph.com/issues/20167

It looks to be the same error in an earlier release (12.0.2-1883-gb3f5819). The issue was marked as resolved a month ago by Sage, but I am unable to see how or by what. I would guess that fix would have made it into the latest RC?

,Ashley

From: ceph-users [mailto:ceph-users-bounces@xxxxxxxxxxxxxx]
On Behalf Of Ashley Merrick

Hello,

Getting the following on random OSDs crashing during a backfill/rebuild on the latest RC. From the logs so far I have seen the following:

172.16.3.10:6802/21760 --> 172.16.3.6:6808/15997 -- pg_update_log_missing(6.19ds12 epoch 101931/101928 rep_tid 59 entries 101931'55683 (0'0) error 6:b984d72a:::rbd_data.a1d870238e1f29.0000000000007c0b:head by client.30604127.0:31963 0.000000 -2) v2 -- 0x55bea0faefc0 con 0

log_channel(cluster) log [ERR] : 4.11c required past_interval bounds are empty [101500,100085) but past_intervals is not: ([90726,100084...0083] acting 28)

failed to decode message of type 70 v3: buffer::malformed_input: void osd_peer_stat_t::decode(ceph::buffer::list::iterator&) no longer u...1 < struct_compat

Let me know if you need anything else.

,Ashley
_______________________________________________ ceph-users mailing list ceph-users@xxxxxxxxxxxxxx http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com