Re: Luminous RC OSD Crashing

Ashley Merrick <ashley@xxxxxxxxxxxxxx> · Wed, 19 Jul 2017 12:21:07 +0000

I have logged a bug ticket; let me know if you need anything further: http://tracker.ceph.com/issues/20687

From: Ashley Merrick

Sent: Wednesday, 19 July 2017 8:05 PM

To: ceph-users@xxxxxxxx

Subject: RE: Luminous RC OSD Crashing

I also found this error on some of the crashing OSDs:

2017-07-19 12:50:57.587194 7f19348f1700 -1 /build/ceph-12.1.1/src/osd/PrimaryLogPG.cc: In function 'virtual void C_CopyFrom_AsyncReadCb::finish(int)' thread 7f19348f1700 time 2017-07-19 12:50:57.583192
/build/ceph-12.1.1/src/osd/PrimaryLogPG.cc: 7585: FAILED assert(len <= reply_obj.data.length())

ceph version 12.1.1 (f3e663a190bf2ed12c7e3cda288b9a159572c800) luminous (rc)
1: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x102) [0x55f1c67bfe32]
2: (C_CopyFrom_AsyncReadCb::finish(int)+0x131) [0x55f1c63ec9e1]
3: (Context::complete(int)+0x9) [0x55f1c626b8b9]
4: (()+0x79bc70) [0x55f1c650fc70]
5: (ECBackend::kick_reads()+0x48) [0x55f1c651f908]
6: (CallClientContexts::finish(std::pair<RecoveryMessages*, ECBackend::read_result_t&>&)+0x562) [0x55f1c652e162]
7: (ECBackend::complete_read_op(ECBackend::ReadOp&, RecoveryMessages*)+0x7f) [0x55f1c650495f]
8: (ECBackend::handle_sub_read_reply(pg_shard_t, ECSubReadReply&, RecoveryMessages*, ZTracer::Trace const&)+0x1077) [0x55f1c6519da7]
9: (ECBackend::handle_message(boost::intrusive_ptr<OpRequest>)+0x2a6) [0x55f1c651a946]
10: (PrimaryLogPG::do_request(boost::intrusive_ptr<OpRequest>&, ThreadPool::TPHandle&)+0x5e7) [0x55f1c638f667]
11: (OSD::dequeue_op(boost::intrusive_ptr<PG>, boost::intrusive_ptr<OpRequest>, ThreadPool::TPHandle&)+0x3f7) [0x55f1c622fb07]
12: (PGQueueable::RunVis::operator()(boost::intrusive_ptr<OpRequest> const&)+0x57) [0x55f1c648a0a7]
13: (OSD::ShardedOpWQ::_process(unsigned int, ceph::heartbeat_handle_d*)+0x108c) [0x55f1c625b34c]
14: (ShardedThreadPool::shardedthreadpool_worker(unsigned int)+0x93d) [0x55f1c67c5add]
15: (ShardedThreadPool::WorkThreadSharded::entry()+0x10) [0x55f1c67c7d00]
16: (()+0x8064) [0x7f194cf89064]
17: (clone()+0x6d) [0x7f194c07d62d]
NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to interpret this.

--- begin dump of recent events ---
-10000> 2017-07-19 12:50:46.691617 7f194a0ec700  1 -- 172.16.3.3:6806/3482 <== osd.28 172.16.3.4:6800/27027 18606 ==== MOSDECSubOpRead(6.71s2 102354/102344 ECSubRead(tid=605721, to_read={6:8e0c91b4:::rbd_data.61c662238e1f29.000000000000$
-9999> 2017-07-19 12:50:46.692100 7f19330ee700  1 -- 172.16.3.3:6806/3482 --> 172.16.3.4:6800/27027 -- MOSDECSubOpReadReply(6.71s0 102354/102344 ECSubReadReply(tid=605720, attrs_read=0)) v2 -- 0x55f1d5083180 con 0
-9998> 2017-07-19 12:50:46.692388 7f19330ee700  1 -- 172.16.3.3:6806/3482 --> 172.16.3.4:6800/27027 -- MOSDECSubOpReadReply(6.71s0 102354/102344 ECSubReadReply(tid=605721, attrs_read=0)) v2 -- 0x55f2412c1700 con 0
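
For anyone wanting to check whether other OSDs are hitting the same assertion, a rough Python sketch along these lines could scan the OSD logs for `FAILED assert` lines and count each distinct one. The regular expression is only modelled on the single assert line pasted above, and the function names and log path in the usage comment are assumptions, so treat it as a starting point rather than a tested tool:

```python
import re
import sys
from collections import Counter

# Modelled on the one assert line in this thread, e.g.:
#   /build/ceph-12.1.1/src/osd/PrimaryLogPG.cc: 7585: FAILED assert(len <= reply_obj.data.length())
# Other Ceph versions may format this line differently (assumption).
ASSERT_RE = re.compile(
    r"(?P<file>\S+): (?P<line>\d+): FAILED assert\((?P<cond>.*)\)\s*$"
)


def find_failed_asserts(log_lines):
    """Yield (source_file, line_number, condition) for each FAILED assert line."""
    for text in log_lines:
        match = ASSERT_RE.search(text)
        if match:
            yield match.group("file"), int(match.group("line")), match.group("cond")


def summarize(log_lines):
    """Count how many times each distinct assertion appears."""
    return Counter(find_failed_asserts(log_lines))


if __name__ == "__main__":
    # Hypothetical usage: python find_asserts.py /var/log/ceph/ceph-osd.*.log
    for path in sys.argv[1:]:
        with open(path, errors="replace") as handle:
            for (src, line, cond), count in summarize(handle).items():
                print(f"{path}: {src}:{line} assert({cond}) x{count}")
```

Running it over each OSD's log should show whether the crashes all come from the same assertion in PrimaryLogPG.cc or from several different ones.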

Ashley

From: Ashley Merrick

Sent: Wednesday, 19 July 2017 7:08 PM

To: Ashley Merrick <ashley@xxxxxxxxxxxxxx>; ceph-users@xxxxxxxx

Subject: RE: Luminous RC OSD Crashing

I have just found: http://tracker.ceph.com/issues/20167

It looks to be the same error in an earlier release (12.0.2-1883-gb3f5819). It was marked as resolved one month ago by Sage, but I am unable to see how or by which change. I would guess this fix would have made it into the latest RC?

Ashley

From: ceph-users [mailto:ceph-users-bounces@xxxxxxxxxxxxxx] On Behalf Of Ashley Merrick

Sent: Wednesday, 19 July 2017 5:47 PM

To: ceph-users@xxxxxxxx

Subject: Luminous RC OSD Crashing

Hello,

I am seeing random OSDs crash during backfill/rebuild on the latest RC. From the logs so far I have seen the following:

172.16.3.10:6802/21760 --> 172.16.3.6:6808/15997 -- pg_update_log_missing(6.19ds12 epoch 101931/101928 rep_tid 59 entries 101931'55683 (0'0) error    6:b984d72a:::rbd_data.a1d870238e1f29.0000000000007c0b:head by client.30604127.0:31963 0.000000 -2) v2 -- 0x55bea0faefc0 con 0

log_channel(cluster) log [ERR] : 4.11c required past_interval bounds are empty [101500,100085) but past_intervals is not: ([90726,100084...0083] acting 28)

failed to decode message of type 70 v3: buffer::malformed_input: void osd_peer_stat_t::decode(ceph::buffer::list::iterator&) no longer u...1 < struct_compat

Let me know if you need anything else.

Ashley
