Hi Greg,
I did some investigation into the issue - http://tracker.ceph.com/issues/10080. Can you help take a look at whether the analysis makes sense?
Thanks,
Guang

________________________________
> Date: Mon, 24 Nov 2014 00:59:05 -0800
> From: redmine@xxxxxxxxxxxxxxxx
> Subject: [Ceph - Bug #10080] Pipe::connect() cause osd crash when osd reconnect to its peer
>
> Issue #10080 has been updated by Guang Yang.
>
> I am wondering if the following race occurred:
>
> Let us assume A and B are two OSDs with a connection (pipe) between each other.
>
> 1. B issued a re-connection for whatever reason, and at the same time, A marked the connection down and destroyed its Pipe to B.
> 2. Let us assume B has: cs = 100, in_seq = 500.
> 3. The connection is established with cs = 101.
> 4. For whatever reason, A hit a failure during read and issued a new connection request with cs = 102; since the Pipe is brand new, A has out_seq = 0 and cs = 102.
> 5. B accepted the connection request and responded with in_seq = 500 (already wrong here).
> 6. A compared the returned in_seq with its internal out_seq and out_q, and crashed with an assertion failure.
>
> If this is the case, it seems one step was missed during seq negotiation: when B handles the new connection and detects that A has reset, it should reset its in_seq as well.
>
> Thanks,
> Guang
>
> ________________________________
> Bug #10080: Pipe::connect() cause osd crash when osd reconnect to its peer<http://tracker.ceph.com/issues/10080#change-44767>
>
> * Author: Wenjun Huang
> * Status: New
> * Priority: High
> * Assignee:
> * Category: msgr
> * Target version:
> * Source: Community (user)
> * Backport:
> * Tags:
> * Severity: 3 - minor
> * Reviewed:
> * Suite:
>
> When our cluster load is heavy, the osd sometimes crashes. The critical log is as below:
>
> -278> 2014-08-20 11:04:28.609192 7f89636c8700 10 osd.11 783 OSD::ms_get_authorizer type=osd
> -277> 2014-08-20 11:04:28.609783 7f89636c8700  2 -- 10.193.207.117:6816/44281 >> 10.193.207.125:6804/2022817 pipe(0x7ef2280 sd=105 :42657 s=1 pgs=236754 cs=4 l=0 c=0x44318c0). got newly_acked_seq 546 vs out_seq 0
> -276> 2014-08-20 11:04:28.609810 7f89636c8700  2 -- 10.193.207.117:6816/44281 >> 10.193.207.125:6804/2022817 pipe(0x7ef2280 sd=105 :42657 s=1 pgs=236754 cs=4 l=0 c=0x44318c0). discarding previously sent 1 osd_map(727..755 src has 1..755) v3
> -275> 2014-08-20 11:04:28.609859 7f89636c8700  2 -- 10.193.207.117:6816/44281 >> 10.193.207.125:6804/2022817 pipe(0x7ef2280 sd=105 :42657 s=1 pgs=236754 cs=4 l=0 c=0x44318c0). discarding previously sent 2 pg_notify(1.2b(22),2.2c(23) epoch 755) v5
>
> 2014-08-20 11:04:28.608141 7f89629bb700  0 -- 10.193.207.117:6816/44281 >> 10.193.207.125:6804/2022817 pipe(0x7ef2280 sd=134 :6816 s=2 pgs=236754 cs=3 l=0 c=0x44318c0).fault, initiating reconnect
> 2014-08-20 11:04:28.609192 7f89636c8700 10 osd.11 783 OSD::ms_get_authorizer type=osd
> 2014-08-20 11:04:28.666895 7f89636c8700 -1 msg/Pipe.cc: In function 'int Pipe::connect()' thread 7f89636c8700 time 2014-08-20 11:04:28.618536
> msg/Pipe.cc: 1080: FAILED assert(m)
>
> Looking into the log, we can see the out_seq is 0. As our cluster has cephx authorization enabled, from the source code I understand that out_seq is initialized to a random number, so there seems to be a bug in the source code.
>
> We hit this crash almost every time our cluster load is heavy, so I think it is a critical bug for ceph.
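
PS: To make the proposed change concrete, below is a rough, self-contained C++ sketch. It is not the actual msg/Pipe.cc code; the struct and function names (Endpoint, handle_reply, accept_with_reset, peer_has_reset) are simplified stand-ins I made up to illustrate the race and where the in_seq reset would help:

// Toy model of the seq negotiation between two Pipe endpoints (A and B).
// NOT the real Ceph msgr code; names and logic are simplified to illustrate
// the race described above and the proposed in_seq reset.
#include <cassert>
#include <cstdint>
#include <deque>
#include <iostream>

struct Endpoint {
  uint64_t connect_seq = 0;   // cs
  uint64_t in_seq = 0;        // highest seq received from the peer
  uint64_t out_seq = 0;       // next seq to send to the peer
  std::deque<uint64_t> out_q; // unacked outgoing messages, by seq
};

// What the connecting side (A) does when the peer replies with its in_seq
// (newly_acked_seq): drop everything the peer already saw.  With out_seq == 0
// and an empty out_q, a stale newly_acked_seq (e.g. 500) walks off the end of
// the queue, which is the analogue of "FAILED assert(m)".
void handle_reply(Endpoint& a, uint64_t newly_acked_seq) {
  while (a.out_seq < newly_acked_seq) {
    assert(!a.out_q.empty() && "got newly_acked_seq N vs out_seq 0 -> crash");
    a.out_q.pop_front();
    ++a.out_seq;
  }
}

// What the accepting side (B) would do with the proposed change: if it can
// tell the connecting side has reset (brand-new Pipe, out_seq restarted),
// reset the stale in_seq instead of advertising it back.
uint64_t accept_with_reset(Endpoint& b, bool peer_has_reset) {
  if (peer_has_reset)
    b.in_seq = 0;      // proposed fix: forget what the old pipe delivered
  return b.in_seq;     // value sent back as newly_acked_seq
}

int main() {
  Endpoint a, b;
  b.connect_seq = 100;
  b.in_seq = 500;      // left over from the old pipe

  // A's Pipe was destroyed and recreated: out_seq restarts at 0, out_q empty.
  a.connect_seq = 102;

  // Without the reset, B replies 500 and handle_reply() asserts:
  // uint64_t ack = accept_with_reset(b, /*peer_has_reset=*/false);

  // With the proposed reset, B replies 0 and A survives:
  uint64_t ack = accept_with_reset(b, /*peer_has_reset=*/true);
  handle_reply(a, ack);
  std::cout << "negotiated: newly_acked_seq=" << ack
            << " out_seq=" << a.out_seq << "\n";
  return 0;
}

Flipping peer_has_reset to false reproduces a "got newly_acked_seq N vs out_seq 0" style failure; with the reset the handshake completes cleanly.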