Hi Greg,
I did some investigation into the issue - http://tracker.ceph.com/issues/10080. Can you help take a look at whether the analysis makes sense?
Thanks,
Guang

________________________________
> Date: Mon, 24 Nov 2014 00:59:05 -0800
> From: redmine@xxxxxxxxxxxxxxxx
> Subject: [Ceph - Bug #10080] Pipe::connect() cause osd crash when osd reconnect to its peer
>
> Issue #10080 has been updated by Guang Yang.
>
> I am wondering if the following race occurred:
>
> Let us assume A and B are two OSDs with a connection (pipe) between each other.
>
> 1. B issued a re-connection for whatever reason, and at the same time, A marked the connection down and destroyed its Pipe to B.
> 2. Let us assume B has: cs = 100, in_seq = 500.
> 3. The connection is established with cs = 101.
> 4. For whatever reason, A hit a failure during read and issued a new connection request with cs = 102; since the Pipe is brand new, A has out_seq = 0 and cs = 102.
> 5. B accepted the connection request and responded with in_seq = 500 (already wrong here).
> 6. A compared the returned in_seq with its internal out_seq and out_q, and crashed with an assertion failure.
>
> If this is the case, it seems one step was missed during seq negotiation: when B handles the new connection and detects that A has reset, it should reset its in_seq as well.
>
> Thanks,
> Guang
>
> ________________________________
> Bug #10080: Pipe::connect() cause osd crash when osd reconnect to its peer<http://tracker.ceph.com/issues/10080#change-44767>
>
> * Author: Wenjun Huang
> * Status: New
> * Priority: High
> * Assignee:
> * Category: msgr
> * Target version:
> * Source: Community (user)
> * Backport:
> * Tags:
> * Severity: 3 - minor
> * Reviewed:
> * Suite:
>
> When our cluster load is heavy, the osd sometimes crashes. The critical log is as below:
>
> -278> 2014-08-20 11:04:28.609192 7f89636c8700 10 osd.11 783 OSD::ms_get_authorizer type=osd
> -277> 2014-08-20 11:04:28.609783 7f89636c8700  2 -- 10.193.207.117:6816/44281 >> 10.193.207.125:6804/2022817 pipe(0x7ef2280 sd=105 :42657 s=1 pgs=236754 cs=4 l=0 c=0x44318c0). got newly_acked_seq 546 vs out_seq 0
> -276> 2014-08-20 11:04:28.609810 7f89636c8700  2 -- 10.193.207.117:6816/44281 >> 10.193.207.125:6804/2022817 pipe(0x7ef2280 sd=105 :42657 s=1 pgs=236754 cs=4 l=0 c=0x44318c0). discarding previously sent 1 osd_map(727..755 src has 1..755) v3
> -275> 2014-08-20 11:04:28.609859 7f89636c8700  2 -- 10.193.207.117:6816/44281 >> 10.193.207.125:6804/2022817 pipe(0x7ef2280 sd=105 :42657 s=1 pgs=236754 cs=4 l=0 c=0x44318c0). discarding previously sent 2 pg_notify(1.2b(22),2.2c(23) epoch 755) v5
>
> 2014-08-20 11:04:28.608141 7f89629bb700  0 -- 10.193.207.117:6816/44281 >> 10.193.207.125:6804/2022817 pipe(0x7ef2280 sd=134 :6816 s=2 pgs=236754 cs=3 l=0 c=0x44318c0).fault, initiating reconnect
> 2014-08-20 11:04:28.609192 7f89636c8700 10 osd.11 783 OSD::ms_get_authorizer type=osd
> 2014-08-20 11:04:28.666895 7f89636c8700 -1 msg/Pipe.cc: In function 'int Pipe::connect()' thread 7f89636c8700 time 2014-08-20 11:04:28.618536
> msg/Pipe.cc: 1080: FAILED assert(m)
>
> Looking into the log, we can see the out_seq is 0. As our cluster has cephx authorization enabled, from the source code I understand that out_seq is initialized to a random number, so there seems to be a bug in the source code.
>
> We hit this crash almost every time our cluster load is heavy, so I think it is a critical bug for ceph.
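
PS: To make the proposed change concrete, below is a rough, self-contained C++ sketch. It is not the actual msg/Pipe.cc code; the struct and function names (Endpoint, handle_reply, accept_with_reset, peer_has_reset) are simplified stand-ins I made up to illustrate the race and where the in_seq reset would help:

// Toy model of the seq negotiation between two Pipe endpoints (A and B).
// NOT the real Ceph msgr code; names and logic are simplified to illustrate
// the race described above and the proposed in_seq reset.
#include <cassert>
#include <cstdint>
#include <deque>
#include <iostream>

struct Endpoint {
  uint64_t connect_seq = 0;   // cs
  uint64_t in_seq = 0;        // highest seq received from the peer
  uint64_t out_seq = 0;       // next seq to send to the peer
  std::deque<uint64_t> out_q; // unacked outgoing messages, by seq
};

// What the connecting side (A) does when the peer replies with its in_seq
// (newly_acked_seq): drop everything the peer already saw.  With out_seq == 0
// and an empty out_q, a stale newly_acked_seq (e.g. 500) walks off the end of
// the queue, which is the analogue of "FAILED assert(m)".
void handle_reply(Endpoint& a, uint64_t newly_acked_seq) {
  while (a.out_seq < newly_acked_seq) {
    assert(!a.out_q.empty() && "got newly_acked_seq N vs out_seq 0 -> crash");
    a.out_q.pop_front();
    ++a.out_seq;
  }
}

// What the accepting side (B) would do with the proposed change: if it can
// tell the connecting side has reset (brand-new Pipe, out_seq restarted),
// reset the stale in_seq instead of advertising it back.
uint64_t accept_with_reset(Endpoint& b, bool peer_has_reset) {
  if (peer_has_reset)
    b.in_seq = 0;      // proposed fix: forget what the old pipe delivered
  return b.in_seq;     // value sent back as newly_acked_seq
}

int main() {
  Endpoint a, b;
  b.connect_seq = 100;
  b.in_seq = 500;      // left over from the old pipe

  // A's Pipe was destroyed and recreated: out_seq restarts at 0, out_q empty.
  a.connect_seq = 102;

  // Without the reset, B replies 500 and handle_reply() asserts:
  // uint64_t ack = accept_with_reset(b, /*peer_has_reset=*/false);

  // With the proposed reset, B replies 0 and A survives:
  uint64_t ack = accept_with_reset(b, /*peer_has_reset=*/true);
  handle_reply(a, ack);
  std::cout << "negotiated: newly_acked_seq=" << ack
            << " out_seq=" << a.out_seq << "\n";
  return 0;
}

Flipping peer_has_reset to false reproduces a "got newly_acked_seq N vs out_seq 0" style failure; with the reset the handshake completes cleanly.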