RE: Re: 2 replications, flapping cannot stop for a very long time


 



This is essentially an unsolvable problem: per the CAP theorem, we chose Consistency and Availability, so we had to give up Partition tolerance.

There are three networks involved here: mon <-> osd, osd <-public-> osd, and osd <-cluster-> osd. If some of these networks are reachable but others are not, flapping is likely to happen.
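For reference, the public/cluster split is configured in ceph.conf roughly as below; the subnets here are illustrative placeholders, not values from this thread:

```ini
# ceph.conf fragment: separate public and cluster networks.
# The subnets are illustrative placeholders.
[global]
public_network  = 192.168.1.0/24   ; client and mon <-> osd traffic
cluster_network = 192.168.2.0/24   ; osd <-> osd replication and back-side heartbeats
```

Cutting only the cluster_network reproduces the partial-partition situation described above: peer heartbeats fail while the mon path stays up.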

-----Original Message-----
From: ceph-devel-owner@xxxxxxxxxxxxxxx [mailto:ceph-devel-owner@xxxxxxxxxxxxxxx] On Behalf Of huang jun
Sent: Sunday, September 13, 2015 5:46 PM
To: zhao.mingyue@xxxxxxx
Cc: ceph-devel@xxxxxxxxxxxxxxx
Subject: Re: Re: 2 replications, flapping cannot stop for a very long time

2015-09-13 14:07 GMT+08:00 zhao.mingyue@xxxxxxx <zhao.mingyue@xxxxxxx>:
> hi, do you set both public_network and cluster_network, but just cut off the cluster_network?
> And do you have more than one osd on the same host?
> ============================= yes, public network + cluster network, and I
> cut off the cluster network; 2 nodes, each node has several osds;
>
> If so, you may not reach a stable state. Each osd has peers among the previous and next osd ids, and they exchange ping messages.
> When you cut off the cluster_network, the peer osds on the other side can no longer receive the pings, so they report the osd's failure to the MON; once the MON gathers enough reporters and reports, the osd is marked down.
> ============================= when an osd receives a new map in which it is marked down, it thinks the MON wrongly marked it down. What will it do then: join the cluster again, or something else? Can you give a more detailed explanation?

It will send a boot message to the MON, and will be marked UP again by the MON.
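The mark-down / boot cycle described in this thread can be modeled as a small loop. This is a toy sketch, not Ceph's actual code; the names and the `MIN_REPORTERS` threshold are illustrative:

```python
# Toy model of the flapping loop: peers on the cut cluster network report the
# OSD down, the monitor marks it down, then the OSD (still reachable over the
# public network) sends a boot message and is marked up again.
# Names and thresholds are illustrative, not Ceph's implementation.

MIN_REPORTERS = 2  # monitor needs failure reports from this many distinct peers


def monitor_round(reporters, osd_reaches_mon):
    """One monitor decision round for a single OSD.

    reporters:       set of peer OSD ids whose heartbeats to this OSD failed
    osd_reaches_mon: True if the OSD can still talk to the monitor over
                     the public network
    Returns the state transitions the monitor applies in this round.
    """
    events = []
    if len(reporters) >= MIN_REPORTERS:
        events.append("down")      # enough distinct reporters: mark OSD down
    if osd_reaches_mon and "down" in events:
        events.append("up")        # OSD sends a boot message, marked up again
    return events


# Cluster network cut, public network fine: every round ends down-then-up,
# so the OSD flaps indefinitely rather than settling.
history = []
for _ in range(3):
    history += monitor_round(reporters={1, 2}, osd_reaches_mon=True)
```

Running three rounds yields an alternating down/up history, matching the endless flapping observed when only the cluster network is cut.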
>
> But the osd can still report to the MON because the public_network is fine, so the MON decides the osd was wrongly marked down and marks it UP.
> ============================= you mean that once the MON receives a single message from this osd, it will mark the osd up?
>

> So the flapping happens again and again.
> ============================= I tried 3 replications (public network +
> cluster network, 3 nodes, each node has several osds). Although flapping
> occurs, the cluster becomes stable after several minutes. In the 2-replication case, after waiting the same interval, the cluster still cannot become stable. So I'm confused about the mechanism: how does the monitor decide which osd is actually down?
>
That is weird: if you cut off the cluster_network, the osds on the other node cannot receive the ping messages, and will naturally conclude the osd has failed.
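On the grace question raised above: Ceph's monitor can stretch the heartbeat grace period for OSDs with a history of lagging (the `osd_heartbeat_grace` and `mon_osd_adjust_heartbeat_grace` options exist; the exact decay-based computation is more involved). A simplified sketch, with illustrative names and an illustrative formula:

```python
# Simplified sketch of an adaptive failure-detection grace period.
# osd_heartbeat_grace is a real Ceph option; the adjustment formula below is
# illustrative, not Ceph's exact decay-based computation.

def effective_grace(base_grace, laggy_probability, laggy_interval):
    """Stretch the grace period for OSDs with a history of lagging.

    base_grace:        configured osd_heartbeat_grace (seconds)
    laggy_probability: estimated probability (0..1) the OSD is merely laggy
    laggy_interval:    estimated duration (seconds) of past laggy episodes
    """
    return base_grace + laggy_probability * laggy_interval


# An OSD with no laggy history keeps the base 20 s grace; one that was often
# laggy for ~60 s gets extra slack before peers' reports can mark it down.
g0 = effective_grace(20.0, 0.0, 60.0)
g1 = effective_grace(20.0, 0.5, 60.0)
```

A longer effective grace delays the down decision, which is one reason flapping intervals can differ between clusters with different histories.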

> thanks
>
> -----Original Message-----
> From: huang jun [mailto:hjwsm1989@xxxxxxxxx]
> Sent: September 13, 2015 10:39
> To: zhaomingyue 09440 (RD)
> Cc: ceph-devel@xxxxxxxxxxxxxxx
> Subject: Re: 2 replications, flapping cannot stop for a very long time
>
> hi, do you set both public_network and cluster_network, but just cut off the cluster_network?
> And do you have more than one osd on the same host?
> If so, you may not reach a stable state. Each osd has peers among the previous and next osd ids, and they exchange ping messages.
> When you cut off the cluster_network, the peer osds on the other side can no longer receive the pings, so they report the osd's failure to the MON; once the MON gathers enough reporters and reports, the osd is marked down.
> But the osd can still report to the MON because the public_network is fine, so the MON decides the osd was wrongly marked down and marks it UP.
> So the flapping happens again and again.
>
> 2015-09-12 20:26 GMT+08:00 zhao.mingyue@xxxxxxx <zhao.mingyue@xxxxxxx>:
>>
>> Hi,
>> I'm testing the reliability of ceph recently, and I have run into the flapping problem.
>> I have 2 replications, and I cut off the cluster network; now the flapping cannot stop. I have waited more than 30 minutes, but the status of the osds is still not stable.
>>     I want to know: when the monitor receives reports from osds, how does it decide to mark an osd down?
>>     (reports && reporters && grace) need to satisfy some conditions; how is the grace calculated?
>> And how long until the flapping stops? Must the flapping be stopped by configuration, such as marking an osd lost?
>> Can someone help me?
>> Thanks~
>> ----------------------------------------------------------------------
>> This e-mail and its attachments contain confidential information from 
>> H3C, which is intended only for the person or entity whose address is 
>> listed above. Any use of the information contained herein in any way 
>> (including, but not limited to, total or partial disclosure, 
>> reproduction, or dissemination) by persons other than the intended
>> recipient(s) is prohibited. If you receive this e-mail in error, 
>> please notify the sender by phone or email immediately and delete it!
>
>
>
> --
> thanks
> huangjun



--
thanks
huangjun
--
To unsubscribe from this list: send the line "unsubscribe ceph-devel" in the body of a message to majordomo@xxxxxxxxxxxxxxx More majordomo info at  http://vger.kernel.org/majordomo-info.html



