Re: ceph-mon leader election problem, should it be improved ?

Z Will <zhao6305@xxxxxxxxx> · Tue, 11 Jul 2017 11:25:52 +0800

Hi Joao:

    > Basically, this would be something similar to heartbeats. If a monitor
    > can't reach all monitors in an existing quorum, then just don't do
    > anything.

     Based on your solution, I made a small change:
     - send a probe to all monitors
     - if the probe finds an existing quorum, the monitor joins it by
       sending a join_quorum message; when the leader receives this
       message, it updates the quorum and claims victory again. If the
       join times out, the monitor can't reach the leader, so it does
       nothing and tries again later from bootstrap
     - if it gets > 1/2 acks, do as before and call an election

     With this, the leader will sometimes not have the smallest rank
num; I think this is fine. The quorum message will carry one more
byte to indicate the leader's rank num.
     I think this will perform the same as before and can tolerate some
network partition errors, and it needs only a small code change. Any
suggestions on this? Am I missing any considerations?
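
     To make the steps in the list above concrete, here is a minimal
sketch of the decision a monitor would take after a probe round. It is
only an illustration, not Ceph code: ProbeResult, Action,
decide_after_probe and on_join_timeout are hypothetical names, and
counting the monitor itself towards the "> 1/2" threshold is an
assumption.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Optional, Set


class Action(Enum):
    JOIN_QUORUM = "join_quorum"          # send join_quorum to the current leader
    CALL_ELECTION = "call_election"      # no quorum seen, majority reachable
    RETRY_BOOTSTRAP = "retry_bootstrap"  # do nothing now, probe again later


@dataclass
class ProbeResult:
    """What one monitor learned from a probe round (hypothetical structure)."""
    monmap_size: int              # total number of monitors in the monmap
    acked_ranks: Set[int]         # ranks of the other monitors that acked
    quorum_leader: Optional[int]  # leader rank reported by peers, None if no quorum


def decide_after_probe(probe: ProbeResult) -> Action:
    # A quorum already exists: ask to join it instead of calling an election.
    if probe.quorum_leader is not None:
        return Action.JOIN_QUORUM
    # No quorum: call an election only if a majority is reachable.
    # Assumption: the monitor counts itself towards the majority.
    reachable = len(probe.acked_ranks) + 1
    if reachable * 2 > probe.monmap_size:
        return Action.CALL_ELECTION
    return Action.RETRY_BOOTSTRAP


def on_join_timeout() -> Action:
    # The leader never answered join_quorum, so it is unreachable from here.
    # Stay quiet and start again from bootstrap later.
    return Action.RETRY_BOOTSTRAP


# The 3-monitor test from this thread, seen from mon.1: only mon.2 acks,
# and mon.2 reports a quorum led by mon.0.
probe = ProbeResult(monmap_size=3, acked_ranks={2}, quorum_leader=0)
first_step = decide_after_probe(probe)
```

     In that scenario mon.1 first tries to join, the join times out
because mon.0 is blocked, and mon.1 goes back to bootstrap without
calling an election, so the quorum of mon.0 and mon.2 is left alone.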

On Wed, Jul 5, 2017 at 6:26 PM, Joao Eduardo Luis <joao@xxxxxxx> wrote:
> On 07/05/2017 08:01 AM, Z Will wrote:
>>
>> Hi Joao:
>>     I think this is all because we choose the monitor with the
>> smallest rank number to be leader. With this kind of network error,
>> whichever mon has lost its connection to the mon with the smallest
>> rank num will constantly call elections, that is, it will
>> constantly affect the cluster until it is stopped by a human. So
>> do you think it makes sense if I try to figure out a way to choose
>> as leader the monitor that can see the most monitors, or the one with
>> the smallest rank num if the view nums are the same?
>>     In probing phase:
>>        they will know their own view, so they can set a view num.
>>     In election phase:
>>        they send the view num and rank num.
>>        when receiving the election message, a monitor compares the
>> view num (higher is leader) and rank num (lower is leader).
>
>
> As I understand it, our elector trades off reliability in case of network
> failure for expediency in forming a quorum. This by itself is not a problem
> since we don't see many real-world cases where this behaviour happens, and
> we are a lot more interested in making sure we have a quorum, given that
> without a quorum your cluster is effectively unusable.
>
> Currently, we form a quorum with a minimal number of messages passed.
> From my poor recollection, I think the Elector works something like
>
> - 1 probe message to each monitor in the monmap
> - receives defer from a monitor, or defers to a monitor
> - declares victory if number of defers is an absolute majority (including
> one's defer).
>
> An election cycle takes about 4-5 messages to complete, with roughly two
> round-trips (in the best case scenario).
>
> Figuring out which monitor is able to contact the highest number of
> monitors, and having said monitor being elected the leader, will necessarily
> increase the number of messages transferred.
>
> A rough idea would be
>
> - all monitors will send probes to all other monitors in the monmap;
> - all monitors need to ack the other's probes;
> - each monitor will count the number of monitors it can reach, and then send
> a message proposing itself as the leader to the other monitors, with the
> list of monitors they see;
> - each monitor will propose itself as the leader, or defer to some other
> monitor.
>
> This is closer to 3 round-trips.
>
> Additionally, we'd have to account for the fact that some monitors may be
> able to reach all other monitors, while some may only be able to reach a
> portion. How do we handle this scenario?
>
> - What do we do with monitors that do not reach all other monitors?
> - Do we ignore them for electoral purposes?
> - Are they part of the final quorum?
> - What if we need those monitors to form a quorum?
>
> Personally, I think the easiest solution to this problem would be
> blacklisting a problematic monitor (for a given amount of time, or until a
> new election is needed due to loss of quorum, or by human intervention).
>
> For example, if a monitor believes it should be the leader, and if all other
> monitors are deferring to someone else that is not reachable, the monitor
> could then enter a special case branch:
>
> - send a probe to all monitors
> - receive acks
> - share that with other monitors
> - if that list is missing monitors, then blacklist the monitor for a period,
> and send a message to that monitor with that decision
> - the monitor would blacklist itself and retry in a given amount of time.
>
> Basically, this would be something similar to heartbeats. If a monitor can't
> reach all monitors in an existing quorum, then just don't do anything.
>
> In any case, you are more than welcome to propose a solution. Let us know
> what you come up with and if you want to discuss this a bit more ;)
>
>   -Joao
>
>>
>> On Tue, Jul 4, 2017 at 9:25 PM, Joao Eduardo Luis <joao@xxxxxxx> wrote:
>>>
>>> On 07/04/2017 06:57 AM, Z Will wrote:
>>>>
>>>>
>>>> Hi:
>>>>    I am testing ceph-mon split brain. I have read the code. If I
>>>> understand it right, split brain won't happen. But I think
>>>> there is still another problem. My ceph version is 0.94.10. Here
>>>> are my test details:
>>>>
>>>> 3 ceph-mons, whose ranks are 0, 1, 2 respectively. I stop the rank 1
>>>> mon and use iptables to block the communication between mon.0 and
>>>> mon.1. When the cluster is stable, I start mon.1. I found that none
>>>> of the 3 monitors can work well. They all keep trying to call a new
>>>> leader election. This means the cluster can't work anymore.
>>>>
>>>> Here is my analysis. A mon will always respond to a leader
>>>> election message. In my test, communication between mon.0 and
>>>> mon.1 is blocked, so mon.1 will always try to be leader, because it
>>>> can always see mon.2 and should win over mon.2. Mon.0 should
>>>> always win over mon.2. But mon.2 will always respond to the election
>>>> message issued by mon.1, so this loop will never end. Am I right?
>>>>
>>>> Is this a problem? Or was it just designed like this, to be
>>>> handled by a human?
>>>
>>>
>>>
>>> This is a known behaviour, quite annoying, but easily identifiable by
>>> having the same monitor constantly calling an election and usually
>>> timing out because the peon did not defer to it.
>>>
>>> In a way, the elector algorithm does what it is intended to. Solving this
>>> corner case would be nice, but I don't think there's a good way to solve
>>> it. We may be able to presume a monitor is in trouble during the probe
>>> phase, to disqualify a given monitor from the election, but in the end
>>> this is a network issue that may be transient or unpredictable and
>>> there's only so much we can account for.
>>>
>>> Dealing with it automatically would be nice, but I think, thus far, the
>>> easiest way to address this particular issue is human intervention.
>>>
>>>   -Joao
>>
>> --
>> To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
>> the body of a message to majordomo@xxxxxxxxxxxxxxx
>> More majordomo info at  http://vger.kernel.org/majordomo-info.html
>>
>