Hi Joao:

> Basically, this would be something similar to heartbeats. If a monitor can't
> reach all monitors in an existing quorum, then just don't do anything.

Based on your solution, I made a small change:

- send a probe to all monitors
- if the probe finds an existing quorum, join that quorum through a
  join_quorum message; when the leader receives it, it changes the quorum and
  claims victory again.  If this times out, it means the leader can't be
  reached, so do nothing and retry later from bootstrap.
- if more than 1/2 of the monitors ack the probe, do as before and call an
  election.

With this, the leader will sometimes not have the smallest rank number; I
think that is fine.  The quorum message needs one extra byte to carry the
leader's rank number.

I think this will perform the same as before, can tolerate some network
partition errors, and only needs a small code change.  Any suggestions?
Am I missing any considerations?
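To make the idea concrete, here is a minimal, self-contained sketch of the
decision a booting monitor would make after the probe phase.  The names and
types (ProbeSummary, decide_after_probe, and so on) are invented for
illustration; this is not the actual Monitor/Elector code.

    // Sketch only: hypothetical types, not the real ceph-mon code.
    #include <cstddef>
    #include <iostream>

    enum class Action { JoinQuorum, CallElection, RetryLater };

    struct ProbeSummary {
      bool   quorum_exists;  // did any probed monitor report an existing quorum?
      int    leader_rank;    // rank of that quorum's leader, if one was reported
      size_t acks;           // monitors that acked our probe (including ourselves)
      size_t monmap_size;    // total number of monitors in the monmap
    };

    // Decide what a (re)booting monitor does once the probe phase finishes.
    Action decide_after_probe(const ProbeSummary& p) {
      if (p.quorum_exists) {
        // A quorum already exists: send join_quorum to its leader (p.leader_rank)
        // instead of calling an election.  If the join times out because the
        // leader is unreachable (e.g. a partition), fall back to bootstrap and
        // retry later rather than disturbing the existing quorum.
        return Action::JoinQuorum;
      }
      if (p.acks * 2 > p.monmap_size) {
        // No quorum yet, but we can reach an absolute majority: behave as
        // before and call an election.
        return Action::CallElection;
      }
      // We can only see a minority: stay quiet and re-probe later.
      return Action::RetryLater;
    }

    int main() {
      // Example: 3 mons, mon.1 is cut off from mon.0 but still sees mon.2,
      // and mon.0 + mon.2 already form a quorum led by mon.0.
      ProbeSummary p{true, 0, 2, 3};
      std::cout << static_cast<int>(decide_after_probe(p)) << "\n";  // 0 = JoinQuorum
      return 0;
    }

The join_quorum timeout is what keeps a partitioned monitor from repeatedly
breaking an established quorum: if it cannot reach the leader, it simply goes
back to bootstrap and tries again later.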
On Wed, Jul 5, 2017 at 6:26 PM, Joao Eduardo Luis <joao@xxxxxxx> wrote:
> On 07/05/2017 08:01 AM, Z Will wrote:
>>
>> Hi Joao:
>>     I think this is all because we choose the monitor with the smallest
>> rank number to be leader.  With this kind of network error, whichever mon
>> has lost its connection to the mon with the smallest rank number will be
>> constantly calling elections, that is to say, it will constantly affect
>> the cluster until it is stopped by a human.  So do you think it makes
>> sense if I try to figure out a way to choose as leader the monitor that
>> can see the most monitors, or the one with the smallest rank number if
>> the view numbers are the same?
>>     In the probing phase: the monitors learn their own view, so they can
>> set a view number.
>>     In the election phase: they send the view number and the rank number.
>> When receiving an election message, a monitor compares the view number
>> (higher becomes leader) and the rank number (lower becomes leader).
>
> As I understand it, our elector trades off reliability in the case of
> network failure for expediency in forming a quorum.  This by itself is not
> a problem, since we don't see many real-world cases where this behaviour
> happens, and we are a lot more interested in making sure we have a quorum -
> given that without a quorum your cluster is effectively unusable.
>
> Currently, we form a quorum with a minimal number of messages passed.
> From my poor recollection, I think the Elector works something like
>
> - 1 probe message to each monitor in the monmap
> - receives a defer from a monitor, or defers to a monitor
> - declares victory if the number of defers is an absolute majority
>   (including one's own defer).
>
> An election cycle takes about 4-5 messages to complete, with roughly two
> round-trips (in the best-case scenario).
>
> Figuring out which monitor is able to contact the highest number of
> monitors, and having said monitor be elected the leader, will necessarily
> increase the number of messages transferred.
>
> A rough idea would be
>
> - all monitors send probes to all other monitors in the monmap;
> - all monitors ack the other monitors' probes;
> - each monitor counts the number of monitors it can reach, and then sends
>   a message proposing itself as the leader to the other monitors, with the
>   list of monitors it sees;
> - each monitor proposes itself as the leader, or defers to some other
>   monitor.
>
> This is closer to 3 round-trips.
>
> Additionally, we'd have to account for the fact that some monitors may be
> able to reach all other monitors, while some may only be able to reach a
> portion.  How do we handle this scenario?
>
> - What do we do with monitors that do not reach all other monitors?
> - Do we ignore them for electoral purposes?
> - Are they part of the final quorum?
> - What if we need those monitors to form a quorum?
>
> Personally, I think the easiest solution to this problem would be
> blacklisting a problematic monitor (for a given amount of time, or until a
> new election is needed due to loss of quorum, or by human intervention).
>
> For example, if a monitor believes it should be the leader, and if all
> other monitors are deferring to someone else that is not reachable, the
> monitor could then enter a special-case branch:
>
> - send a probe to all monitors
> - receive acks
> - share that with the other monitors
> - if that list is missing monitors, then blacklist the monitor for a
>   period, and send a message to that monitor with that decision
> - the monitor would blacklist itself and retry in a given amount of time.
>
> Basically, this would be something similar to heartbeats.  If a monitor
> can't reach all monitors in an existing quorum, then just don't do
> anything.
>
> In any case, you are more than welcome to propose a solution.  Let us know
> what you come up with and if you want to discuss this a bit more ;)
>
> -Joao
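Just to check my understanding of the blacklisting branch above, here is a
minimal sketch of the check that would trigger the self-blacklist.  I am
simplifying by letting the monitor inspect its own ack list (in the
description above the result is also shared with, and confirmed by, the
other monitors), and the names are invented for illustration, not actual
code.

    // Sketch of the self-blacklisting check (invented names, not ceph code).
    #include <algorithm>
    #include <iostream>
    #include <string>
    #include <vector>

    // Given the monitors we expect (the monmap) and the monitors that acked
    // our probe, decide whether this monitor should blacklist itself for a
    // while instead of repeatedly calling elections it can never win.
    bool should_blacklist_self(const std::vector<std::string>& monmap,
                               const std::vector<std::string>& acked) {
      for (const auto& mon : monmap) {
        if (std::find(acked.begin(), acked.end(), mon) == acked.end()) {
          // At least one monitor in the map never acked our probe: peers may
          // be deferring to someone we cannot reach, so back off for a period
          // (or until a human intervenes) rather than disrupt the others.
          return true;
        }
      }
      return false;  // we can reach everyone, keep participating normally
    }

    int main() {
      // mon.1 can see mon.2 but not mon.0: it should sit out for a while.
      std::vector<std::string> monmap = {"mon.0", "mon.1", "mon.2"};
      std::vector<std::string> acked  = {"mon.1", "mon.2"};
      std::cout << std::boolalpha << should_blacklist_self(monmap, acked) << "\n";
      return 0;
    }

The blacklist duration, and whether release needs human intervention, would
be policy layered on top of this check.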
>>
>> On Tue, Jul 4, 2017 at 9:25 PM, Joao Eduardo Luis <joao@xxxxxxx> wrote:
>>>
>>> On 07/04/2017 06:57 AM, Z Will wrote:
>>>>
>>>> Hi:
>>>>    I am testing ceph-mon brain split.  I have read the code, and if I
>>>> understand it correctly, it won't split-brain.  But I think there is
>>>> still another problem.  My ceph version is 0.94.10.  Here is my test in
>>>> detail:
>>>>
>>>> There are 3 ceph-mons, with ranks 0, 1 and 2 respectively.  I stop the
>>>> rank-1 mon and use iptables to block the communication between mon 0
>>>> and mon 1.  When the cluster is stable, I start mon.1.  I found that
>>>> all three monitors stop working properly: they all keep trying to call
>>>> a new leader election, which means the cluster can't work any more.
>>>>
>>>> Here is my analysis.  A mon will always respond to a leader election
>>>> message, and in my test the communication between mon.0 and mon.1 is
>>>> blocked, so mon.1 will always try to be leader: it can always see
>>>> mon.2, and it should win over mon.2.  Mon.0 should always win over
>>>> mon.2.  But mon.2 will always respond to the election message issued
>>>> by mon.1, so this loop will never end.  Am I right?
>>>>
>>>> Is this a problem?  Or was it just designed like this, and should it
>>>> be handled by a human?
>>>
>>> This is a known behaviour, quite annoying, but easily identifiable by
>>> having the same monitor constantly calling an election and usually
>>> timing out because the peon did not defer to it.
>>>
>>> In a way, the elector algorithm does what it is intended to.  Solving
>>> this corner case would be nice, but I don't think there's a good way to
>>> solve it.  We may be able to presume a monitor is in trouble during the
>>> probe phase, to disqualify a given monitor from the election, but in the
>>> end this is a network issue that may be transient or unpredictable, and
>>> there's only so much we can account for.
>>>
>>> Dealing with it automatically would be nice, but I think, thus far, the
>>> easiest way to address this particular issue is human intervention.
>>>
>>> -Joao