Thanks all for your great explanation.

Regards
Pragya Jain

On Saturday, 30 August 2014 4:51 PM, Joao Eduardo Luis <joao.luis at inktank.com> wrote:

>On 08/30/2014 08:03 AM, pragya jain wrote:
>> Thanks Greg, Joao and David,
>>
>> The concept of why an odd number of monitors is preferred is clear to
>> me, but I am still not clear about the working of the Paxos algorithm:
>>
>> #1. All changes to any data structure of the monitor, whether it is the
>> monitor map, OSD map, PG map, MDS map or CRUSH map, are made through
>> the Paxos algorithm, and
>> #2. the Paxos algorithm also establishes a quorum among the monitors
>> for the most recent copy of the cluster map.
>>
>> I am unable to understand how these two things are related and
>> connected. How does Paxos provide these two functionalities?
>
>As Greg mentioned before, Paxos is a consensus algorithm, thus we can
>leverage Paxos for anything that may require consensus.
>
>We have two portions of the monitors that will use a modified version of
>Paxos (but still Paxos in nature): map consensus and elections.
>
>Let me give you a (rough) temporal view of how the monitor applies this
>once it starts. Say you have 5 monitors total, 2 of which are down.
>
>1. Alive monitors will "probe" all monitors in the monmap (all other 4
>of them) -- the probing phase is independent from anything-Paxos and is
>meant to raise awareness of the monitors that are up, alive and reachable.
>
>2. Once enough monitors to form a quorum (i.e., at least (N+1)/2) reply
>to the probes, the monitors will enter the election phase.
>
>3. The election phase is a stripped-down version of Paxos and goes
>something like this:
>
>  - mon.a has rank 0 and thinks it must be the leader
>  - mon.b has rank 1 and thinks it must be the leader
>  - mon.c has rank 2 and thinks it must be the leader
>
>  - mon.a receives mon.b's and mon.c's leader proposals and ignores
>them, as mon.a has a higher rank than mon.b or mon.c (the lower the
>value, the higher the rank).
>
>  - mon.c receives mon.a's leader proposal and defers to mon.a (a's
>rank 0 outranks c's rank 2).
>  - mon.c receives mon.b's leader proposal and ignores it, as it has
>already deferred to a monitor with a higher rank than b's (a's rank 0
>outranks b's rank 1).
>
>  - mon.b receives mon.a's leader proposal and defers to mon.a (a's
>rank 0 outranks b's rank 1).
>
>  - mon.a got 3 accepts (mon.a's + mon.b's + mon.c's), which is an
>absolute majority (3 == (N+1)/2, for N = 5). mon.a declares itself the
>leader, every other monitor declares itself a peon.
>
>The election phase follows Paxos 'prepare', 'promise', 'accept' and
>'accepted' phases.
>
>Same goes for maps. Once the leader has been elected and the peons
>established we can state that a quorum was reached. The quorum is the
>set of all monitors participating in the cluster, and in this case the
>quorum will be { mon.a, mon.b, mon.c }. After a quorum has been
>established the monitors will be able to allow map modifications as needed.
>
>So say a new OSD is added to the cluster. The osdmap needs to reflect
>this. The leader handles the modification and keeps it on a temporary,
>to-be-committed osdmap, and proposes the changes to all monitors in the
>quorum.
>
>1. Leader proposes the modification to all quorum participants. Each
>modification is packed with a version and a proposal number.
>
>2. Each monitor will check if it has seen said proposal number before.
>If not, it will take the proposal from the leader, stash it on disk in a
>temporary location, and let the leader know that it has been accepted.
>If on the other hand the monitor sees that said proposal number has been
>proposed before, then it will not accept the proposal and will simply
>ignore the leader.
>
>3. The leader will collect all 'accepts' from peons. If (N+1)/2
>monitors (counting the leader, which accepts its own proposals by
>default) accepted the proposal, then the leader will issue a 'commit'
>instructing everyone to move the proposal from its temporary location to
>its final location (for instance, from 'stashed_proposal' to
>'osdmap:version_10'). If by chance not enough monitors accepted the
>proposal (i.e., fewer than (N+1)/2), eventually a timeout will be
>triggered and the quorum will undergo a new election.
>
>This also follows Paxos 'prepare', 'promise', 'accept' and 'accepted'
>phases, even if we cut corners to reduce message passing.
>
>Hope this helps.
>
>  -Joao
>
>>
>> Please help to clarify these points.
>>
>> Regards
>> Pragya Jain
>>
>> On Saturday, 30 August 2014 7:29 AM, Joao Eduardo Luis
>> <joao.luis at inktank.com> wrote:
>>
>>     On 08/29/2014 11:22 PM, J David wrote:
>>
>>      > So an even number N of monitors doesn't give you any better fault
>>      > resilience than N-1 monitors. And the more monitors you have, the
>>      > more traffic there is between them. So when N is even, N monitors
>>      > consume more resources and provide no extra benefit compared to N-1
>>      > monitors.
>>
>>     Except for more copies ;)
>>
>>     But yeah, if you're going with 2 or 4, you'll be better off with 3
>>     or 5. As long as you don't go with 1 you should be okay. Only go
>>     with 1 if you're truly okay with losing whatever you're storing if
>>     that one monitor's disk is fried.
>>
>>     -Joao
>>
>>     --
>>     Joao Eduardo Luis
>>     Software Engineer | http://inktank.com | http://ceph.com
>
>--
>Joao Eduardo Luis
>Software Engineer | http://inktank.com | http://ceph.com
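
For readers who prefer code, the election and map-update flow described in the thread can be sketched roughly as follows. This is a toy, in-memory Python model and not Ceph's monitor code: the names Monitor, quorum_size, tolerated_failures, elect_leader and propose are all made up for illustration, there is no networking or timeout handling, and the majority is computed as floor(N/2) + 1, which is an assumption that matches the (N+1)/2 in the thread only when N is odd.

```python
from dataclasses import dataclass, field
from typing import Optional


def quorum_size(n_monitors: int) -> int:
    """Smallest strict majority of N monitors; equals (N+1)/2 when N is odd."""
    return n_monitors // 2 + 1


def tolerated_failures(n_monitors: int) -> int:
    """How many monitors can be down while a majority can still form."""
    return n_monitors - quorum_size(n_monitors)


def elect_leader(reachable_ranks, n_monitors):
    """Return the winning rank, or None if too few monitors are reachable."""
    if len(reachable_ranks) < quorum_size(n_monitors):
        return None
    # Every monitor defers to the lowest rank value it hears from.
    return min(reachable_ranks)


@dataclass
class Monitor:
    rank: int
    seen_pns: set = field(default_factory=set)   # proposal numbers seen so far
    stashed: Optional[tuple] = None              # (key, value) awaiting commit
    store: dict = field(default_factory=dict)    # committed map versions

    def handle_proposal(self, pn, key, value) -> bool:
        """Accept a proposal unless its proposal number was seen before."""
        if pn in self.seen_pns:
            return False
        self.seen_pns.add(pn)
        self.stashed = (key, value)  # stands in for the temporary on-disk location
        return True

    def handle_commit(self) -> None:
        """Move the stashed proposal to its final location."""
        if self.stashed is not None:
            key, value = self.stashed
            self.store[key] = value
            self.stashed = None


def propose(quorum, n_monitors, pn, key, value) -> bool:
    """Leader-side round: collect accepts, commit only on a majority.

    The leader is assumed to be a member of `quorum` and accepts its own
    proposal like any other member.
    """
    accepted = [m for m in quorum if m.handle_proposal(pn, key, value)]
    if len(accepted) < quorum_size(n_monitors):
        # In the thread's description a timeout would fire here and the
        # quorum would go through a new election; this sketch just reports it.
        return False
    for m in accepted:
        m.handle_commit()
    return True


if __name__ == "__main__":
    # 5 monitors in the monmap, 2 of them down: mon.a, mon.b, mon.c remain.
    quorum = [Monitor(rank) for rank in (0, 1, 2)]
    print("leader rank:", elect_leader([m.rank for m in quorum], 5))
    print("committed:", propose(quorum, 5, pn=100,
                                key="osdmap:version_10", value="new OSD added"))
    for n in (3, 4, 5):
        print(n, "monitors tolerate", tolerated_failures(n), "failure(s)")
```

The tolerated_failures helper is only there to make J David's point concrete: under the strict-majority assumption, 4 monitors survive the same number of failures as 3.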