question about monitor and paxos relationship

On 08/30/2014 08:03 AM, pragya jain wrote:
> Thanks Greg, Joao and David,
>
> The concept why odd no. of monitors are preferred is clear to me, but
> still I am not clear about the working of Paxos algorithm:
>
> #1. All changes in any data structure of monitor whether it is monitor
> map, OSD map, PG map, MDS map or CRUSH map; are made through Paxos
> algorithm and
> #2. Paxos algorithm also establish a quorum among the monitors for
> recent copy of cluster map.
>
> I am unable to understand how these two things are related and connected
> ? how does Paxos provide these two functionalities?

As Greg mentioned before, Paxos is a consensus algorithm, so we can 
leverage it for anything that requires consensus.

We have two portions of the monitors that will use a modified version of 
Paxos (but still Paxos in nature): map consensus and elections.

Let me give you a (rough) temporal view of how the monitor applies this 
once it starts.  Say you have 5 monitors total, 2 of which are down.

1. Alive monitors will "probe" all monitors in the monmap (the other 4 
of them) -- the probing phase is independent from anything Paxos-related 
and is meant to make each monitor aware of which peers are up, alive and 
reachable.

2. Once enough monitors to form a quorum (i.e., at least (N+1)/2) reply 
to the probes, the monitors will enter the election phase.
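That quorum threshold is just a strict majority.  A minimal sketch of the 
arithmetic (plain Python, not Ceph code), which also shows why an even 
number of monitors buys you nothing over the next odd number down:

```python
# Minimal sketch (not Ceph code): the quorum threshold for N monitors.
# A strict majority is floor(N/2) + 1, which equals (N+1)/2 when N is odd.
def quorum_threshold(n_monitors):
    return n_monitors // 2 + 1

print(quorum_threshold(5))  # 3 -- tolerates 2 monitors down
print(quorum_threshold(4))  # 3 -- same threshold as 5, hence "prefer odd"
print(quorum_threshold(3))  # 2 -- tolerates 1 monitor down
```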

3. The election phase is a stripped-down version of Paxos and goes 
something like this:
   - mon.a has rank 0 and thinks it must be the leader
   - mon.b has rank 1 and thinks it must be the leader
   - mon.c has rank 2 and thinks it must be the leader

   - mon.a receives mon.b's and mon.c's leader proposals and ignores 
them, as mon.a has a higher rank than either mon.b or mon.c (the lower 
the value, the higher the rank)

   - mon.c receives mon.a's leader proposal and defers to mon.a (a's 
rank 0 > c's rank 2).
   - mon.c receives mon.b's leader proposal and ignores as it has 
already deferred to a monitor with higher rank than b's (a's rank 0 > 
b's rank 1).

   - mon.b receives mon.a's leader proposal and defers to mon.a (a's 
rank 0 > b's rank 1).

   - mon.a got 3 accepts (mon.a's + mon.b's + mon.c's), which is an 
absolute majority (3 == (N+1)/2, for N = 5).  mon.a declares itself the 
leader, and every other monitor declares itself a peon.

The election phase follows Paxos 'prepare', 'promise', 'accept' and 
'accepted' phases.
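The outcome of that rank-based election can be sketched roughly like 
this (a toy Python model, not Ceph's implementation -- the real election 
exchanges messages, this just computes the end result):

```python
# Toy sketch (not Ceph code) of the rank-based election outcome:
# every live monitor proposes itself, and each one defers to the
# lowest-ranked proposal it sees (lower rank value = higher priority).
def elect(live_monitors, total_monitors):
    """live_monitors: dict of name -> rank. Returns (leader, peons) or None."""
    if len(live_monitors) < total_monitors // 2 + 1:
        return None  # not enough monitors alive to form a quorum
    # Everyone ends up deferring to the lowest-ranked live monitor.
    leader = min(live_monitors, key=live_monitors.get)
    peons = sorted(m for m in live_monitors if m != leader)
    return leader, peons

# 5 monitors in the monmap, two of them down:
print(elect({"mon.a": 0, "mon.b": 1, "mon.c": 2}, total_monitors=5))
# -> ('mon.a', ['mon.b', 'mon.c'])
```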

Same goes for maps.  Once the leader has been elected and the peons 
established, we can state that a quorum has been reached.  The quorum is 
the set of all monitors participating in the cluster; in this case the 
quorum will be { mon.a, mon.b, mon.c }.  After a quorum has been 
established, the monitors will be able to allow map modifications as needed.

So say a new OSD is added to the cluster.  The osdmap needs to reflect 
this.  The leader handles the modification and keeps it on a temporary, 
to-be-committed osdmap, and proposes the changes to all monitors in the 
quorum.

1. Leader proposes the modification to all quorum participants.  Each 
modification is packed with a version and a proposal number.

2. Each monitor will check whether it has seen said proposal number 
before.  If not, it will take the proposal from the leader, stash it on 
disk in a temporary location, and let the leader know that it has been 
accepted.  If, on the other hand, the monitor has seen said proposal 
number before, it will not accept the proposal and will simply ignore 
the leader.

3. The leader will collect all 'accepts' from the peons.  If (N+1)/2 
monitors (counting the leader, which accepts its own proposals by 
default) accepted the proposal, the leader will issue a 'commit' 
instructing everyone to move the proposal from its temporary location to 
its final location (for instance, from 'stashed_proposal' to 
'osdmap:version_10').  If not enough monitors accepted the proposal 
(i.e., fewer than (N+1)/2), a timeout will eventually be triggered and 
the quorum will undergo a new election.

This also follows Paxos 'prepare', 'promise', 'accept' and 'accepted' 
phases, even if we cut corners to reduce message passing.
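The propose/accept/commit round above can be sketched like so (a rough 
Python model with hypothetical names -- real monitors stash proposals on 
disk and exchange messages, they don't share dicts):

```python
# Rough sketch (not Ceph code) of one propose/accept/commit round.
# Each "peon" is modeled as a dict: proposal numbers it has seen,
# plus a pending (stashed) and a committed (final) slot.
def run_proposal(leader, peons, version, proposal_no, total_monitors):
    accepts = 1  # the leader accepts its own proposal by default
    for peon in peons:
        if proposal_no not in peon["seen"]:
            peon["seen"].add(proposal_no)
            peon["pending"] = ("osdmap", version)  # stashed in a temp location
            accepts += 1
        # else: proposal number already seen -> the peon ignores the leader
    if accepts >= total_monitors // 2 + 1:  # absolute majority reached
        for peon in peons:
            if peon.get("pending") == ("osdmap", version):
                peon["committed"] = peon.pop("pending")  # move to final location
        return "committed"
    return "election"  # too few accepts: a timeout triggers a new election

# 5 monitors in the monmap, a quorum of 3: leader plus 2 peons.
peons = [{"seen": set(), "pending": None} for _ in range(2)]
print(run_proposal("mon.a", peons, version=10, proposal_no=1, total_monitors=5))
# -> 'committed'
```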

Hope this helps.

   -Joao

>
> Please help to clarify these points.
>
> Regards
> Pragya Jain
>
>
>
>
> On Saturday, 30 August 2014 7:29 AM, Joao Eduardo Luis
> <joao.luis at inktank.com> wrote:
>
>
>
>     On 08/29/2014 11:22 PM, J David wrote:
>
>      > So an even number N of monitors doesn't give you any better fault
>      > resilience than N-1 monitors.  And the more monitors you have, the
>      > more traffic there is between them.  So when N is even, N monitors
>      > consume more resources and provide no extra benefit compared to N-1
>      > monitors.
>
>
>     Except for more copies ;)
>
>     But yeah, if you're going with 2 or 4, you'll be better off with 3
>     or 5.
>        As long as you don't go with 1 you should be okay.  Only go with
>     1 if
>     you're truly okay with losing whatever you're storing if that one
>     monitor's disk is fried.
>
>        -Joao
>
>
>     --
>     Joao Eduardo Luis
>     Software Engineer | http://inktank.com <http://inktank.com/>|
>     http://ceph.com <http://ceph.com/>
>
>
>


-- 
Joao Eduardo Luis
Software Engineer | http://inktank.com | http://ceph.com

