Nigel mistakenly replied just to me, so I'm CC'ing the list.

On 08/30/2014 08:12 AM, Nigel Williams wrote:
> On Sat, Aug 30, 2014 at 11:59 AM, Joao Eduardo Luis
> <joao.luis at inktank.com> wrote:
>> But yeah, if you're going with 2 or 4, you'll be better off with 3 or
>> 5. As long as you don't go with 1 you should be okay.
>
> On a recent panel discussion one member strongly advocated 5 as the
> minimum number of MONs for a large Ceph deployment. Large in this case
> was PBs of storage.
>
> For a Ceph cluster with 100s of OSDs and 100s of TB across multiple
> racks (therefore many paths involved), is 5 x MONs a good rule of
> thumb, or is three sufficient?

Whoever stated that was probably right. I don't often like to speak about what works best for (really) large deployments, as I rarely see them.

In theory, 5 monitors will fare better than 3 for 100s of OSDs. As far as the monitors are concerned, this is mostly because 5 monitors can serve more maps concurrently than 3 monitors would. I don't think we have tests to back my reasoning here, but I don't think the cluster's workload or size has much bearing on the number of monitors.

Although it's a technical detail, every message an OSD sends to a monitor that triggers a map update is *always* forwarded to the leader monitor. This means that regardless of how many monitors you have, the same monitor always ends up dealing with the map updates, and that puts a cap on map update throughput -- usually not a big deal, and knobs may be adjusted if need be.

On the other hand, having 5 monitors instead of 3 means you'll be able to spread OSD connections across more monitors, and even if updates are forwarded to the leader, connection-wise the load is more spread out: the message is forwarded by the monitor the OSD connects to, and that monitor acts as a proxy when replying to the OSD, so the leader isn't hammered directly as much.

But the point where this may actually make a real difference is in serving osdmap updates. The OSDs need those updates, and even though OSDs will share maps amongst themselves, they still need to get them from somewhere -- and that somewhere is the monitor cluster. If you have 100s of OSDs connected to just 3 monitors, each monitor ends up serving bunches of reads (sending map updates to OSDs) while also dealing with the messages that trigger map updates (which are in turn forwarded to the leader). Given that any client (OSDs included) connects to a monitor at random on startup and maintains that connection for a while, a rule of thumb would tell us that the leader is responsible for serving 1/3 of all map reads while still handling map updates. Having 5 monitors reduces this load to 1/5.

However, I don't know of a good indicator of whether a given cluster should go with 5 monitors instead of 3, or 7 instead of 5. I don't think there are many clusters running 7 monitors, but it may well be that for even larger clusters, having 5 or 7 monitors serving updates makes up for the increased number of messages required to commit an update -- keep in mind that, due to the nature of Paxos, one always needs acks for an update from at least (N+1)/2 monitors. Again, this cuts both ways: we may have more messages being passed around, but since each monitor is under lower load, we may even get through them faster.

I think I went a bit off track. Let me know if this led to further confusion instead.
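If numbers help, here's a quick back-of-the-envelope sketch of the arithmetic above (plain Python, not Ceph code; the even read spread is just the rule-of-thumb assumption that clients pick a monitor at random):

    # Rough numbers only -- not Ceph code. Assumes the rule of thumb
    # above: clients pick a monitor at random, so reads spread evenly.

    def quorum_size(n_mons):
        # Paxos needs a majority ack: at least (N+1)/2 monitors.
        return (n_mons + 1) // 2

    for n in (3, 5, 7):
        q = quorum_size(n)
        print("%d mons: quorum=%d, tolerates %d down, "
              "each mon serves ~1/%d of map reads" % (n, q, n - q, n))

So going from 3 to 5 monitors drops each monitor's share of map reads from 1/3 to 1/5, at the cost of one extra ack per committed update (2 -> 3), and buys you tolerance for a second monitor failure.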
  -Joao

-- 
Joao Eduardo Luis
Software Engineer | http://inktank.com | http://ceph.com