question about monitor and paxos relationship

If you want your data to be N+2 redundant (able to handle 2 failures, more
or less), then you need to set size=3 and have 3 replicas of your data.

If you want your monitors to be N+2 redundant, then you need 5 monitors.

If you feel that your data is worth size=3, then you should really try to
have 5 monitors.  Unless you're building a cluster with <5 servers, of
course.
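
A minimal sketch of that arithmetic in plain Python (not Ceph code; the
function names are just illustrative):

    # Tolerating f failures, "more or less":
    # data is not lost as long as one copy survives, so f lost copies
    # needs f + 1 replicas (size); monitors need a Paxos majority alive,
    # so f failed monitors needs 2f + 1 monitors.
    def replicas_needed(failures):
        return failures + 1

    def monitors_needed(failures):
        return 2 * failures + 1

    print(replicas_needed(2), monitors_needed(2))  # -> 3 5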


This is common to pretty much every quorum-based system in existence, not
just Ceph.  In my experience, 1 replica is fine for test instances that
have no expectation of data persistence or availability, 3 replicas is okay
for small instances that don't need any sort of strong availability
guarantee, and 5 replicas is really where you need to be for any sort of
large-scale production use.  I've been stuck running 3-way replicated quorum
systems in large-scale production, and they made any sort of planned
maintenance -- or really any back-end outage at all -- absolutely terrifying,
because you're left operating completely without a net.  Any additional
failure and the service craters spectacularly and publicly.  Since I really
hate reading newspaper articles about outages in my systems, I use 5-way
quorums whenever possible.
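
To put numbers on the "without a net" point, a rough sketch (plain Python,
purely illustrative):

    # A quorum of n members tolerates (n - 1) // 2 failures; taking one
    # member down for planned maintenance spends part of that budget.
    def failures_tolerated(n):
        return (n - 1) // 2

    for n in (3, 5, 7):
        spare = failures_tolerated(n) - 1  # one member down for maintenance
        print(n, "members:", spare, "additional failure(s) survivable")
    # 3 members: 0 additional failure(s) survivable
    # 5 members: 1 additional failure(s) survivable
    # 7 members: 2 additional failure(s) survivable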


Scott

On Sat Aug 30 2014 at 7:40:18 PM Joao Eduardo Luis <joao.luis at inktank.com>
wrote:

> Nigel mistakenly replied just to me, so I'm CC'ing the list back in.
>
> On 08/30/2014 08:12 AM, Nigel Williams wrote:
> > On Sat, Aug 30, 2014 at 11:59 AM, Joao Eduardo Luis
> > <joao.luis at inktank.com> wrote:
> >> But yeah, if you're going with 2 or 4, you'll be better off with 3 or 5.
> >> As long as you don't go with 1 you should be okay.
> >
> > At a recent panel discussion, one member strongly advocated 5 as the
> > minimum number of MONs for a large Ceph deployment. Large in this case
> > was PBs of storage.
> >
> > For a Ceph cluster with 100s of OSDs and 100s of TB across multiple
> > racks (and therefore many paths involved), are 5 MONs a good rule of
> > thumb or are three sufficient?
>
> Whoever stated that was probably right.  I don't usually like to speak
> about what works best for (really) large deployments, as I don't often
> see them.  In theory, 5 monitors will fare better than 3 for 100s of OSDs.
>
> As far as the monitors are concerned, this will be so mostly because 5
> monitors are able to serve more maps concurrently than 3 monitors would.
> I don't think we have tests to back my reasoning here, but I don't
> think that the cluster workload or its size would have that much of an
> impact on the number of monitors.  Although it's a technical detail,
> every message an OSD sends to a monitor that triggers a map update is
> *always* forwarded to the leader monitor.  This means that regardless of
> how many monitors you have, you'll always end up with the same monitor
> dealing with map updates, and that puts a cap on map update throughput --
> this is not that big of a deal, usually, and knobs can be adjusted if
> need be.
>
> On the other hand, having 5 monitors instead of 3 means that you'll be
> able to spread OSD connections across more monitors, and even if updates
> are forwarded to the leader, connection-wise the load is more spread out
> -- the message is forwarded by the monitor the OSD connects to, and that
> monitor acts as a proxy in replying to the OSD, so there's less hammering
> of the leader directly.
>
> But the point where this may actually make a real difference is in
> serving osdmap updates.  The OSDs need those updates.  Even considering
> that OSDs will share maps amongst themselves, they still need to get them
> from somewhere -- and that somewhere is the monitor cluster.  If you have
> 100s of OSDs connected to just 3 monitors, each monitor will end up
> serving bunches of reads (sending map updates to OSDs) while also dealing
> with messages that trigger map updates (which will in turn be forwarded
> to the leader).  Given that any client (OSDs included) connects to a
> monitor at random at startup and maintains that connection for a while, a
> "rule of thumb" would tell us that the leader would be responsible for
> serving 1/3 of all map reads while still handling map updates.  Having 5
> monitors would reduce this load to 1/5.
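
(A back-of-the-envelope version of that read-load argument in plain Python;
the 300-OSD figure is just an assumption for illustration:)

    osds = 300  # assumed cluster size, for illustration only
    for mons in (3, 5):
        print(mons, "monitors:", osds // mons, "OSDs' map reads per monitor;",
              "the leader also handles all map updates")
    # 3 monitors: 100 OSDs' map reads per monitor; ...
    # 5 monitors: 60 OSDs' map reads per monitor; ...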
>
> However, I don't know of a good indicator for whether a given cluster
> should go with 5 monitors instead of 3, or 7 monitors instead of 5.  I
> don't think there are many clusters running 7 monitors, but it may well
> be that for even larger clusters, having 5 or 7 monitors serving updates
> makes up for the increased number of messages required to commit an
> update -- keep in mind that, due to the nature of Paxos, one always needs
> an ack for an update from at least (N+1)/2 monitors.  Again, this cuts
> both ways: we may have more messages being passed around, but with each
> monitor under a lower load, those messages may be handled faster.
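
(The "(N+1)/2 acks" trade-off in numbers, again just an illustrative sketch
in plain Python:)

    # A commit needs a majority ack: floor(N/2) + 1, i.e. (N+1)/2 for odd N.
    for n in (3, 5, 7):
        acks = n // 2 + 1
        print(n, "monitors:", acks, "acks needed per committed update")
    # 3 monitors: 2 acks needed per committed update
    # 5 monitors: 3 acks needed per committed update
    # 7 monitors: 4 acks needed per committed update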
>
> I think I went a bit off track.
>
> Let me know if this led to further confusion instead.
>
>    -Joao
>
>
> --
> Joao Eduardo Luis
> Software Engineer | http://inktank.com | http://ceph.com

