Re: Upper limit of MONs and MDSs in a Cluster

You absolutely cannot do this with your monitors -- as David says, every node would have to participate in every monitor decision; the long tails would be horrifying, and I expect it would collapse in ignominious defeat very quickly.

Your MDSes should be fine since they are indeed just a bunch of standby daemons at that point. You'd want to consider how that fits with your RAM requirements though; it's probably not a good deployment decision even though it would work at the daemon level.
-Greg


On Thu, May 25, 2017 at 8:30 AM David Turner <drakonstein@xxxxxxxxx> wrote:
For the MDS, the primary doesn't hold state that needs to be replayed to a standby.  The information lives in the cluster.  Your setup would be 1 active, 100 standby.  If the active went down, one of the standbys would be promoted and would read that information from the cluster.

With mons, it's interesting because of the quorum mechanics.  4 mons is worse than 3: quorum still requires a strict majority (3 of the 4), so you can only lose 1 mon either way, and a 2-2 split doesn't get resolved by a tie-breaking vote -- it just leaves you with no quorum and a stalled cluster.  Odd numbers are always best, and it seems like your proposal would regularly have an even number of mons.  I haven't heard of a deployment with more than 5 mons.  I would imagine there are some with 7 mons out there, but it's not worth the hardware expense in 99.999% of cases.
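
(A minimal sketch of the quorum arithmetic behind that point: monitor quorum needs a strict majority, floor(N/2) + 1, so an even count buys no extra failure tolerance over the odd count below it. Illustrative Python only, not Ceph code.)

    # Majority-quorum arithmetic for N monitors (illustration, not Ceph code).
    def quorum_info(n_mons):
        quorum = n_mons // 2 + 1        # strict majority needed to make decisions
        tolerated = n_mons - quorum     # mons that can be down while quorum holds
        return quorum, tolerated

    for n in (3, 4, 5, 7):
        q, t = quorum_info(n)
        print(f"{n} mons: quorum={q}, failures tolerated={t}")
    # 3 mons: quorum=2, failures tolerated=1
    # 4 mons: quorum=3, failures tolerated=1   <- same tolerance as 3, more to lose
    # 5 mons: quorum=3, failures tolerated=2
    # 7 mons: quorum=4, failures tolerated=3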

I'm assuming your question comes from a place of wanting one configuration to rule them all and not having multiple types of nodes in your Ceph deployment scripts.  Just put in the time and do it right.  Have MDS servers, have mons, have OSD nodes, etc.  Once you reach scale, your mons are going to need their resources, your OSDs are going to need theirs, your RGW will be using more bandwidth, ad infinitum.  And that isn't even counting all of the RAM the services will need during any recovery (assume 3x the memory requirements for most Ceph services when recovering).
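
(As a rough illustration of that sizing point, here is a back-of-the-envelope sketch for one colocated node. The per-daemon figures and daemon counts are made-up placeholders, and the 3x factor is just the rule of thumb quoted above.)

    # Back-of-the-envelope RAM sizing for a colocated node (placeholder numbers,
    # not measured values).
    steady_state_gib = {"mon": 2, "mgr": 1, "mds": 4, "rgw": 2, "osd": 2}  # per daemon
    daemons = {"mon": 1, "mgr": 1, "mds": 1, "rgw": 1, "osd": 12}          # per node
    recovery_factor = 3  # "assume 3x the memory requirements ... when recovering"

    steady = sum(steady_state_gib[d] * n for d, n in daemons.items())
    print(f"steady state: ~{steady} GiB, recovery headroom: ~{steady * recovery_factor} GiB")
    # steady state: ~33 GiB, recovery headroom: ~99 GiB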

Hyperconverged clusters are not recommended for production deployments.  Several people use them, but generally for smaller clusters.  By the time you reach dozens or hundreds of servers, you will only cause yourself headaches by becoming the special snowflake in the community.  Every time you have a problem, the first place to look will be resource contention between the colocated Ceph daemons.


Back to some of your direct questions.  I haven't tested this, so these are educated guesses...  A possible complication of having hundreds of mons is that they all have to agree on each new map, causing a LOT more communication between your mons, which could easily become a bottleneck for map updates (snapshot creation/deletion, OSDs going up/down, scrubs happening, anything that changes a map).  When an MDS fails, I don't know how the selection of a new active MDS would go among 100 standbys; it could be very quick or take quite a bit longer depending on the logic behind the choice.  Hundreds of RGW servers behind a load balancer (I'm assuming) would also negate any caching happening on the RGW servers, since multiple accesses to the same file will not likely reach the same RGW.
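
(To put a rough number on the "a LOT more communication" point: if every map update has to be proposed to every monitor and acknowledged back, per-update message traffic grows roughly linearly with the monitor count. The sketch below assumes a fixed number of leader<->peer exchanges per update; it illustrates the O(N) growth, not Ceph's actual Paxos implementation.)

    # Rough scaling of per-map-update messages with monitor count.
    # Assumption: a constant number of leader<->peer exchanges per committed update.
    def messages_per_update(n_mons, exchanges=3):
        return exchanges * (n_mons - 1)

    for n in (3, 5, 7, 100, 300):
        print(f"{n:>4} mons: ~{messages_per_update(n)} messages per map update")
    #    3 mons: ~6 messages per map update
    #  100 mons: ~297 messages per map update
    #  300 mons: ~897 messages per map update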

On Thu, May 25, 2017 at 10:40 AM Wes Dillingham <wes_dillingham@xxxxxxxxxxx> wrote:
How much testing has there been, and what are the implications of having a large number of monitor and metadata daemons running in a cluster?

Thus far I have deployed all of our Ceph clusters with a single service type per physical machine, but I am interested in a use case where we deploy dozens (hundreds?) of boxes, each of which would be a mon, mds, mgr, osd, and rgw all in one, all in a single cluster. I do realize it is somewhat trivial (with config management and all) to dedicate a couple of lean boxes as MDSes and mons and only expand at the OSD level, but I'm still curious.

The use case I have in mind is backup targets where pools span the entire cluster, and I am looking to streamline the process for possible rack-and-stack situations where boxes can just be added in place, booted up, and auto-join the cluster as a mon/mds/mgr/osd/rgw.

So does anyone run clusters with dozens of MONs and/or MDSes, or is anyone aware of any testing with very high numbers of each? At the MDS level I would just be looking for 1 active, 1 standby-replay, and X standby until multiple active MDSes are production-ready. Thanks!

--
Respectfully,

Wes Dillingham
Research Computing | Infrastructure Engineer
Harvard University | 38 Oxford Street, Cambridge, Ma 02138 | Room 102

_______________________________________________
ceph-users mailing list
ceph-users@xxxxxxxxxxxxxx
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
