On 19/03/15 19:47, Darren Thompson wrote:
> Team
>
> My first thought re this proposal: what mechanism is going to be in
> place to ensure that there is no "split brain" scenario?
>
> For smaller rings we can rely on the fall-back of a "shared media/SBD
> device" to ensure that there is consistency.
>
> If there is a comms interruption between ring members, is there a
> danger that each remaining half will then recruit new nodes from
> their "satellite spares"?

Unless I've misunderstood you, I don't think this is a problem because
the satellites have no votes and don't affect quorum in any way. If
there is a split in the VS part of the cluster then the active
partition will be determined using the normal quorum/fencing methods,
and any satellites connected to the losing partition will therefore
also be inactive.

Chrissie

> Do we need to consider a mechanism to adapt the node configuration
> (e.g. adding SBD devices), or is that just going to complicate things
> further?
>
> Darren Thompson
> Professional Services Engineer / Consultant
>
> Level 3, 60 City Road
> Southgate, VIC 3006
>
> Mb: 0400 640 414
> Mail: darrent@xxxxxxxxxxxxx
> Web: www.akurit.com.au
>
>
> On 19 March 2015 at 23:00, <discuss-request@xxxxxxxxxxxx> wrote:
>
> Date: Thu, 19 Mar 2015 10:05:41 +0000
> From: Christine Caulfield <ccaulfie@xxxxxxxxxx>
> To: discuss@xxxxxxxxxxxx
> Subject: RFC: Extending corosync to high node counts
>
> Extending corosync
> ------------------
>
> This is an idea that came out of several discussions at the cluster
> summit in February. Please comment!
>
> It is not meant to be a generalised solution to extending corosync
> for most users. For single- and double-digit cluster sizes the
> current ring protocols should be sufficient. This is intended to make
> corosync usable at much larger node counts.
>
> The problem
> -----------
> Corosync doesn't scale well to large numbers of nodes (60-100 up to
> 1000s). This is mainly down to the requirements of virtual synchrony
> (VS) and the ring protocol.
>
> A proposed solution
> -------------------
> Have 'satellite' nodes that are not part of the ring (and do not
> participate in VS). They communicate via a single 'host' node over
> (possibly) TCP. The host sends the messages to them in a 'send and
> forget' fashion - though TCP guarantees ordering and delivery. Host
> nodes can support many satellites. If a host goes down, its
> satellites can reconnect to another node and carry on.
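
(Interjecting on my own text for a moment: to make the 'send and
forget' forwarding a little more concrete, here is a very rough sketch
of what the host side could look like. None of this code exists in
corosync today; the structures, function names and framing are all
invented purely for illustration.)

/*
 * Illustrative only: host-side forwarding of a totempg message to its
 * connected satellites.  All names here are made up.
 */
#include <stdint.h>
#include <stdlib.h>
#include <unistd.h>
#include <sys/uio.h>
#include <arpa/inet.h>

struct satellite {
    int fd;                  /* connected TCP socket */
    uint32_t nodeid;         /* must not clash with ring nodeids */
    struct satellite *next;
};

/*
 * Called after the message has been forwarded around the ring as
 * usual.  TCP is a stream, so each message is framed with a simple
 * 4-byte length prefix.  A failed (or short) send cuts the satellite
 * off, as suggested in the notes below.
 */
static void forward_to_satellites(struct satellite **list,
                                  const void *msg, uint32_t len)
{
    struct satellite **sp = list;

    while (*sp != NULL) {
        struct satellite *sat = *sp;
        uint32_t netlen = htonl(len);
        struct iovec iov[2] = {
            { .iov_base = &netlen,     .iov_len = sizeof(netlen) },
            { .iov_base = (void *)msg, .iov_len = len }
        };

        if (writev(sat->fd, iov, 2) !=
            (ssize_t)(sizeof(netlen) + len)) {
            /* Cut the satellite off and remove it from the
             * configuration; a leave notification would then be
             * sent around the cluster. */
            close(sat->fd);
            *sp = sat->next;
            free(sat);
        } else {
            sp = &sat->next;
        }
    }
}
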
>
> Satellites have no votes, and do not participate in Virtual
> Synchrony.
>
> Satellites can send/receive CPG messages and get quorum information
> but will not appear in the quorum nodes list.
>
> There must be a separate nodes list for satellites, probably
> maintained by a different subsystem. Satellites will have nodeids
> (required for CPG) that do not clash with the ring nodeids.
>
>
> Appearance to the user/admin
> ----------------------------
> corosync.conf defines which nodes are satellites and which nodes to
> connect to (initially). We may want some utility to force satellites
> to migrate away from a node if it gets full.
>
> Future: automatic configuration of who is in the VS cluster and who
> is a satellite; load balancing. We may need 'preferred nodes' to
> avoid bad network topologies.
>
>
> Potential problems
> ------------------
> - corosync uses a packet-based protocol, TCP is a stream (I don't see
>   this as a big problem, TBH).
> - Where to hook the message transmission in the corosync networking
>   stack? We don't need a lot of the totem messages - maybe hook into
>   group 'a' and/or 'sync' (do we need 'sync' in satellites? [CPG, so
>   probably yes]).
> - Which is client/server? (If satellites are clients with an authkey
>   we get easy failover and config, but ... DOS potential??)
> - What if TCP buffers get full? Suggest just cutting off the node.
> - How to stop satellites from running totemsrp?
> - Fencing: do we need it? (pacemaker problem?)
> - GFS2: is this needed/possible?
> - Keeping two node lists (totem/quorum and satellite) - duplicate
>   node IDs are not allowed and this will need to be enforced.
> - No real idea if this will scale as well as I hope it will!
>
>
> How it will (possibly) work
> ---------------------------
> Totemsrp messages will be unaffected (in the 1st revision at least);
> satellites are not part of this protocol.
> Totempg messages are sent around the ring as usual. When one arrives
> at a node with satellites, it forwards it around the ring as usual,
> then sends that message to each of its satellites in turn. If a send
> fails then the satellite is cut off and removed from the
> configuration.
> When a message is received from a satellite it is repackaged as a
> totempg message and sent around the cluster as normal.
> Satellite nodes will be handled by another corosync service that is
> loaded: a new corosync service handler that maintains the extra nodes
> list and (maybe) does the satellite forwarding.
>
> - Joining
>   A satellite sends a TCP connect and then a join request to its
>   nominated (or fallback) host. The host can accept or reject this
>   for reasons of (at least):
>   - duplicated nodeid
>   - no capacity
>   - bad key
>   - bad config
>   The service then sends the new node information to the rest of the
>   cluster; quorum is not affected.
>
> - Leaving
>   If a TCP send fails or a socket is disconnected then the node is
>   summarily removed.
>   - There will probably also be a 'leave' message for tidy removal.
>   - Leave notifications are sent around the cluster so that CPG and
>     the secondary nodelist know.
>   - Quorum does not need to know.
>
> - Failover
>   Satellites have a list of all nodes (quorum and satellite) and if a
>   TCP connection is broken then they can try to contact the next node
>   in the nodeid list of quorum nodes.
>
> Timescales
> ----------
> Nothing decided at this stage; certainly Corosync 3.0 at the earliest
> as it will break the on-wire protocol.
> Need to do a proof-of-concept, maybe using containers to get a high
> node count.
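
(A couple of sketches of my own to make the above a bit more concrete.
First, the admin-facing side: a hypothetical corosync.conf fragment.
The nodelist/ring0_addr/nodeid directives are the ones we already
have; the satellitelist section and everything in it is invented and
only shows the kind of information that would need to be expressed.)

# Hypothetical corosync.conf fragment.  The nodelist section below is
# the one corosync already has; the satellitelist section and its
# directives are invented purely to illustrate the idea.

nodelist {
    node {
        ring0_addr: ringnode1
        nodeid: 1
    }
    node {
        ring0_addr: ringnode2
        nodeid: 2
    }
}

# Invented: satellites are listed separately so their nodeids can be
# checked against the ring nodeids, and each names the host node(s) it
# should try first.
satellitelist {
    satellite {
        addr: sat101
        nodeid: 101
        hosts: ringnode1 ringnode2
    }
}
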
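(Second, a very rough sketch of the satellite side of the Joining and
Failover sections above: walk the list of quorum nodes, connect over
TCP, send a join request, and move on to the next node if the host
rejects us or the connection later breaks. Again, none of this code
exists; the message format, reject reasons and names are made up.)

/*
 * Illustrative only: satellite-side join and failover.
 */
#include <stdint.h>
#include <unistd.h>
#include <arpa/inet.h>
#include <sys/socket.h>
#include <netinet/in.h>

enum join_status {               /* possible replies from the host */
    JOIN_OK,
    JOIN_DUPLICATE_NODEID,
    JOIN_NO_CAPACITY,
    JOIN_BAD_KEY,
    JOIN_BAD_CONFIG
};

struct known_node {
    struct sockaddr_in addr;
    int is_quorum_node;          /* satellites only connect to ring nodes */
};

/*
 * Walk the node list until a quorum node accepts our join request.
 * Returns a connected socket, or -1 if nobody will have us.  Called
 * at startup and again whenever the host connection breaks.
 */
static int satellite_connect(const struct known_node *nodes, int n_nodes,
                             uint32_t my_nodeid)
{
    for (int i = 0; i < n_nodes; i++) {
        if (!nodes[i].is_quorum_node)
            continue;

        int fd = socket(AF_INET, SOCK_STREAM, 0);
        if (fd < 0)
            continue;
        if (connect(fd, (const struct sockaddr *)&nodes[i].addr,
                    sizeof(nodes[i].addr)) != 0) {
            close(fd);
            continue;            /* try the next quorum node */
        }

        /* Invented join request: just our nodeid; a real version
         * would also carry the auth key, config checksum, etc. */
        uint32_t req = htonl(my_nodeid);
        uint32_t reply;
        if (write(fd, &req, sizeof(req)) != (ssize_t)sizeof(req) ||
            read(fd, &reply, sizeof(reply)) != (ssize_t)sizeof(reply) ||
            ntohl(reply) != JOIN_OK) {
            close(fd);           /* rejected: duplicate id, full, ... */
            continue;
        }
        return fd;               /* joined; carry on via this host */
    }
    return -1;
}

Chrissie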
_______________________________________________
discuss mailing list
discuss@xxxxxxxxxxxx
http://lists.corosync.org/mailman/listinfo/discuss