Re: RFC: Extending corosync to high node counts

> On 19 Mar 2015, at 9:05 pm, Christine Caulfield <ccaulfie@xxxxxxxxxx> wrote:
> 
> Extending corosync
> ------------------
> 
> This is an idea that came out of several discussions at the cluster
> summit in February. Please comment !
> 
> It is not meant to be a generalised solution to extending corosync for
> most users. For single & double digit cluster sizes the current ring
> protocols should be sufficient. This is intended to make corosync usable
> over much larger node counts.
> 
> The problem
> -----------
> Corosync doesn't scale well to large numbers of nodes (from 60-100 up to
> 1000s). This is mainly down to the requirements of virtual synchrony (VS)
> and the ring protocol.
> 
> A proposed solution
> -------------------
> Have 'satellite' nodes that are not part of the ring (and do not
> participate in VS).
> They communicate via a single 'host' node over (possibly) TCP. The host
> sends the messages to them in a 'send and forget' system - though TCP
> guarantees ordering and delivery.
> Host nodes can support many satellites. If a host goes down, its
> satellites can reconnect to another node and carry on.
> 
> Satellites have no votes, and do not participate in Virtual Synchrony.
> 
> Satellites can send/receive CPG messages and get quorum information but
> will not appear in
> the quorum nodes list.
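
If that holds, a satellite-side application should not need to change at
all: it would keep talking to its local corosync over the existing libcpg
client API. Purely for reference, a minimal sketch of such a client - the
group name and payload are placeholders, and every call here is the
current public API, nothing new (build roughly with 'gcc sat_cpg.c -lcpg'):

/* sat_cpg.c - minimal CPG client; nothing satellite-specific here */
#include <stdio.h>
#include <string.h>
#include <sys/uio.h>
#include <corosync/cpg.h>

static void deliver_cb(cpg_handle_t h, const struct cpg_name *group,
                       uint32_t nodeid, uint32_t pid,
                       void *msg, size_t msg_len)
{
        /* called for every message delivered to the joined group */
        printf("msg from nodeid %u pid %u: %.*s\n",
               nodeid, pid, (int)msg_len, (char *)msg);
}

static void confchg_cb(cpg_handle_t h, const struct cpg_name *group,
                       const struct cpg_address *members, size_t n_members,
                       const struct cpg_address *left, size_t n_left,
                       const struct cpg_address *joined, size_t n_joined)
{
        /* membership changes still arrive the normal way */
        printf("group now has %zu members\n", n_members);
}

int main(void)
{
        cpg_handle_t handle;
        cpg_callbacks_t cb = {
                .cpg_deliver_fn = deliver_cb,
                .cpg_confchg_fn = confchg_cb,
        };
        struct cpg_name group = { .length = 4, .value = "test" };
        struct iovec iov = { .iov_base = "hello", .iov_len = 5 };

        if (cpg_initialize(&handle, &cb) != CS_OK)
                return 1;
        if (cpg_join(handle, &group) != CS_OK)
                return 1;
        cpg_mcast_joined(handle, CPG_TYPE_AGREED, &iov, 1);
        cpg_dispatch(handle, CS_DISPATCH_BLOCKING);
        cpg_finalize(handle);
        return 0;
}
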
> 
> There must be a separate nodes list for satellites, probably maintained
> by a different subsystem.
> Satellites will have nodeIDs (required for CPG) that do not clash with
> the ring nodeids.
> 
> 
> Appearance to the user/admin
> ----------------------------
> corosync.conf defines which nodes are satellites and which nodes they
> connect to (initially). We may want some utility to force satellites to
> migrate away from a node if it gets full.
> 
> Future: Automatic configuration of who is in the VS cluster and who is a
> satellite. Load balancing.
>        Maybe need 'preferred nodes' to avoid bad network topologies
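
None of this syntax exists today, but purely as a strawman the config side
could look something like the below - 'satellitelist' and its option names
are invented for illustration, only the 'nodelist' part is current syntax:

nodelist {
    node {
        ring0_addr: ring-a.example.com
        nodeid: 1
    }
    node {
        ring0_addr: ring-b.example.com
        nodeid: 2
    }
}

# hypothetical: satellites get their own list, with nodeids that must not
# clash with the ring nodeids, and an ordered list of hosts to try
satellitelist {
    satellite {
        addr: sat-01.example.com
        nodeid: 1001
        hosts: 1, 2
    }
}
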
> 
> 
> Potential problems
> ------------------
> corosync uses a packet-based protocol, whereas TCP is a stream (I don't
> see this as a big problem, TBH)
> Where to hook the message transmission in the corosync networking stack?
>  - We don't need a lot of the totem messages
>  - maybe hook into group 'a' and/or 'sync' (do we need 'sync' in
> satellites [CPG, so probably yes]?)
> Which is client/server? (if satellites are clients with an authkey we get
> easy failover and config, but ... DoS potential??)
> What if tcp buffers get full? Suggest just cutting off the node.
> How to stop satellites from running totemsrp?
> Fencing, do we need it? (pacemaker problem?)

That has traditionally been the model and it still seems appropriate.
However, Darren raises an interesting point: how will satellites know which is the "correct" partition to connect to?

What would it look like if we flipped it around and had the full peers connecting to the satellites?
You could then tie that to having quorum. You would also know that a fenced full peer won't have any connections.
Safety on two levels.
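
To make that concrete: the full peers already have libquorum locally, so
gating the outbound connections on quorum could be as small as the sketch
below. quorum_initialize/quorum_getquorate are the existing client API;
connect_to_satellite() and drop_all_satellites() are invented placeholders
for the new satellite code, and a real implementation would track quorum
via the notification callback rather than polling like this.

#include <corosync/quorum.h>

extern int connect_to_satellite(const char *addr);    /* hypothetical */
extern void drop_all_satellites(void);                /* hypothetical */

static int partition_is_quorate(void)
{
        quorum_handle_t qh;
        quorum_callbacks_t qcb = { NULL };
        uint32_t qtype;
        int quorate = 0;

        if (quorum_initialize(&qh, &qcb, &qtype) != CS_OK)
                return 0;
        if (quorum_getquorate(qh, &quorate) != CS_OK)
                quorate = 0;
        quorum_finalize(qh);
        return quorate;
}

void satellite_maintenance(const char **sat_addrs, int n_sats)
{
        if (!partition_is_quorate()) {
                /* an inquorate (or fenced) peer holds no connections */
                drop_all_satellites();
                return;
        }
        /* only a quorate peer dials out to its satellites */
        for (int i = 0; i < n_sats; i++)
                connect_to_satellite(sat_addrs[i]);
}
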

> GFS2? Is this needed/possible?
> Keeping two node lists (totem/quorum and satellite) - duplicate node IDs
> are not allowed and this will need to be enforced.
> No real idea if this will scale as well as I hope it will!
> 
> 
> How it will (possibly) work
> ---------------------------
> Totemsrp messages will be unaffected (in the first revision at least);
> satellites are not part of this protocol.
> Totempg messages are sent around the ring as usual.
> When one arrives at a node with satellites, that node forwards it around
> the ring as usual, then sends the message to each of its satellites in
> turn. If a send fails then the satellite is cut off and removed from the
> configuration.
> When a message is received from a satellite it is repackaged as a
> totempg message and sent around the cluster as normal.
> Satellite nodes will be handled by another corosync service: a new
> service handler will maintain the extra nodes list and (maybe) do the
> satellite forwarding.
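
For the stream-vs-packet point and the 'cut off on failure' behaviour
above, the host-to-satellite leg could use a trivial length-prefixed frame
so packet boundaries survive TCP, and treat any send failure (including a
full socket buffer) as reason to drop the satellite. A rough sketch - the
satellite list type and remove_satellite() are invented for illustration:

#include <stdint.h>
#include <string.h>
#include <sys/types.h>
#include <sys/socket.h>
#include <arpa/inet.h>

struct satellite {                 /* hypothetical bookkeeping entry */
        uint32_t nodeid;
        int fd;                    /* connected, non-blocking TCP socket */
        struct satellite *next;
};

extern void remove_satellite(struct satellite *sat);   /* hypothetical */

/* write the whole buffer or fail; no retry on full buffers */
static int send_all(int fd, const void *buf, size_t len)
{
        const char *p = buf;

        while (len > 0) {
                ssize_t n = send(fd, p, len, MSG_NOSIGNAL);
                if (n <= 0)
                        return -1;      /* error, disconnect or full buffer */
                p += n;
                len -= (size_t)n;
        }
        return 0;
}

/* fan a totempg message out to every attached satellite, send-and-forget */
void forward_to_satellites(struct satellite *list,
                           const void *msg, uint32_t msg_len)
{
        uint32_t hdr = htonl(msg_len);   /* 4-byte length prefix, network order */
        struct satellite *s = list;

        while (s != NULL) {
                struct satellite *next = s->next;   /* s may be removed below */

                if (send_all(s->fd, &hdr, sizeof(hdr)) < 0 ||
                    send_all(s->fd, msg, msg_len) < 0) {
                        /* any failure cuts the satellite off, as above */
                        remove_satellite(s);
                }
                s = next;
        }
}
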
> 
> - Joining
>  A satellite sends a TCP connect and then a join request to its
> nominated (or fallback) host.
>  The host can accept or reject this for reasons of (at least):
>   - duplicated nodeid
>   - no capacity
>   - bad key
>   - bad config
>  The service then sends the new node's information to the rest of the
>  cluster. Quorum is not affected.
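
Purely to make the accept/reject reasons above concrete, the join
handshake could be as small as something like this - none of it is
existing corosync wire format, the field names are invented, and a real
version would reuse the existing authkey/HMAC handling and worry about
endianness and packing:

#include <stdint.h>

#define SAT_PROTO_VERSION 1

enum sat_join_status {
        SAT_JOIN_OK = 0,
        SAT_JOIN_DUPLICATE_NODEID,   /* clashes with ring or satellite list */
        SAT_JOIN_NO_CAPACITY,        /* host already has too many satellites */
        SAT_JOIN_BAD_KEY,            /* authentication failed */
        SAT_JOIN_BAD_CONFIG          /* e.g. cluster name mismatch */
};

struct sat_join_request {
        uint32_t proto_version;      /* SAT_PROTO_VERSION */
        uint32_t nodeid;             /* must not clash with ring nodeids */
        uint8_t  auth[64];           /* HMAC over the request, from authkey */
        char     cluster_name[64];
};

struct sat_join_reply {
        uint32_t status;             /* enum sat_join_status */
        uint32_t host_nodeid;        /* ring nodeid of the accepting host */
};
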
> 
> - leaving
>   If a TCP send fails or a socket is disconnected then the satellite is
> summarily removed
>   - there will probably also be a 'leave' message for tidy removal
>   - leave notifications are sent around the cluster so that CPG and the
> secondary nodelist know.
>   - quorum does not need to know.
> 
> - failover
>   Satellites have a list of all nodes (quorum and satellite), and if a
>   TCP connection is broken they can try to contact the next node in the
>   nodeid list of quorum nodes.
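
That could be as simple as walking the quorum-node list in nodeid order
until someone accepts the join; sketch below, where sat_connect() and
sat_send_join() are hypothetical helpers for the connect and handshake
steps described above:

#include <stdint.h>
#include <stddef.h>
#include <unistd.h>

extern int sat_connect(uint32_t nodeid);              /* hypothetical: TCP connect, fd or -1 */
extern int sat_send_join(int fd, uint32_t my_nodeid); /* hypothetical: join handshake, 0 = accepted */

/* satellite-side failover: returns a connected fd, or -1 if no host is usable */
int satellite_failover(const uint32_t *ring_nodeids, size_t n_ring,
                       uint32_t my_nodeid)
{
        for (size_t i = 0; i < n_ring; i++) {
                int fd = sat_connect(ring_nodeids[i]);

                if (fd < 0)
                        continue;               /* host unreachable, try the next */
                if (sat_send_join(fd, my_nodeid) == 0)
                        return fd;              /* reattached, carry on as before */
                close(fd);                      /* rejected (full, bad key, ...) */
        }
        return -1;
}
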
> 
> Timescales
> ----------
> Nothing decided at this stage; certainly Corosync 3.0 at the earliest, as
> it will break the on-wire protocol.
> Need to do a proof-of-concept, maybe using containers to get a high node
> count.


_______________________________________________
discuss mailing list
discuss@xxxxxxxxxxxx
http://lists.corosync.org/mailman/listinfo/discuss



