Re: RFC: Extending corosync to high node counts

Christine Caulfield <ccaulfie@xxxxxxxxxx> · Mon, 23 Mar 2015 09:09:58 +0000

On 23/03/15 03:11, Andrew Beekhof wrote:
> 
>> On 19 Mar 2015, at 9:05 pm, Christine Caulfield <ccaulfie@xxxxxxxxxx> wrote:
>>
>> Extending corosync
>> ------------------
>>
>> This is an idea that came out of several discussions at the cluster
>> summit in February. Please comment!
>>
>> It is not meant to be a generalised solution to extending corosync for
>> most users. For single & double digit cluster sizes the current ring
>> protocols should be sufficient. This is intended to make corosync usable
>> over much larger node counts.
>>
>> The problem
>> -----------
>> Corosync doesn't scale well to large numbers of nodes (60-100 to 1000s).
>> This is mainly down to the requirements of virtual synchrony (VS) and the
>> ring protocol.
>>
>> A proposed solution
>> -------------------
>> Have 'satellite' nodes that are not part of the ring (and do not
>> participate in VS).
>> They communicate via a single 'host' node over (possibly) TCP. The host
>> sends the messages to them in a 'send and forget' system - though TCP
>> guarantees ordering and delivery.
>> Host nodes can support many satellites. If a host goes down the
>> satellites can reconnect to another node and carry on.
>>
>> Satellites have no votes, and do not participate in Virtual Synchrony.
>>
>> Satellites can send/receive CPG messages and get quorum information but
>> will not appear in the quorum nodes list.
>>
>> There must be a separate nodes list for satellites, probably maintained
>> by a different subsystem.
>> Satellites will have nodeIDs (required for CPG) that do not clash with
>> the ring nodeIDs.
>>
>>
>> Appearance to the user/admin
>> ----------------------------
>> corosync.conf defines which nodes are satellites and which nodes to
>> connect to (initially). May want some utility to force satellites to
>> migrate from a node if it gets full.
>>
>> Future: Automatic configuration of who is in the VS cluster and who is a
>> satellite. Load balancing.
>>        Maybe need 'preferred nodes' to avoid bad network topologies
>>
>>
>> Potential problems
>> ------------------
>> corosync uses a packet-based protocol; TCP is a stream (I don't see this
>> as a big problem, TBH)
>> Where to hook the message transmission in the corosync networking stack?
>>  - We don't need a lot of the totem messages
>>  - maybe hook into group 'a' and/or 'sync' (do we need 'sync' in
>>    satellites [CPG, so probably yes]?)
>> Which is client/server? (if satellites are client with authkey we get
>> easy failover and config, but ... DoS potential??)
>> What if TCP buffers get full? Suggest just cutting off the node.
>> How to stop satellites from running totemsrp?
>> Fencing, do we need it? (pacemaker problem?)
> 
> That has traditionally been the model and it still seems appropriate.
> However Darren raises an interesting point... how will satellites know which is the "correct" partition to connect to?
> 
> What would it look like if we flipped it around and had the full peers connecting to the satellites?
> You could then tie that to having quorum. You also know that a fenced full peer won't have any connections.
> Safety on two levels.
> 

I think this is a better, if slightly more complex, model to implement,
yes.  It also avoids the potential DoS of satellites trying to contact
central cluster nodes repeatedly.
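
To make the flipped model a bit more concrete, here is a minimal sketch
in Python. It is not corosync code and does no real networking; the
FullPeer class and its method names are hypothetical, invented purely
for illustration. It only shows the rule being discussed: a full peer
holds connections to its satellites while it is quorate, and a peer that
loses quorum (or is fenced) holds none.

```python
class FullPeer:
    """Hypothetical full (ring) peer that owns the links to its satellites."""

    def __init__(self, name, satellites):
        self.name = name
        self.satellites = list(satellites)  # satellites assigned to this peer
        self.quorate = False                # no quorum until told otherwise
        self.connected = set()              # satellites we currently hold links to

    def set_quorate(self, quorate):
        """On a quorum change: connect out when quorate, drop all links otherwise."""
        self.quorate = quorate
        if quorate:
            self.connected.update(self.satellites)
        else:
            self.connected.clear()

    def fence(self):
        """A fenced peer is treated as non-quorate, so it keeps no connections."""
        self.set_quorate(False)

    def send(self, message):
        """Send and forget: return the satellites the message was handed to."""
        if not self.quorate:
            return []
        return sorted(self.connected)
```

Because the satellites never initiate a connection in this sketch, they
cannot end up attached to the wrong partition, and there is no inbound
connection path for them to flood.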

Chrissie