Re: RFC: Extending corosync to high node counts

Darren Thompson <darrent@xxxxxxxxxxxxx> · Fri, 20 Mar 2015 06:47:20 +1100

Team
My first thought on this proposal: what mechanism is going to be in place to ensure that there is no "split brain" scenario?

For smaller rings we can rely on the fallback of a shared media/SBD device to ensure consistency.

If there is a comms interruption between ring members, is there a danger that each remaining half will then recruit new nodes from its "satellite spares"?

Do we need to consider a mechanism to adapt the node configuration (e.g. adding SBD devices), or is that just going to complicate things further?

Darren Thompson
Professional Services Engineer / Consultant

Level 3, 60 City Road
Southgate, VIC 3006
Mb: 0400 640 414
Mail: darrent@akurit.com.au

Web: www.akurit.com.au

On 19 March 2015 at 23:00,  <discuss-request@xxxxxxxxxxxx> wrote:

Today's Topics:

   1. RFC: Extending corosync to high node counts (Christine Caulfield)

----------------------------------------------------------------------

Message: 1
Date: Thu, 19 Mar 2015 10:05:41 +0000
From: Christine Caulfield <ccaulfie@xxxxxxxxxx>
To: discuss@xxxxxxxxxxxx
Subject: RFC: Extending corosync to high node counts
Message-ID: <550A9F75.5010309@xxxxxxxxxx>
Content-Type: text/plain; charset=utf-8

Extending corosync
------------------
This is an idea that came out of several discussions at the cluster
summit in February. Please comment!

It is not meant to be a generalised solution to extending corosync for
most users. For single- and double-digit cluster sizes the current ring
protocols should be sufficient. This is intended to make corosync usable
over much larger node counts.

The problem
-----------
Corosync doesn't scale well to large numbers of nodes (60-100 to 1000s).
This is mainly down to the requirements of virtual synchrony (VS) and the
ring protocol.

A proposed solution
-------------------
Have 'satellite' nodes that are not part of the ring (and do not
participate in VS).

They communicate via a single 'host' node over (possibly) TCP. The host
sends the messages to them in a 'send and forget' system, though TCP
guarantees ordering and delivery.

Host nodes can support many satellites. If a host goes down, the
satellites can reconnect to another node and carry on.

Satellites have no votes, and do not participate in Virtual Synchrony.

Satellites can send/receive CPG messages and get quorum information but
will not appear in the quorum nodes list.

There must be a separate nodes list for satellites, probably maintained
by a different subsystem.

Satellites will have nodeIDs (required for CPG) that do not clash with
the ring nodeids.
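
To make the two-list idea concrete, here is a minimal Python sketch of a registry that keeps ring nodes and satellites apart while refusing any nodeid that is already in use in either list. The class and method names (NodeRegistry, add_ring_node, add_satellite) are invented for illustration and are not corosync APIs, and one vote per ring node is an assumption.

```python
class NodeRegistry:
    """Hypothetical registry: ring (quorum) nodes and satellites in separate lists."""

    def __init__(self):
        self.ring = {}        # nodeid -> name, members of the totem ring
        self.satellites = {}  # nodeid -> name, non-voting satellites

    def _check_free(self, nodeid):
        # A nodeid must be unique across BOTH lists, not just within one.
        if nodeid in self.ring or nodeid in self.satellites:
            raise ValueError(f"duplicate nodeid {nodeid}")

    def add_ring_node(self, nodeid, name):
        self._check_free(nodeid)
        self.ring[nodeid] = name

    def add_satellite(self, nodeid, name):
        self._check_free(nodeid)
        self.satellites[nodeid] = name

    def votes(self):
        # Satellites have no votes; assume one vote per ring node.
        return len(self.ring)
```

Adding a satellite whose nodeid is already held by a ring node raises ValueError, and votes() is unchanged by however many satellites are added.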

Appearance to the user/admin
----------------------------
corosync.conf defines which nodes are satellites and which nodes to
connect to (initially). May want some utility to force satellites to
migrate from a node if it gets full.

Future: Automatic configuration of who is in the VS cluster and who is a
satellite. Load balancing.
        Maybe need 'preferred nodes' to avoid bad network topologies

Potential problems
------------------
corosync uses a packet-based protocol, TCP is a stream (I don't see this
as a big problem, TBH)

Where to hook the message transmission in the corosync networking stack?
  - We don't need a lot of the totem messages
  - maybe hook into group 'a' and/or 'sync' (do we need 'sync' in
    satellites [CPG, so probably yes]?)

Which is client/server? (if satellites are clients with authkey we get
easy failover and config, but ... DoS potential??)

What if TCP buffers get full? Suggest just cutting off the node.

How to stop satellites from running totemsrp?

Fencing, do we need it? (pacemaker problem?)

GFS2? Is this needed/possible?

Keeping two node lists (totem/quorum and satellite) - duplicate node IDs
are not allowed and this will need to be enforced.

No real idea if this will scale as well as I hope it will!
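
On the packet-versus-stream point, one common way to carry discrete messages over TCP is a length prefix in front of each message. The Python sketch below assumes a 4-byte big-endian length header; that format, and the names frame and Deframer, are illustrative choices and not anything corosync defines.

```python
import struct

HEADER = struct.Struct("!I")  # assumed framing: 4-byte big-endian payload length


def frame(payload: bytes) -> bytes:
    """Prefix one packet with its length so it can be recovered from a stream."""
    return HEADER.pack(len(payload)) + payload


class Deframer:
    """Reassembles whole packets from arbitrarily split TCP reads."""

    def __init__(self):
        self._buf = bytearray()

    def feed(self, data: bytes):
        """Add bytes read from the socket; return any complete packets."""
        self._buf += data
        messages = []
        while len(self._buf) >= HEADER.size:
            (length,) = HEADER.unpack_from(self._buf)
            end = HEADER.size + length
            if len(self._buf) < end:
                break  # packet not fully received yet
            messages.append(bytes(self._buf[HEADER.size:end]))
            del self._buf[:end]
        return messages
```

The sender writes frame(msg) to the socket; the receiver passes whatever each recv() returns to feed() and gets back zero or more whole packets, however the stream happened to be split.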

How it will (possibly) work
---------------------------
Totemsrp messages will be unaffected (in 1st revision at least);
satellites are not part of this protocol.

Totempg messages are sent around the ring as usual. When one arrives at
a node with satellites, the node forwards it around the ring as usual,
then sends that message to each of its satellites in turn. If a send
fails then that satellite is cut off and removed from the configuration.

When a message is received from a satellite it is repackaged as a
totempg message and sent around the cluster as normal.

Satellite nodes will be handled by another corosync service that is
loaded. Use a new corosync service handler to maintain the extra nodes
list and (maybe) do the satellite forwarding.
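
The forwarding behaviour described above could look roughly like the following Python sketch. Everything in it is hypothetical: the Host class, the ring_send callable and the dictionary used as a "repackaged" message stand in for whatever the real service handler and totempg envelope would be.

```python
class Host:
    """Hypothetical ring node that relays messages to and from satellites."""

    def __init__(self, ring_send):
        self.ring_send = ring_send  # callable that passes a message around the ring
        self.satellites = {}        # nodeid -> send callable (e.g. wrapping a TCP socket)

    def on_ring_message(self, msg):
        # Forward around the ring first, as usual.
        self.ring_send(msg)
        # Then send to each satellite in turn, fire and forget.
        for nodeid, send in list(self.satellites.items()):
            try:
                send(msg)
            except OSError:
                # A failed send cuts the satellite off.
                del self.satellites[nodeid]

    def on_satellite_message(self, nodeid, payload):
        # Repackage and inject into the ring; this envelope is made up.
        self.ring_send({"origin": nodeid, "payload": payload})
```

Iterating over a copy of the satellite table lets a failed satellite be dropped during the loop without disturbing delivery to the others.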

 - Joining
   A satellite sends a TCP connect and then a join request to its
   nominated (or fallback) host.
   The host can accept or reject this for reasons of (at least):
    - duplicated nodeid
    - no capacity
    - bad key
    - bad config
   The service then sends new node information to the rest of the
   cluster. Quorum is not affected.

 - Leaving
   If a TCP send fails or a socket is disconnected then the node is
   summarily removed.
    - there will probably also be a 'leave' message for tidy removal
    - leave notifications are sent around the cluster so that CPG and
      the secondary nodelist know.
    - quorum does not need to know.

 - Failover
   Satellites have a list of all nodes (quorum and satellite) and if a
   TCP connection is broken then they can try to contact the next node
   in the nodeid list of quorum nodes.
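
As a rough illustration of the join checks and the failover rule above, here is a Python sketch. All names (HostState, JoinResult, check_join, next_host) are invented. Treating "bad config" as a config version mismatch, and wrapping around to the lowest nodeid after the highest, are both assumptions; the proposal does not specify either.

```python
import hmac
from dataclasses import dataclass, field
from enum import Enum


class JoinResult(Enum):
    ACCEPTED = "accepted"
    DUPLICATE_NODEID = "duplicated nodeid"
    NO_CAPACITY = "no capacity"
    BAD_KEY = "bad key"
    BAD_CONFIG = "bad config"


@dataclass
class HostState:
    authkey: bytes
    config_version: int  # assumption: "bad config" means a version mismatch
    capacity: int
    known_nodeids: set = field(default_factory=set)  # ring + satellite ids
    satellites: set = field(default_factory=set)


def check_join(host, nodeid, key, config_version):
    """Decide whether a satellite's join request is accepted."""
    if not hmac.compare_digest(key, host.authkey):
        return JoinResult.BAD_KEY
    if config_version != host.config_version:
        return JoinResult.BAD_CONFIG
    if nodeid in host.known_nodeids:
        return JoinResult.DUPLICATE_NODEID
    if len(host.satellites) >= host.capacity:
        return JoinResult.NO_CAPACITY
    host.satellites.add(nodeid)
    host.known_nodeids.add(nodeid)
    return JoinResult.ACCEPTED


def next_host(ring_nodeids, failed_nodeid):
    """Pick the next ring node after the failed host, wrapping around."""
    candidates = sorted(n for n in ring_nodeids if n != failed_nodeid)
    for nid in candidates:
        if nid > failed_nodeid:
            return nid
    return candidates[0] if candidates else None
```

A satellite whose host (nodeid 2, say) drops the connection would call next_host on its list of quorum nodes and then send a fresh join request to the result.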

Timescales
----------
Nothing decided at this stage, certainly Corosync 3.0 at the earliest as
it will break on-wire protocol.

Need to do a proof-of-concept, maybe using containers to get high node
count.

------------------------------

_______________________________________________

discuss mailing list

discuss@xxxxxxxxxxxx

http://lists.corosync.org/mailman/listinfo/discuss

End of discuss Digest, Vol 43, Issue 10

***************************************
