RFC: Extending corosync to high node counts

Extending corosync
------------------

This is an idea that came out of several discussions at the cluster
summit in February. Please comment!

It is not meant to be a generalised solution to extending corosync for
most users. For single- and double-digit cluster sizes the current ring
protocols should be sufficient. This is intended to make corosync usable
at much larger node counts.

The problem
-----------
Corosync doesn't scale well to large numbers of nodes (from 60-100 up
into the 1000s). This is mainly down to the requirements of virtual
synchrony (VS) and the ring protocol.

A proposed solution
-------------------
Have 'satellite' nodes that are not part of the ring (and do not
participate in VS).
They communicate with a single 'host' node over (probably) TCP. The host
sends messages to them in a 'send and forget' fashion - though TCP
guarantees ordering and delivery.
Host nodes can support many satellites. If a host goes down, its
satellites can reconnect to another node and carry on.

Satellites have no votes, and do not participate in Virtual Synchrony.

Satellites can send/receive CPG messages and get quorum information but
will not appear in
the quorum nodes list.
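
For applications nothing should need to change: a process on a satellite
would keep using the existing libcpg API exactly as it does on a ring
member. A minimal sketch using only the standard CPG calls (the group
name is arbitrary):

    #include <stdio.h>
    #include <sys/uio.h>
    #include <corosync/cpg.h>

    static void deliver_cb(cpg_handle_t handle, const struct cpg_name *group,
                           uint32_t nodeid, uint32_t pid,
                           void *msg, size_t msg_len)
    {
        /* Messages arrive here the same way on a ring member or a satellite */
        printf("msg from nodeid %u pid %u, %zu bytes\n", nodeid, pid, msg_len);
    }

    static void confchg_cb(cpg_handle_t handle, const struct cpg_name *group,
                           const struct cpg_address *members, size_t n_members,
                           const struct cpg_address *left, size_t n_left,
                           const struct cpg_address *joined, size_t n_joined)
    {
        printf("group now has %zu members\n", n_members);
    }

    static cpg_callbacks_t callbacks = {
        .cpg_deliver_fn = deliver_cb,
        .cpg_confchg_fn = confchg_cb,
    };

    int main(void)
    {
        cpg_handle_t handle;
        struct cpg_name group = { .length = 4, .value = "test" };
        struct iovec iov = { .iov_base = "hello", .iov_len = 5 };

        if (cpg_initialize(&handle, &callbacks) != CS_OK)
            return 1;
        if (cpg_join(handle, &group) != CS_OK)
            return 1;
        cpg_mcast_joined(handle, CPG_TYPE_AGREED, &iov, 1);
        cpg_dispatch(handle, CS_DISPATCH_ONE);
        cpg_finalize(handle);
        return 0;
    }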

There must be a separate nodes list for satellites, probably maintained
by a different subsystem.
Satellites will have nodeIDs (required for CPG) that do not clash with
the ring nodeids.
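
A rough sketch of what that secondary list might look like inside the new
service. The names (satellite_node, satellite_list, ring_nodeid_exists)
are invented for illustration, not existing corosync structures; the
important part is rejecting node IDs that clash with either list:

    #include <stdint.h>
    #include <stdbool.h>
    #include <stdlib.h>

    /* Hypothetical entry in the satellite node list, kept by the new
       service and entirely separate from the totem/quorum member list. */
    struct satellite_node {
        uint32_t nodeid;             /* must not clash with any ring nodeid */
        int fd;                      /* TCP connection to this satellite */
        struct satellite_node *next;
    };

    static struct satellite_node *satellite_list;

    /* Provided elsewhere: is this nodeid already a ring (totem/quorum) member? */
    extern bool ring_nodeid_exists(uint32_t nodeid);

    static bool satellite_nodeid_exists(uint32_t nodeid)
    {
        for (struct satellite_node *n = satellite_list; n != NULL; n = n->next)
            if (n->nodeid == nodeid)
                return true;
        return false;
    }

    /* Returns false (reject the join) on a duplicate nodeid in either list. */
    static bool satellite_add(uint32_t nodeid, int fd)
    {
        if (ring_nodeid_exists(nodeid) || satellite_nodeid_exists(nodeid))
            return false;

        struct satellite_node *n = malloc(sizeof(*n));
        if (n == NULL)
            return false;
        n->nodeid = nodeid;
        n->fd = fd;
        n->next = satellite_list;
        satellite_list = n;
        return true;
    }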


Appearance to the user/admin
----------------------------
corosync.conf defines which nodes are satellites and which hosts they
connect to (initially). We may want some utility to force satellites to
migrate away from a host if it gets full.
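
Something along these lines, perhaps. This is purely illustrative: only
the nodelist block uses existing corosync.conf syntax; the satellitelist
section and its keys are invented for this sketch.

    nodelist {
        # Normal ring members (quorum, VS) as today
        node {
            ring0_addr: ringnode1
            nodeid: 1
        }
        node {
            ring0_addr: ringnode2
            nodeid: 2
        }
    }

    # Hypothetical new section for satellites
    satellitelist {
        satellite {
            addr: satnode100
            nodeid: 100                    # must not clash with ring nodeids
            hosts: ringnode1, ringnode2    # preferred hosts, in order
        }
    }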

Future: Automatic configuration of who is in the VS cluster and who is a
satellite. Load balancing.
        Maybe need 'preferred nodes' to avoid bad network topologies


Potential problems
------------------
- corosync uses a packet-based protocol, TCP is a stream (I don't see this
  as a big problem, TBH).
- Where to hook the message transmission in the corosync networking stack?
  - We don't need a lot of the totem messages.
  - Maybe hook into group 'a' and/or 'sync' (do we need 'sync' in
    satellites? [CPG, so probably yes]).
- Which end is client and which is server? (If satellites are clients with
  an authkey we get easy failover and config, but ... DoS potential??)
- What if the TCP buffers get full? Suggest just cutting off the node
  (see the sketch after this list).
- How to stop satellites from running totemsrp?
- Fencing: do we need it? (a pacemaker problem?)
- GFS2: is this needed/possible?
- Keeping two node lists (totem/quorum and satellite) - duplicate node IDs
  are not allowed and this will need to be enforced.
- No real idea if this will scale as well as I hope it will!
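
For the full-buffer case, one possible approach (plain non-blocking
sockets, nothing corosync-specific is assumed): if the kernel send buffer
fills up, treat the satellite as dead rather than blocking the host.

    #include <errno.h>
    #include <sys/socket.h>
    #include <unistd.h>

    /* Send one forwarded message to a satellite over its non-blocking TCP
       socket. Returns 0 on success, -1 if the satellite should be cut off.
       This only sketches the "cut it off" policy suggested above; a real
       implementation would probably allow a small amount of queueing. */
    static int satellite_send(int fd, const void *buf, size_t len)
    {
        const char *p = buf;

        while (len > 0) {
            ssize_t n = send(fd, p, len, MSG_NOSIGNAL);

            if (n > 0) {
                p += n;
                len -= (size_t)n;
                continue;
            }
            if (n < 0 && errno == EINTR)
                continue;
            if (n < 0 && (errno == EAGAIN || errno == EWOULDBLOCK)) {
                /* Kernel send buffer is full: the satellite is not keeping
                   up, so drop it rather than stall the host. */
                close(fd);
                return -1;
            }
            /* Any other error: the connection is gone. */
            close(fd);
            return -1;
        }
        return 0;
    }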


How it will (possibly) work
---------------------------
Totemsrp messages will be unaffected (in the 1st revision at least);
satellites are not part of this protocol.
Totempg messages are sent around the ring as usual.
When one arrives at a node with satellites, that node forwards it around
the ring as usual, then sends the message to each of its satellites in turn.
If a send fails, the satellite is cut off and removed from the configuration.
When a message is received from a satellite it is repackaged as a
totempg message and sent around the cluster as normal.
Satellite nodes will be handled by another corosync service that is
loaded: a new service handler that maintains the extra nodes list and
(maybe) does the satellite forwarding.
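
A very rough sketch of the forwarding step described above, continuing
the earlier sketches. satellite_list, satellite_send() and
satellite_remove() are all hypothetical names, not existing corosync code.

    #include <stddef.h>
    #include <stdint.h>

    /* Hypothetical: drops the satellite from the secondary node list and
       sends a leave notification around the cluster. */
    void satellite_remove(uint32_t nodeid);

    /* Called after a totempg message has been delivered locally and
       forwarded around the ring as usual: now copy it to each attached
       satellite in turn. */
    static void forward_to_satellites(const void *msg, size_t msg_len)
    {
        struct satellite_node *n = satellite_list;

        while (n != NULL) {
            struct satellite_node *next = n->next;

            if (satellite_send(n->fd, msg, msg_len) < 0)
                satellite_remove(n->nodeid);

            n = next;
        }
    }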

- Joining
  A satellite sends a TCP connect and then a join request to its
nominated (or fallback) host.
  The host can accept or reject this for reasons of (at least):
   - duplicated nodeid
   - no capacity
   - bad key
   - bad config
  The service then sends the new node's information to the rest of the
  cluster; quorum is not affected.

- Leaving
  If a TCP send fails or a socket is disconnected then the satellite is
  summarily removed.
  - There will probably also be a 'leave' message for tidy removal.
  - Leave notifications are sent around the cluster so that CPG and the
    secondary nodelist know.
  - Quorum does not need to know.

- Failover
  Satellites have a list of all nodes (quorum and satellite) and if a TCP
  connection is broken then they can try to contact the next node in the
  nodeid list of quorum nodes (see the sketch after this list).
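
From the satellite side, joining and failover could look roughly like
this. The join message layout, the accept/reject reply, the port number
and the host-selection loop are all invented for illustration; nothing
here is an agreed wire format.

    #include <arpa/inet.h>
    #include <netdb.h>
    #include <stddef.h>
    #include <stdint.h>
    #include <sys/socket.h>
    #include <unistd.h>

    /* Hypothetical join request sent straight after connect(); the real
       wire format (and authkey handling) is still to be decided. */
    struct sat_join_req {
        uint32_t nodeid;         /* our satellite nodeid, network byte order */
        uint8_t  key_digest[32]; /* placeholder for authkey proof */
    };

    /* Try each configured/known host in turn until one accepts us.
       'hosts' would come from corosync.conf initially and later from the
       node list the current host sends us. */
    static int satellite_connect(const char *const hosts[], size_t n_hosts,
                                 uint32_t nodeid)
    {
        for (size_t i = 0; i < n_hosts; i++) {
            struct addrinfo hints = { .ai_socktype = SOCK_STREAM };
            struct addrinfo *res;

            if (getaddrinfo(hosts[i], "5406" /* made-up port */, &hints, &res) != 0)
                continue;

            int fd = socket(res->ai_family, res->ai_socktype, res->ai_protocol);
            if (fd >= 0 && connect(fd, res->ai_addr, res->ai_addrlen) == 0) {
                struct sat_join_req req = { .nodeid = htonl(nodeid) };
                uint8_t accepted = 0;

                /* Send the join request; in this sketch the host replies
                   with a single accept/reject byte. */
                if (send(fd, &req, sizeof(req), 0) == (ssize_t)sizeof(req) &&
                    recv(fd, &accepted, 1, 0) == 1 && accepted) {
                    freeaddrinfo(res);
                    return fd;   /* joined; use this fd until it breaks */
                }
            }
            if (fd >= 0)
                close(fd);
            freeaddrinfo(res);
        }
        return -1;   /* no host accepted us; the caller retries later */
    }

When the TCP connection to the current host breaks, the satellite would
simply call this again with the most recent list of quorum nodes it was
given, which is the failover behaviour described above.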

Timescales
----------
Nothing is decided at this stage; certainly Corosync 3.0 at the earliest,
as this will break the on-wire protocol.
We need to do a proof-of-concept, maybe using containers to get a high
node count.