This is an updated document based on the responses I've had. Thank you
everyone.

Extending corosync
------------------

This is not meant to be a generalised solution to extending corosync for
most users. For single and double digit cluster sizes the current ring
protocols should be sufficient. This is intended to make corosync usable
over much larger node counts.

The problem
-----------

Corosync doesn't scale well to large numbers of nodes (from 60-100 up to
1000s); this is mainly down to the requirements of the membership protocol
and virtual synchrony (VS).

Terminology
-----------

I've used some words in this document that have specific meanings. They are:

node: Any node participating in the cluster. Nodes have unique node IDs and
can participate in CPG and some other corosync APIs (to be defined).

quorum node: A node in the main part of the cluster. This node has a vote
and participates in the membership protocol and virtual synchrony.

quorum: The minimum number of non-satellite nodes in a cluster that must be
present for it to continue operation.

parent node: A quorum node that also serves satellite nodes.

satellite node: A node that does not participate in quorum, membership or
VS and is connected to a parent node over TCP/IP.

node id: A 32-bit integer that uniquely identifies a node in the cluster.

virtual synchrony (VS): The core of the central corosync messaging system.
Messages are delivered in order such that once a node receives its copy of
a message it knows for sure that all nodes in the cluster have also
received it (simplified definition!).

CPG: Closed Process Groups. The main API-provided messaging system inside
corosync.

A proposed solution
-------------------

Have 'satellite' nodes that are not part of the ring (and do not
participate in VS). They communicate via a single parent node over
(probably) TCP. The parent sends messages to them in a 'send and forget'
fashion, though TCP guarantees ordering and delivery. Parent nodes can
support many satellites. If a parent goes down then its satellites can be
reconnected to another parent node and carry on.

Satellites have no votes and do not participate in the normal membership
protocol or virtual synchrony. Satellites can send/receive CPG messages and
get quorum information but will not appear in the quorum nodes list. There
must be a separate nodes list for satellites, probably maintained by a
different subsystem/daemon. Satellites will have node IDs (required for
CPG) that do not clash with the ring node IDs.

Appearance to the user/admin
----------------------------

corosync.conf defines which nodes are satellites and which nodes to connect
to (initially). We may want some utility to force satellites to migrate
from a node if it gets full.

Future: Automatic configuration of who is in the VS cluster and who is a
satellite. Load balancing. Maybe need 'preferred nodes' to avoid bad
network topologies.

Potential problems
------------------

corosync uses a packet-based protocol, TCP is a stream (I don't see this as
a big problem, TBH - see the framing sketch at the end of this section).

Which is client/server? (if satellites are clients with the authkey we get
easy failover and config, but... DoS potential??)

How to 'fake' satellite node IDs in the CPG nodes list - will probably need
to extend the libcpg API. Do we need to add 'fake' join/leave events too?

What if TCP buffers get full? Suggest just cutting off the node.

Fencing, do we need it? (pacemaker problem?)

Keeping two node lists (totem/quorum and satellite) - duplicate node IDs
are not allowed and this will need to be enforced.

No real idea if this will scale as well as I hope it will!

GFS2 et al? Is this needed/possible?

How (if at all) does knet fit into all this?
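To make the packet-over-a-stream point above concrete, here is a minimal
framing sketch in C (my assumption for illustration, not an agreed wire
format): each corosync packet travels over the parent<->satellite TCP
connection as a 4-byte network-order length followed by the payload, and a
failed write or a closed socket maps onto the summary removal described
above.

/*
 * Minimal framing sketch (illustrative assumption, not an agreed format):
 * one corosync packet = 4-byte network-order length + payload.
 */
#include <stdint.h>
#include <stddef.h>
#include <unistd.h>
#include <arpa/inet.h>

/* Write exactly 'len' bytes, coping with short writes on the stream socket. */
static int write_all(int sock, const void *buf, size_t len)
{
	size_t done = 0;

	while (done < len) {
		ssize_t w = write(sock, (const char *)buf + done, len - done);
		if (w <= 0)
			return -1;	/* send failed: summarily remove the satellite */
		done += w;
	}
	return 0;
}

/* Read exactly 'len' bytes, coping with short reads. */
static int read_all(int sock, void *buf, size_t len)
{
	size_t done = 0;

	while (done < len) {
		ssize_t r = read(sock, (char *)buf + done, len - done);
		if (r <= 0)
			return -1;	/* error or peer closed: treat as a leave */
		done += r;
	}
	return 0;
}

/* Send one packet as <length><payload>. */
static int send_frame(int sock, const void *msg, uint32_t len)
{
	uint32_t hdr = htonl(len);

	if (write_all(sock, &hdr, sizeof(hdr)) < 0)
		return -1;
	return write_all(sock, msg, len);
}

/* Receive one packet; returns its length, or -1 on error/disconnect. */
static ssize_t recv_frame(int sock, void *buf, size_t maxlen)
{
	uint32_t hdr;

	if (read_all(sock, &hdr, sizeof(hdr)) < 0)
		return -1;
	hdr = ntohl(hdr);
	if (hdr > maxlen)
		return -1;	/* oversized frame: cut the satellite off */
	if (read_all(sock, buf, hdr) < 0)
		return -1;
	return (ssize_t)hdr;
}

This is only one obvious option; the real daemon may well want its own
header carrying a message type, satellite node ID and so on.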
How it will (possibly) work
---------------------------

Have a separate daemon that runs on a corosync parent node and communicates
between the local corosync & its satellites.

IDEA: Can we use the 'real' corosync libs and have a different server back
end on the satellites?
 - reuse the corosync server-side IPC code
 - CPG - would just be forwarded on to the parent with the node ID 'fixed'
 - cmap - forwarded to the parent corosync
 - quorum - keep own context
 - CFG - shutdown request as corosync cfg

Need some API (or cmap?) for the satellite node list.
Use a separate CPG for managing the satellite node list etc.

Does the satellite pacemaker/others need to know it is running on a
satellite?
 - We can add a cmap key to hold this info.

- joining
  It's best(*) for the parents to boot the satellites
  (*more secure, less DoS possibilities, more control)
  - do we poll for dead satellites? how often? how? (connect?, ping?)
  - CPG group to determine who is the parent of a satellite when a parent
    leaves - allows easy failover & maintenance of the node list

- leaving
  If a TCP send fails or a socket is disconnected then the node is
  summarily removed
  - there will probably also be a 'leave' message sent by the parent for
    tidy removal
  - leave notifications are sent around the cluster so that the secondary
    nodelist knows.
  - quorum does not need to know.
  - if a parent leaves then we need to send satellite node down messages
    too (in the new service/private CPG) - not for quorum, but for CPG
    clients.

- failover
  When a parent fails or leaves, another suitable parent should contact the
  orphaned satellites and try to include them back in the cluster. Some
  form of network topology awareness might be nice here so that the nearest
  parent contacts the satellite.
  - also load balancing?

Timescales
----------

Nothing decided at this stage, probably Corosync 3.0 at the earliest. Need
to do a proof-of-concept, maybe using containers to get a high node count.

Corosync services used by pacemaker (please check!)
---------------------------------------------------

CPG    - obviously
CFG    - used to prevent corosync shutdown if pacemaker is running
cmap   - need to client-server this on a per-request basis; used for the
         nodelist and logging options AFAICT, so mainly called at startup
quorum - including notification
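For reference, this is roughly the client-side surface a satellite shim
would have to reproduce over its TCP link for the two services above that
carry most of the runtime traffic (CPG and quorum). A minimal sketch
assuming a standard corosync 2.x client library is installed; the group
name and payload are made up for the example.

/*
 * Sketch only: the ordinary corosync client API calls that a satellite-side
 * back end would need to proxy to its parent.
 */
#include <stdio.h>
#include <string.h>
#include <sys/uio.h>
#include <corosync/corotypes.h>
#include <corosync/cpg.h>
#include <corosync/quorum.h>

static void deliver_cb(cpg_handle_t handle, const struct cpg_name *group,
		       uint32_t nodeid, uint32_t pid, void *msg, size_t msg_len)
{
	/* On a satellite this would arrive via the parent, nodeid 'fixed'. */
	printf("CPG message from nodeid %u (%zu bytes)\n", nodeid, msg_len);
}

static void confchg_cb(cpg_handle_t handle, const struct cpg_name *group,
		       const struct cpg_address *members, size_t n_members,
		       const struct cpg_address *left, size_t n_left,
		       const struct cpg_address *joined, size_t n_joined)
{
	/* Satellites would need 'fake' join/leave events injected here. */
	printf("group now has %zu members\n", n_members);
}

static cpg_callbacks_t callbacks = {
	.cpg_deliver_fn = deliver_cb,
	.cpg_confchg_fn = confchg_cb,
};

int main(void)
{
	cpg_handle_t cpg;
	quorum_handle_t q;
	uint32_t quorum_type;
	int quorate = 0;
	struct cpg_name group = { .length = 4, .value = "test" };
	char payload[] = "hello from a satellite";
	struct iovec iov = { .iov_base = payload, .iov_len = strlen(payload) };

	/* CPG: join a group and multicast one message. */
	if (cpg_initialize(&cpg, &callbacks) != CS_OK ||
	    cpg_join(cpg, &group) != CS_OK)
		return 1;
	cpg_mcast_joined(cpg, CPG_TYPE_AGREED, &iov, 1);
	cpg_dispatch(cpg, CS_DISPATCH_ONE);

	/* Quorum: a satellite would get this information from its parent. */
	if (quorum_initialize(&q, NULL, &quorum_type) == CS_OK) {
		quorum_getquorate(q, &quorate);
		printf("cluster is %squorate\n", quorate ? "" : "not ");
		quorum_finalize(q);
	}

	cpg_finalize(cpg);
	return 0;
}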