Re: RFC: Extending corosync to high node counts

> On 31 Mar 2015, at 12:25 am, Christine Caulfield <ccaulfie@xxxxxxxxxx> wrote:
> 
> This is an updated document based on the responses I've had. Thank you
> everyone.
> 
> 
> 
> Extending corosync
> ------------------
> This is not meant to be a generalised solution to extending corosync for
> most users. For single & double digit cluster sizes the current ring
> protocols should be sufficient. This is intended to make corosync usable
> over much larger node counts.
> 
> The problem
> -----------
> Corosync doesn't scale well to large numbers of nodes (from 60-100 up
> to 1000s); this is mainly down to the requirements of the membership
> protocol and virtual synchrony (VS).
> 
> Terminology
> -----------
> I've used some words in this document that have specific meanings. They are:
> 
>    node: Any node participating in the cluster. Nodes have unique
> nodeids and can participate in CPG and some other corosync APIs (to be
> defined).
> 
>    quorum node: A node in the main part of the cluster. This node has a
> vote and participates in the membership protocol and virtual synchrony.
> 
>    quorum: The minimum number of non-satellite nodes in a cluster that
> must be present for it to continue operation.
> 
>    parent node: A quorum node that also serves satellite nodes.
> 
>    satellite node: A node that does not participate in quorum,
> membership or VS and is connected to a parent node over TCP/IP.
> 
>    node id: A 32 bit integer that uniquely identifies a node in the
> cluster.
> 
>    virtual synchrony (VS): The core of the central corosync messaging
> system. Messages are delivered in order, such that once a node receives
> its copy of the message it knows for sure that all nodes in the cluster
> have also received it (simplified definition!).
> 
>    CPG: Closed Process Groups. The main API-provided messaging system
> inside corosync.
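
For anyone reading along who hasn't used libcpg: this is roughly what a
CPG client looks like with the existing API - a minimal sketch, error
handling stripped, nothing satellite-specific:

  #include <string.h>
  #include <sys/uio.h>
  #include <corosync/cpg.h>

  static void deliver_cb(cpg_handle_t handle, const struct cpg_name *group,
                         uint32_t nodeid, uint32_t pid,
                         void *msg, size_t msg_len)
  {
      /* every member of the group sees messages in the same order */
  }

  static cpg_callbacks_t callbacks = {
      .cpg_deliver_fn = deliver_cb,
  };

  int main(void)
  {
      cpg_handle_t handle;
      struct cpg_name group = { .length = 4, .value = "demo" };
      struct iovec iov = { .iov_base = "hello", .iov_len = 5 };

      cpg_initialize(&handle, &callbacks);
      cpg_join(handle, &group);
      cpg_mcast_joined(handle, CPG_TYPE_AGREED, &iov, 1);

      /* blocks delivering messages; our own message comes straight back
         through deliver_cb on every member, including us */
      cpg_dispatch(handle, CS_DISPATCH_BLOCKING);
      return 0;
  }

Linked with -lcpg; every call above returns a cs_error_t that real code
checks against CS_OK.
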
> 
> A proposed solution
> -------------------
> 
>    Have 'satellite' nodes that are not part of the ring (and do not
> participate in VS).
> 
>    They communicate via a single parent node over (probably) TCP. The
> parent sends the messages to them in a 'send and forget' system -
> though TCP guarantees ordering and delivery.
> 
>    Parent nodes can support many satellites. If a parent goes down then
> its satellites can be reconnected to another parent node and carry on.
> 
>    Satellites have no votes and do not participate in the normal
> membership protocol or Virtual Synchrony.
> 
>    Satellites can send/receive CPG messages and get quorum information
> but will not appear in the quorum nodes list.
> 
>    There must be a separate nodes list for satellites, probably
> maintained by a different subsystem/daemon.
> 
>    Satellites will have node IDs (required for CPG) that do not clash
> with the ring nodeids.
> 
> Appearance to the user/admin
> ----------------------------
> corosync.conf defines which nodes are satellites and which nodes to
> connect to (initially). We may want some utility to force satellites to
> migrate from a node if it gets full.
> Future: Automatic configuration of who is in the VS cluster and who is a
> satellite. Load balancing.
> Maybe need 'preferred nodes' to avoid bad network topologies.
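
To make that concrete, a purely hypothetical corosync.conf fragment -
the 'satellite' and 'parent' keys below are invented for illustration,
nothing like this exists today:

  nodelist {
      node {
          ring0_addr: 192.168.1.1
          nodeid: 1
      }
      node {
          ring0_addr: 192.168.1.2
          nodeid: 2
      }
      node {
          ring0_addr: 192.168.100.1
          nodeid: 1001
          satellite: yes    # invented key: never joins the ring
          parent: 1         # invented key: preferred parent nodeid
      }
  }
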
> 
> Potential problems
> ------------------
>    corosync uses a packet-based protocol, while TCP is a stream (I
> don't see this as a big problem, TBH)
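
Agreed it's not a big one - a length prefix per packet is the usual
answer. A sketch (my own helper names, nothing corosync-specific):

  #include <stdint.h>
  #include <unistd.h>
  #include <arpa/inet.h>

  /* Send one corosync packet down the TCP stream, prefixed with its
     length so the receiver can find the packet boundaries again. */
  static int send_packet(int sock, const void *pkt, uint32_t len)
  {
      uint32_t netlen = htonl(len);

      if (write(sock, &netlen, sizeof(netlen)) != sizeof(netlen))
          return -1;
      if (write(sock, pkt, len) != (ssize_t)len)
          return -1;    /* short writes treated as failure in this sketch */
      return 0;
  }

  /* Read exactly 'len' bytes, coping with short reads. */
  static int read_all(int sock, void *buf, size_t len)
  {
      size_t done = 0;

      while (done < len) {
          ssize_t r = read(sock, (char *)buf + done, len - done);
          if (r <= 0)
              return -1;
          done += r;
      }
      return 0;
  }

  /* Reassemble one packet: length prefix first, then the payload. */
  static int recv_packet(int sock, void *pkt, uint32_t maxlen, uint32_t *len)
  {
      uint32_t netlen;

      if (read_all(sock, &netlen, sizeof(netlen)) < 0)
          return -1;
      *len = ntohl(netlen);
      if (*len > maxlen)
          return -1;    /* refuse oversized frames */
      return read_all(sock, pkt, *len);
  }
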
> 
>    Which is client/server? (if satellites are clients with authkey we
> get easy failover and config, but ... DoS potential??)

Satellites have to be the server; otherwise security and failure
handling are a nightmare.

> 
>    How to 'fake' satellite node IDs in the CPG nodes list - will
> probably need to extend the libcpg API.
> 
>    Do we need to add 'fake' join/leave events too?
> 
>    What if TCP buffers get full? Suggest just cutting off the node.
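
Cutting off sounds right. With a non-blocking socket the parent never
stalls on a slow satellite; a rough sketch of what I mean:

  #include <unistd.h>
  #include <sys/socket.h>

  /* Queue a frame to a satellite without ever blocking the parent.  If
     the kernel send buffer is full (satellite too slow, or dead) just
     drop the connection; the normal 'node down' path then removes it
     from the satellite node list.  A real implementation would probably
     allow a short grace period before giving up. */
  static int satellite_send(int sock, const void *frame, size_t len)
  {
      ssize_t n = send(sock, frame, len, MSG_DONTWAIT | MSG_NOSIGNAL);

      if (n == (ssize_t)len)
          return 0;
      close(sock);    /* EAGAIN/EWOULDBLOCK, error or partial send */
      return -1;
  }
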
> 
>    Fencing, do we need it? (pacemaker problem?)
> 
>    Keeping two node lists (totem/quorum and satellite) - duplicate node
> IDs are not allowed and this will need to be enforced.
> 
>    No real idea if this will scale as well as I hope it will!
> 
>    GFS2 et al? Is this needed/possible?

I’d not go there :)

> 
>    How (if at all) does knet fit into all this?
> 
> How it will (possibly) work
> ---------------------------
> Have a separate daemon that runs on a corosync parent node and
> communicates between the local corosync & its satellites
> IDEA: Can we use the 'real' corosync libs and have a different server
> back end on the satellites?
> - reuse the corosync server-side IPC code
> 
> CPG - would just be forwarded on to the parent with node ID 'fixed'
> cmap - forwarded to parent corosync
> quorum - keep own context
> CFG - shutdown request as corosync cfg
> 
> Need some API (or cmap?) for satellite node list

If you have the daemon make one connection per satellite (maybe by spawning a child for each one) they’d automatically show up in the CPG list.
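
Roughly like this sketch, using only existing libcpg calls (error
handling and the actual relay loop omitted):

  #include <unistd.h>
  #include <corosync/cpg.h>

  static int sat_sock;    /* this child's satellite connection */

  static void relay_deliver(cpg_handle_t handle, const struct cpg_name *group,
                            uint32_t nodeid, uint32_t pid,
                            void *msg, size_t msg_len)
  {
      /* frame msg and write it down sat_sock */
  }

  static cpg_callbacks_t relay_callbacks = {
      .cpg_deliver_fn = relay_deliver,
  };

  /* Called from the proxy daemon's accept() loop: one forked child per
     satellite, each holding its own CPG connection, so every satellite
     shows up in confchg/cpg_membership_get() lists as a separate
     (nodeid, pid) pair - though still with the parent's nodeid, which
     is why the node ID 'fixing' above is still needed. */
  static void serve_satellite(int sock, const struct cpg_name *group)
  {
      cpg_handle_t handle;

      if (fork() != 0)
          return;             /* parent daemon goes back to accept() */

      sat_sock = sock;
      cpg_initialize(&handle, &relay_callbacks);
      cpg_join(handle, group);

      for (;;) {
          cpg_dispatch(handle, CS_DISPATCH_ONE);
          /* ...also poll sat_sock and cpg_mcast_joined() whatever the
             satellite sends up... */
      }
  }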

> Use a separate CPG for managing the satellites node list etc
> 
> Does the satellite pacemaker/others need to know it is running on a
> satellite?
> - We can add a cmap key to hold this info.
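
Something as small as this would do on the client side - the
"runtime.satellite" key name is made up for illustration, only
cmap_get_string() itself is existing API:

  #include <stdlib.h>
  #include <string.h>
  #include <corosync/cmap.h>

  /* Let pacemaker etc. ask whether it is running on a satellite. */
  static int running_as_satellite(void)
  {
      cmap_handle_t handle;
      char *value = NULL;
      int satellite = 0;

      if (cmap_initialize(&handle) != CS_OK)
          return 0;
      if (cmap_get_string(handle, "runtime.satellite", &value) == CS_OK) {
          satellite = (strcmp(value, "yes") == 0);
          free(value);
      }
      cmap_finalize(handle);
      return satellite;
  }
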
> 
> - joining
>   It's best(*) for the parents to boot the satellites
>     (*more secure, fewer DoS possibilities, more control)
>     - do we poll for dead satellites? how often? how? (connect?, ping?)
>     - CPG group to determine who is the parent of a satellite when a
> parent leaves
>        - allows easy failover & maintenance of node list
> 
> - leaving
>   If a TCP send fails or a socket is disconnected then the node is
> summarily removed
>   - there will probably also be a 'leave' message sent by the parent
> for tidy removal
>   - leave notifications are sent around the cluster so that the
> secondary nodelist knows.
>   - quorum does not need to know.
>   - if a parent leaves then we need to send satellite node down
> messages too (in the
>     new service/private CPG) not for quorum, but for cpg clients.
> 
> - failover
>   When a parent fails or leaves, another suitable parent should contact
> the orphaned satellites and try to include them back in the cluster.
> Some form of network topology awareness might be nice here so the
> nearest parent contacts the satellite.
>   - also load balancing?
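
If the parents already share the management CPG suggested earlier they
don't need a separate election: each survivor runs the same
deterministic rule over the confchg member list. Lowest nodeid below is
purely a sketch - a topology- or load-aware rule would slot into the
same place:

  #include <stdint.h>
  #include <stddef.h>
  #include <corosync/cpg.h>

  /* Given the surviving members of the parents' management CPG, decide
     which parent adopts the dead parent's satellites.  Every survivor
     computes the same answer from the same member list, so no extra
     election traffic is needed. */
  static uint32_t choose_new_parent(const struct cpg_address *members,
                                    size_t n_members)
  {
      uint32_t best = 0;
      size_t i;

      for (i = 0; i < n_members; i++) {
          if (best == 0 || members[i].nodeid < best)
              best = members[i].nodeid;
      }
      return best;    /* 0 means no parent is left at all */
  }
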
> 
> Timescales
> ----------
> Nothing decided at this stage, probably Corosync 3.0 at the earliest.
> Need to do a proof-of-concept, maybe using containers to get high node
> count.
> 
> Corosync services used by pacemaker (please check!)
> ---------------------------------------------------
> CPG  - obviously
> CFG  - used to prevent corosync shutdown if pacemaker is running
> cmap - Need to client-server this on a per-request basis
>           used for nodelist and logging options AFAICT
>           so mainly called at startup
> quorum - including notification

Looks right. These are the headers I see us using:

 <corosync/cfg.h>
 <corosync/cmap.h>
 <corosync/confdb.h>
 <corosync/corodefs.h>
 <corosync/corotypes.h>
 <corosync/cpg.h>
 <corosync/engine/config.h>
 <corosync/engine/objdb.h>
 <corosync/hdb.h>
 <corosync/quorum.h>
 <corosync/totem/totempg.h>



_______________________________________________
discuss mailing list
discuss@xxxxxxxxxxxx
http://lists.corosync.org/mailman/listinfo/discuss




