> On 31 Mar 2015, at 12:25 am, Christine Caulfield <ccaulfie@xxxxxxxxxx> wrote:
>
> This is an updated document based on the responses I've had. Thank you everyone.
>
>
> Extending corosync
> ------------------
> This is not meant to be a generalised solution to extending corosync for most users. For single & double digit cluster sizes the current ring protocols should be sufficient. This is intended to make corosync usable over much larger node counts.
>
> The problem
> -----------
> Corosync doesn't scale well to large numbers of nodes (60-100 to 1000s); this is mainly down to the requirements of the membership protocol and virtual synchrony (VS).
>
> Terminology
> -----------
> I've used some words in this document that have specific meanings. They are:
>
> node: Any node participating in the cluster. Nodes have unique nodeids and can participate in CPG and some other corosync APIs (to be defined).
>
> quorum node: A node in the main part of the cluster. This node has a vote and participates in the membership protocol and virtual synchrony.
>
> quorum: The minimum number of non-satellite nodes in a cluster that must be present for it to continue operation.
>
> parent node: A quorum node that also serves satellite nodes.
>
> satellite node: A node that does not participate in quorum, membership or VS and is connected to a parent node over TCP/IP.
>
> node id: A 32-bit integer that uniquely identifies a node in the cluster.
>
> virtual synchrony (VS): The core of the central corosync messaging system. Messages are delivered in order such that once a node receives its copy of the message it knows for sure that all nodes in the cluster have also received it (simplified definition!).
>
> CPG: Closed Process Groups. The main API-provided messaging system inside corosync.
>
> A proposed solution
> -------------------
>
> Have 'satellite' nodes that are not part of the ring (and do not participate in VS).
>
> They communicate via a single parent node over (probably) TCP. The parent sends the messages to them in a 'send and forget' system - though TCP guarantees ordering and delivery.
>
> Parent nodes can support many satellites. If a parent goes down then its satellites can be reconnected to another parent node and carry on.
>
> Satellites have no votes and do not participate in the normal membership protocol or Virtual Synchrony.
>
> Satellites can send/receive CPG messages and get quorum information but will not appear in the quorum nodes list.
>
> There must be a separate nodes list for satellites, probably maintained by a different subsystem/daemon.
>
> Satellites will have node IDs (required for CPG) that do not clash with the ring nodeids.
>
> Appearance to the user/admin
> ----------------------------
> corosync.conf defines which nodes are satellites and which nodes to connect to (initially). We may want some utility to force satellites to migrate from a node if it gets full.
> Future: Automatic configuration of who is in the VS cluster and who is a satellite. Load balancing.
> Maybe need 'preferred nodes' to avoid bad network topologies.
>
> Potential problems
> ------------------
> corosync uses a packet-based protocol, TCP is a stream (I don't see this as a big problem, TBH)
>
> Which is client/server? (if satellites are client with authkey we get easy failover and config, but ... DoS potential??)

Satellites have to be the server, otherwise security and failure are a nightmare.
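On the packet-vs-stream point above I also don't see a big problem - a thin length-prefixed framing layer over the satellite link should cover it. A minimal sketch of what I mean (the function names are mine, nothing here is settled):

/* Minimal length-prefixed framing over the satellite TCP link.
 * One corosync packet per frame: 4-byte big-endian length, then payload.
 */
#include <stdint.h>
#include <string.h>
#include <sys/types.h>
#include <unistd.h>
#include <arpa/inet.h>

/* Write exactly 'len' bytes, retrying short writes. Returns 0 or -1. */
static int write_all(int fd, const void *buf, size_t len)
{
    const uint8_t *p = buf;
    while (len > 0) {
        ssize_t n = write(fd, p, len);
        if (n <= 0)
            return -1;      /* caller cuts the satellite off */
        p += n;
        len -= n;
    }
    return 0;
}

/* Read exactly 'len' bytes. Returns 0 or -1. */
static int read_all(int fd, void *buf, size_t len)
{
    uint8_t *p = buf;
    while (len > 0) {
        ssize_t n = read(fd, p, len);
        if (n <= 0)
            return -1;
        p += n;
        len -= n;
    }
    return 0;
}

/* Send one packet as [length][payload]. */
int frame_send(int fd, const void *pkt, uint32_t pkt_len)
{
    uint32_t hdr = htonl(pkt_len);

    if (write_all(fd, &hdr, sizeof(hdr)) < 0)
        return -1;
    return write_all(fd, pkt, pkt_len);
}

/* Receive one packet into 'buf' (at most 'buf_len' bytes); returns its length or -1. */
int frame_recv(int fd, void *buf, uint32_t buf_len)
{
    uint32_t hdr;

    if (read_all(fd, &hdr, sizeof(hdr)) < 0)
        return -1;
    hdr = ntohl(hdr);
    if (hdr > buf_len)
        return -1;          /* oversized frame: drop the connection */
    if (read_all(fd, buf, hdr) < 0)
        return -1;
    return (int)hdr;
}

It also gives us an obvious place to cut a satellite off: if frame_send() fails or blocks for too long, drop the connection.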
>
> How to 'fake' satellite node IDs in the CPG nodes list - will probably need to extend the libcpg API.
>
> do we need to add 'fake' join/leave events too?
>
> What if TCP buffers get full? Suggest just cutting off the node.
>
> Fencing, do we need it? (pacemaker problem?)
>
> Keeping two node lists (totem/quorum and satellite) - duplicate node IDs are not allowed and this will need to be enforced.
>
> No real idea if this will scale as well as I hope it will!
>
> GFS2 et al? is this needed/possible?

I'd not go there :)

> How (if at all) does knet fit into all this?
>
> How it will (possibly) work
> ---------------------------
> Have a separate daemon that runs on a corosync parent node and communicates between the local corosync & its satellites.
> IDEA: Can we use the 'real' corosync libs and have a different server back end on the satellites?
> - reuse the corosync server-side IPC code
>
> CPG - would just be forwarded on to the parent with node ID 'fixed'
> cmap - forwarded to parent corosync
> quorum - keep own context
> CFG - shutdown request as corosync cfg
>
> Need some API (or cmap?) for satellite node list

If you have the daemon make one connection per satellite (maybe by spawning a child for each one) they'd automatically show up in the CPG list; there's a rough sketch of what I mean further down.

> Use a separate CPG for managing the satellite node list etc.
>
> Does the satellite pacemaker/others need to know it is running on a satellite?
> - We can add a cmap key to hold this info.
>
> - joining
> It's best(*) for the parents to boot the satellites (*more secure, less DoS possibilities, more control)
> - do we poll for dead satellites? how often? how? (connect?, ping?)
> - CPG group to determine who is the parent of a satellite when a parent leaves
> - allows easy failover & maintenance of node list
>
> - leaving
> If a TCP send fails or a socket is disconnected then the node is summarily removed
> - there will probably also be a 'leave' message sent by the parent for tidy removal
> - leave notifications are sent around the cluster so that the secondary nodelist knows.
> - quorum does not need to know.
> - if a parent leaves then we need to send satellite node down messages too (in the new service/private CPG) - not for quorum, but for cpg clients.
>
> - failover
> When a parent fails or leaves, another suitable parent should contact the orphaned satellites and try to include them back in the cluster. Some form of network topology might be nice here so the nearest parent contacts the satellite.
> - also load balancing?
>
> Timescales
> ----------
> Nothing decided at this stage, probably Corosync 3.0 at the earliest. Need to do a proof-of-concept, maybe using containers to get a high node count.
>
> Corosync services used by pacemaker (please check!)
> ---------------------------------------------------
> CPG - obviously
> CFG - used to prevent corosync shutdown if pacemaker is running
> cmap - Need to client-server this on a per-request basis; used for nodelist and logging options AFAICT, so mainly called at startup
> quorum - including notification

Looks right.
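To make the one-connection-per-satellite idea a bit more concrete: if the daemon spawns a child per satellite and each child opens its own cpg handle, each satellite appears in the group as its own (nodeid, pid) entry and its traffic can simply be relayed over the TCP link. A rough sketch against the current libcpg - the node-ID 'fixing' and error handling are left out, and frame_send()/satellite_fd are the hypothetical helpers from the framing sketch above:

/* Parent-side forwarding sketch: one cpg connection per satellite
 * (run in a child process per satellite, as suggested above).
 */
#include <stdint.h>
#include <string.h>
#include <sys/uio.h>
#include <corosync/cpg.h>

extern int frame_send(int fd, const void *pkt, uint32_t pkt_len); /* framing sketch */

static int satellite_fd;    /* TCP socket for this satellite (one per child) */

/* Forward every delivery to the satellite. A real implementation would
 * prepend nodeid/pid/group so the satellite-side stub can rebuild the callback. */
static void deliver_cb(cpg_handle_t handle, const struct cpg_name *group,
                       uint32_t nodeid, uint32_t pid, void *msg, size_t msg_len)
{
    frame_send(satellite_fd, msg, (uint32_t)msg_len);
}

/* This is where the 'fake' join/leave events for satellites would be generated. */
static void confchg_cb(cpg_handle_t handle, const struct cpg_name *group,
                       const struct cpg_address *members, size_t member_count,
                       const struct cpg_address *left, size_t left_count,
                       const struct cpg_address *joined, size_t joined_count)
{
    /* forward membership changes to the satellite as well */
}

static cpg_callbacks_t callbacks = {
    .cpg_deliver_fn = deliver_cb,
    .cpg_confchg_fn = confchg_cb,
};

/* Called after accepting a satellite's TCP connection. */
int satellite_attach(int fd, const char *group_name, cpg_handle_t *handle)
{
    struct cpg_name group;

    satellite_fd = fd;
    group.length = strlen(group_name);
    memcpy(group.value, group_name, group.length);

    if (cpg_initialize(handle, &callbacks) != CS_OK)
        return -1;
    return (cpg_join(*handle, &group) == CS_OK) ? 0 : -1;
}

/* Satellite sent us a CPG message over TCP: multicast it into the group. */
int satellite_mcast(cpg_handle_t handle, const void *msg, size_t msg_len)
{
    struct iovec iov = { .iov_base = (void *)msg, .iov_len = msg_len };

    return (cpg_mcast_joined(handle, CPG_TYPE_AGREED, &iov, 1) == CS_OK) ? 0 : -1;
}

The child's main loop would poll the descriptor from cpg_fd_get() alongside the satellite socket and call cpg_dispatch() when it is readable. Whether one connection per satellite scales to the node counts we're after is exactly what the proof-of-concept needs to show.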
These are the headers I see us using:

  <corosync/cfg.h>
  <corosync/cmap.h>
  <corosync/confdb.h>
  <corosync/corodefs.h>
  <corosync/corotypes.h>
  <corosync/cpg.h>
  <corosync/engine/config.h>
  <corosync/engine/objdb.h>
  <corosync/hdb.h>
  <corosync/quorum.h>
  <corosync/totem/totempg.h>
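And on the 'does the satellite pacemaker need to know it is running on a satellite' question: a cmap key would be easy to consume from the client side. Something like this (the key name is made up - whatever we choose just needs to be answered by the satellite-side stub like any other cmap get):

/* Ask the local corosync (or the satellite-side stub) whether this node
 * is a satellite. "nodelist.local.satellite" is a made-up key name;
 * the real one is still to be decided.
 */
#include <stdint.h>
#include <corosync/cmap.h>

int node_is_satellite(void)
{
    cmap_handle_t handle;
    uint8_t satellite = 0;

    if (cmap_initialize(&handle) != CS_OK)
        return -1;

    /* A missing key just means "not a satellite" (quorum node or older corosync). */
    if (cmap_get_uint8(handle, "nodelist.local.satellite", &satellite) != CS_OK)
        satellite = 0;

    cmap_finalize(handle);
    return satellite ? 1 : 0;
}

Treating a missing key as 'not a satellite' keeps existing quorum nodes working unchanged.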