This is an updated document based on the responses I've had. Thank you
everyone.

Extending corosync
------------------

This is not meant to be a generalised solution to extending corosync for
most users. For single and double digit cluster sizes the current ring
protocols should be sufficient. This is intended to make corosync usable
over much larger node counts.

The problem
-----------

Corosync doesn't scale well to large numbers of nodes (from 60-100 up to
1000s); this is mainly down to the requirements of the membership protocol
and virtual synchrony (VS).

Terminology
-----------

I've used some words in this document that have specific meanings. They are:

node: Any node participating in the cluster. Nodes have unique node IDs and
can participate in CPG and some other corosync APIs (to be defined).

quorum node: A node in the main part of the cluster. This node has a vote
and participates in the membership protocol and virtual synchrony.

quorum: The minimum number of non-satellite nodes in a cluster that must be
present for it to continue operation.

parent node: A quorum node that also serves satellite nodes.

satellite node: A node that does not participate in quorum, membership or
VS and is connected to a parent node over TCP/IP.

node id: A 32-bit integer that uniquely identifies a node in the cluster.

virtual synchrony (VS): The core of the central corosync messaging system.
Messages are delivered in order such that once a node receives its copy of
a message it knows for sure that all nodes in the cluster have also
received it (simplified definition!).

CPG: Closed Process Groups. The main API-provided messaging system inside
corosync.

A proposed solution
-------------------

Have 'satellite' nodes that are not part of the ring (and do not
participate in VS). They communicate via a single parent node over
(probably) TCP. The parent sends messages to them in a 'send and forget'
fashion, though TCP guarantees ordering and delivery. Parent nodes can
support many satellites. If a parent goes down then its satellites can be
reconnected to another parent node and carry on.

Satellites have no votes and do not participate in the normal membership
protocol or virtual synchrony. Satellites can send/receive CPG messages and
get quorum information but will not appear in the quorum nodes list. There
must be a separate nodes list for satellites, probably maintained by a
different subsystem/daemon. Satellites will have node IDs (required for
CPG) that do not clash with the ring node IDs.

Appearance to the user/admin
----------------------------

corosync.conf defines which nodes are satellites and which nodes to connect
to (initially). We may want some utility to force satellites to migrate
from a node if it gets full.

Future: Automatic configuration of who is in the VS cluster and who is a
satellite. Load balancing. Maybe need 'preferred nodes' to avoid bad
network topologies.

Potential problems
------------------

corosync uses a packet-based protocol, TCP is a stream (I don't see this as
a big problem, TBH - see the framing sketch at the end of this section).

Which is client/server? (if satellites are clients with the authkey we get
easy failover and config, but... DoS potential??)

How to 'fake' satellite node IDs in the CPG nodes list - will probably need
to extend the libcpg API. Do we need to add 'fake' join/leave events too?

What if TCP buffers get full? Suggest just cutting off the node.

Fencing, do we need it? (pacemaker problem?)

Keeping two node lists (totem/quorum and satellite) - duplicate node IDs
are not allowed and this will need to be enforced.

No real idea if this will scale as well as I hope it will!

GFS2 et al? Is this needed/possible?

How (if at all) does knet fit into all this?
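To make the packet-over-a-stream point above concrete, here is a minimal
framing sketch in C (my assumption for illustration, not an agreed wire
format): each corosync packet travels over the parent<->satellite TCP
connection as a 4-byte network-order length followed by the payload, and a
failed write or a closed socket maps onto the summary removal described
above.

/*
 * Minimal framing sketch (illustrative assumption, not an agreed format):
 * one corosync packet = 4-byte network-order length + payload.
 */
#include <stdint.h>
#include <stddef.h>
#include <unistd.h>
#include <arpa/inet.h>

/* Write exactly 'len' bytes, coping with short writes on the stream socket. */
static int write_all(int sock, const void *buf, size_t len)
{
	size_t done = 0;

	while (done < len) {
		ssize_t w = write(sock, (const char *)buf + done, len - done);
		if (w <= 0)
			return -1;	/* send failed: summarily remove the satellite */
		done += w;
	}
	return 0;
}

/* Read exactly 'len' bytes, coping with short reads. */
static int read_all(int sock, void *buf, size_t len)
{
	size_t done = 0;

	while (done < len) {
		ssize_t r = read(sock, (char *)buf + done, len - done);
		if (r <= 0)
			return -1;	/* error or peer closed: treat as a leave */
		done += r;
	}
	return 0;
}

/* Send one packet as <length><payload>. */
static int send_frame(int sock, const void *msg, uint32_t len)
{
	uint32_t hdr = htonl(len);

	if (write_all(sock, &hdr, sizeof(hdr)) < 0)
		return -1;
	return write_all(sock, msg, len);
}

/* Receive one packet; returns its length, or -1 on error/disconnect. */
static ssize_t recv_frame(int sock, void *buf, size_t maxlen)
{
	uint32_t hdr;

	if (read_all(sock, &hdr, sizeof(hdr)) < 0)
		return -1;
	hdr = ntohl(hdr);
	if (hdr > maxlen)
		return -1;	/* oversized frame: cut the satellite off */
	if (read_all(sock, buf, hdr) < 0)
		return -1;
	return (ssize_t)hdr;
}

This is only one obvious option; the real daemon may well want its own
header carrying a message type, satellite node ID and so on.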
How it will (possibly) work
---------------------------

Have a separate daemon that runs on a corosync parent node and communicates
between the local corosync & its satellites.

IDEA: Can we use the 'real' corosync libs and have a different server back
end on the satellites?
 - reuse the corosync server-side IPC code
 - CPG - would just be forwarded on to the parent with the node ID 'fixed'
 - cmap - forwarded to the parent corosync
 - quorum - keep own context
 - CFG - shutdown request as corosync cfg

Need some API (or cmap?) for the satellite node list.
Use a separate CPG for managing the satellite node list etc.

Does the satellite pacemaker/others need to know it is running on a
satellite?
 - We can add a cmap key to hold this info.

- joining
  It's best(*) for the parents to boot the satellites
  (*more secure, less DoS possibilities, more control)
  - do we poll for dead satellites? how often? how? (connect?, ping?)
  - CPG group to determine who is the parent of a satellite when a parent
    leaves - allows easy failover & maintenance of the node list

- leaving
  If a TCP send fails or a socket is disconnected then the node is
  summarily removed
  - there will probably also be a 'leave' message sent by the parent for
    tidy removal
  - leave notifications are sent around the cluster so that the secondary
    nodelist knows.
  - quorum does not need to know.
  - if a parent leaves then we need to send satellite node down messages
    too (in the new service/private CPG) - not for quorum, but for CPG
    clients.

- failover
  When a parent fails or leaves, another suitable parent should contact the
  orphaned satellites and try to include them back in the cluster. Some
  form of network topology awareness might be nice here so that the nearest
  parent contacts the satellite.
  - also load balancing?

Timescales
----------

Nothing decided at this stage, probably Corosync 3.0 at the earliest. Need
to do a proof-of-concept, maybe using containers to get a high node count.

Corosync services used by pacemaker (please check!)
---------------------------------------------------

CPG    - obviously
CFG    - used to prevent corosync shutdown if pacemaker is running
cmap   - need to client-server this on a per-request basis; used for the
         nodelist and logging options AFAICT, so mainly called at startup
quorum - including notification
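For reference, this is roughly the client-side surface a satellite shim
would have to reproduce over its TCP link for the two services above that
carry most of the runtime traffic (CPG and quorum). A minimal sketch
assuming a standard corosync 2.x client library is installed; the group
name and payload are made up for the example.

/*
 * Sketch only: the ordinary corosync client API calls that a satellite-side
 * back end would need to proxy to its parent.
 */
#include <stdio.h>
#include <string.h>
#include <sys/uio.h>
#include <corosync/corotypes.h>
#include <corosync/cpg.h>
#include <corosync/quorum.h>

static void deliver_cb(cpg_handle_t handle, const struct cpg_name *group,
		       uint32_t nodeid, uint32_t pid, void *msg, size_t msg_len)
{
	/* On a satellite this would arrive via the parent, nodeid 'fixed'. */
	printf("CPG message from nodeid %u (%zu bytes)\n", nodeid, msg_len);
}

static void confchg_cb(cpg_handle_t handle, const struct cpg_name *group,
		       const struct cpg_address *members, size_t n_members,
		       const struct cpg_address *left, size_t n_left,
		       const struct cpg_address *joined, size_t n_joined)
{
	/* Satellites would need 'fake' join/leave events injected here. */
	printf("group now has %zu members\n", n_members);
}

static cpg_callbacks_t callbacks = {
	.cpg_deliver_fn = deliver_cb,
	.cpg_confchg_fn = confchg_cb,
};

int main(void)
{
	cpg_handle_t cpg;
	quorum_handle_t q;
	uint32_t quorum_type;
	int quorate = 0;
	struct cpg_name group = { .length = 4, .value = "test" };
	char payload[] = "hello from a satellite";
	struct iovec iov = { .iov_base = payload, .iov_len = strlen(payload) };

	/* CPG: join a group and multicast one message. */
	if (cpg_initialize(&cpg, &callbacks) != CS_OK ||
	    cpg_join(cpg, &group) != CS_OK)
		return 1;
	cpg_mcast_joined(cpg, CPG_TYPE_AGREED, &iov, 1);
	cpg_dispatch(cpg, CS_DISPATCH_ONE);

	/* Quorum: a satellite would get this information from its parent. */
	if (quorum_initialize(&q, NULL, &quorum_type) == CS_OK) {
		quorum_getquorate(q, &quorate);
		printf("cluster is %squorate\n", quorate ? "" : "not ");
		quorum_finalize(q);
	}

	cpg_finalize(cpg);
	return 0;
}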