Re: RFC: Extending corosync to high node counts

On 19/03/15 19:47, Darren Thompson wrote:
> Team
> 
> My first thought regarding this proposal: what mechanism is going to be
> in place to ensure that there is no "split brain" scenario?
> 
> For smaller rings we can rely on the fall-back of a "shared media/SBD
> device" to ensure that there is consistency.
> 
> If there is a comms interruption between ring members, is there a danger
> that each remaining half will then recruit new nodes from their "satellite
> spares"?


Unless I've misunderstood you, I don't think this is a problem, because
the satellites have no votes and don't affect quorum in any way. If
there is a split in the VS part of the cluster then the active partition
will be determined using the normal quorum/fencing methods, and any
satellites connected to the losing partition will therefore also be
inactive. For example, if a five-node ring splits 3/2, only the
three-node partition keeps quorum, so satellites attached to the
two-node side become inactive along with their hosts.

Chrissie


> Do we need to consider a mechanism to adapt the node configuration (e.g.
> adding SBD devices) or is that just going to complicate things further?
> 
> Darren Thompson
> 
> Professional Services Engineer / Consultant
> 
> 
> Level 3, 60 City Road
> 
> Southgate, VIC 3006
> 
> Mb: 0400 640 414
> 
> Mail: darrent@xxxxxxxxxxxxx
> Web: www.akurit.com.au
> 
> 
> On 19 March 2015 at 23:00, <discuss-request@xxxxxxxxxxxx> wrote:
> 
>     Date: Thu, 19 Mar 2015 10:05:41 +0000
>     From: Christine Caulfield <ccaulfie@xxxxxxxxxx>
>     To: discuss@xxxxxxxxxxxx
>     Subject: RFC: Extending corosync to high node counts
>     Message-ID: <550A9F75.5010309@xxxxxxxxxx>
> 
>     Extending corosync
>     ------------------
> 
>     This is an idea that came out of several discussions at the cluster
>     summit in February. Please comment!
> 
>     It is not meant to be a generalised solution to extending corosync for
>     most users: for single- and double-digit cluster sizes the current ring
>     protocols should be sufficient. This is intended to make corosync usable
>     at much larger node counts.
> 
>     The problem
>     -----------
>     Corosync doesn't scale well to large numbers of nodes (from the current
>     60-100 up to 1000s). This is mainly down to the requirements of virtual
>     synchrony (VS) and the ring protocol.
> 
>     A proposed solution
>     -------------------
>     Have 'satellite' nodes that are not part of the ring (and do not
>     participate in VS).
>     They communicate with a single 'host' node over (possibly) TCP. The host
>     sends messages to them in a 'send and forget' fashion - though TCP
>     guarantees ordering and delivery.
>     Host nodes can support many satellites. If a host goes down, its
>     satellites can reconnect to another node and carry on.
> 
>     Satellites have no votes, and do not participate in Virtual Synchrony.
> 
>     Satellites can send/receive CPG messages and get quorum information but
>     will not appear in
>     the quorum nodes list.
> 
>     There must be a separate nodes list for satellites, probably maintained
>     by a different subsystem.
>     Satellites will have nodeIDs (required for CPG) that do not clash with
>     the ring nodeids.
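> 
>     To make the idea concrete, here is a minimal sketch (in C) of the
>     per-satellite state a host node might keep. All names are hypothetical;
>     none of this is existing corosync code or API.
> 
>         #include <stdint.h>
> 
>         struct satellite {
>             uint32_t nodeid;  /* must not clash with any ring nodeid */
>             int      fd;      /* connected TCP socket to the satellite */
>             int      active;  /* cleared when the node is cut off */
>         };
> 
>         /* Satellites carry no votes, so quorum never counts them. */
>         #define SATELLITE_VOTES 0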
> 
> 
>     Appearance to the user/admin
>     ----------------------------
>     corosync.conf defines which nodes are satellites and which nodes they
>     connect to (initially); a sketch of how this might look follows below.
>     We may want some utility to force satellites to migrate away from a
>     node if it gets full.
> 
>     Future: Automatic configuration of who is in the VS cluster and who is
>     a satellite. Load balancing. Maybe we need 'preferred nodes' to avoid
>     bad network topologies.
> 
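>     One way this might look in corosync.conf - the satellitelist stanza
>     and its keys are purely hypothetical, for illustration only; only the
>     nodelist stanza is existing corosync.conf syntax:
> 
>         nodelist {
>             node {
>                 ring0_addr: ringnode1
>                 nodeid: 1
>             }
>             node {
>                 ring0_addr: ringnode2
>                 nodeid: 2
>             }
>         }
> 
>         # hypothetical stanza, not existing corosync.conf syntax
>         satellitelist {
>             satellite {
>                 addr: sat101
>                 nodeid: 101           # must not clash with ring nodeids
>                 hosts: ringnode1 ringnode2   # preferred hosts, in order
>             }
>         }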
> 
>     Potential problems
>     ------------------
>     corosync uses a packet-based protocol whereas TCP is a stream (I don't
>     see this as a big problem, TBH - a framing sketch follows this list)
>     Where to hook the message transmission in the corosync networking stack?
>       - We don't need a lot of the totem messages
>       - maybe hook into group 'a' and/or 'sync' (do we need 'sync' in
>     satellites? [CPG, so probably yes])
>     Which is client/server? (if satellites are client with authkey we get
>     easy failover and config, but ... DOS potential??)
>     What if TCP buffers get full? Suggest just cutting off the node.
>     How to stop satellites from running totemsrp?
>     Fencing, do we need it? (pacemaker problem?)
>     GFS2 - is this needed/possible?
>     Keeping two node lists (totem/quorum and satellite) - duplicate node IDs
>     are not allowed and this will need to be enforced.
>     No real idea if this will scale as well as I hope it will!
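> 
>     On the packet-vs-stream point, a minimal framing sketch, assuming a
>     hypothetical 4-byte length prefix per corosync packet (illustrative
>     only, not a proposed wire format):
> 
>         #include <stdint.h>
>         #include <unistd.h>
>         #include <arpa/inet.h>
> 
>         static int send_all(int fd, const void *buf, size_t len)
>         {
>             const char *p = buf;
>             while (len > 0) {
>                 ssize_t n = write(fd, p, len);
>                 if (n <= 0)
>                     return -1;   /* caller cuts the satellite off */
>                 p += n;
>                 len -= (size_t)n;
>             }
>             return 0;
>         }
> 
>         /* Frame = 4-byte network-order length, then the packet bytes. */
>         static int send_packet(int fd, const void *pkt, uint32_t len)
>         {
>             uint32_t hdr = htonl(len);
>             if (send_all(fd, &hdr, sizeof(hdr)) < 0)
>                 return -1;
>             return send_all(fd, pkt, len);
>         }
> 
>     The receiver reads the 4-byte length first, then exactly that many
>     bytes, recovering packet boundaries from the stream.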
> 
> 
>     How it will (possibly) work
>     ---------------------------
>     Totemsrp messages will be unaffected (in the 1st revision at least);
>     satellites are not part of this protocol.
>     Totempg messages are sent around the ring as usual.
>     When one arrives at a node with satellites, that node forwards it around
>     the ring as usual, then sends the message to each of its satellites in
>     turn (sketched below).
>     If a send fails then the node is cut off and removed from the
>     configuration.
>     When a message is received from a satellite it is repackaged as a
>     totempg message and sent around the cluster as normal.
>     Satellite nodes will be handled by another corosync service that is
>     loaded.
>     Use a new corosync service handler to maintain the extra nodes list and
>     (maybe) do the satellite forwarding.
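> 
>     A rough sketch of that forwarding step, reusing the hypothetical
>     struct satellite and send_packet() from the sketches above
>     (illustrative only):
> 
>         static void forward_to_satellites(struct satellite *sats, int nsats,
>                                           const void *msg, uint32_t len)
>         {
>             for (int i = 0; i < nsats; i++) {
>                 if (!sats[i].active)
>                     continue;
>                 if (send_packet(sats[i].fd, msg, len) < 0) {
>                     /* Send failed: cut the node off, as proposed. */
>                     sats[i].active = 0;
>                     close(sats[i].fd);
>                     /* A leave notification then goes around the cluster. */
>                 }
>             }
>         }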
> 
>     - Joining
>       A satellite sends a TCP connect and then a join request to its
>     nominated (or fallback) host.
>       The host can accept or reject this for reasons of (at least):
>        - duplicated nodeid
>        - no capacity
>        - bad key
>        - bad config
>       The service then sends new node information to the rest of the
>       cluster; quorum is not affected (see the join-check sketch below).
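> 
>       A sketch of the host's join check, mirroring the rejection reasons
>       above; all names are hypothetical and it reuses struct satellite
>       from the earlier sketch:
> 
>           enum join_result {
>               JOIN_OK,
>               JOIN_DUP_NODEID,   /* duplicated nodeid */
>               JOIN_NO_CAPACITY,  /* no capacity */
>               JOIN_BAD_KEY,      /* bad key */
>               JOIN_BAD_CONFIG    /* bad config */
>           };
> 
>           static enum join_result check_join(const struct satellite *sats,
>                                              int nsats, int max_sats,
>                                              uint32_t nodeid,
>                                              int key_ok, int config_ok)
>           {
>               /* A real check would also consult the ring nodeid list. */
>               for (int i = 0; i < nsats; i++)
>                   if (sats[i].active && sats[i].nodeid == nodeid)
>                       return JOIN_DUP_NODEID;
>               if (nsats >= max_sats)
>                   return JOIN_NO_CAPACITY;
>               if (!key_ok)
>                   return JOIN_BAD_KEY;
>               if (!config_ok)
>                   return JOIN_BAD_CONFIG;
>               /* On success: add to the satellite list, tell the cluster. */
>               return JOIN_OK;
>           }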
> 
>      - leaving
>        If a TCP send fails or a socket is disconnected then the node is
>     summarily removed
>        - there will probably also be a 'leave' message for tidy removal
>        - leave notifications are sent around the cluster so that CPG and the
>     secondary nodelist know.
>        - quorum does not need to know.
> 
>      - failover
>        Satellites have a list of all nodes (quorum and satellite) and if
>        a TCP connection is broken then they can try to contact the next
>        node in the nodeid list of quorum nodes (sketched below).
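> 
>        A sketch of that satellite-side failover; tcp_connect() is a
>        hypothetical helper returning a connected socket or -1, and
>        nothing here is existing corosync API:
> 
>            int tcp_connect(uint32_t nodeid);  /* hypothetical */
> 
>            static int satellite_failover(const uint32_t *quorum_nodeids,
>                                          int n, uint32_t failed_host)
>            {
>                int start = 0;
> 
>                /* Start after the host that just went away. */
>                for (int i = 0; i < n; i++)
>                    if (quorum_nodeids[i] == failed_host)
>                        start = i + 1;
> 
>                for (int i = 0; i < n; i++) {
>                    int fd = tcp_connect(quorum_nodeids[(start + i) % n]);
>                    if (fd >= 0)
>                        return fd;  /* re-send the join request here */
>                }
>                return -1;          /* nobody reachable; retry later */
>            }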
> 
>     Timescales
>     ----------
>     Nothing decided at this stage; certainly Corosync 3.0 at the earliest,
>     as it will break the on-wire protocol.
>     We need to do a proof-of-concept, maybe using containers to get a high
>     node count.
> 
> 

_______________________________________________
discuss mailing list
discuss@xxxxxxxxxxxx
http://lists.corosync.org/mailman/listinfo/discuss



