Hi all,

Ben has been working on the cluster infrastructure interface for cluster snapshot server failover for a while now, and in that time the two of us have covered a fair amount of conceptual ground, figuring out how to handle this reasonably elegantly and racelessly. It turned out to be a _lot_ harder than I first expected: it is just not easy to handle all the races that come up. However, I think we have worked out something fairly solid, which should serve as a model for server failover in general, not just for cluster snapshot.

We are also looking to prove that the cluster infrastructure interface we intend to propose to the community does in fact provide the right hooks to tackle such failover problems, including internode messaging that works reliably even when the membership of the cluster is changing while the failover protocols are in progress. More than once I have wondered whether we need a stronger formalism, such as virtual synchrony, to do the job properly, or whether such a formalism would at least simplify the implementation. At this point I have a tentative "no" to both questions. There appears to be no message ordering going on here that we can't take care of accurately with the simple cluster messaging facility available. The winning strategy seems to be to enumerate all the possible events, make sure each is handled, and arrange things so that ordering does not matter where it cannot conveniently be controlled.

Ben, the algorithm below is based loosely on your code, but covers some details that came up while I was putting pen to paper. No doubt I've made a booboo or two somewhere; please shout.

On to the problem. We are dealing with failover of a cluster snapshot server, where some clients have to upload state (read locks) to the new server.
On server failover, we have to enforce the following rule:

  1) Before the new server can begin to service requests, all snapshot
     clients that were connected to the previous server must have either
     reconnected (and uploaded their read locks) or left the cluster

and for client connection failover:

  2) If a snapshot client disconnects, the server can't throw away its
     locks until it knows the client has really terminated (it might
     just have suffered a temporary disconnection)

Though these rules look simple, they blow up into a fairly complex interaction between clients, the server, agents and the cluster manager.

Rule (1) is enforced by having the new server's agent carry out a roll call of all nodes, to see which of them had snapshot clients that may have been connected to the failed server. A complicating factor is that, in the middle of carrying out the roll call, we have to handle client joins and parts, and node joins and parts.

(What is an agent? Each node has a csnap agent running on it that takes care of connecting any number of device mapper csnap clients on the node to a csnap server somewhere on the cluster, controls activating a csnap server on the node if need be, and controls server failover.)

Most of the complexity that arises lands on the agent rather than the server, which is nice because the server is already sufficiently complex. The server's job is just to accept connections that agents make to it on behalf of their clients, and to notify its own agent of those connections so the agent can complete the roll call. The server's agent will also tell the server about any clients that have parted, as opposed to just being temporarily disconnected, in order to implement rule (2).

One subtle point: in normal operation (that is, after failover is complete), when a client connects to its local agent and requests a server, the agent will forward a client join message to the server's agent.
The server's agent acknowledges the join message, and only then does the remote agent establish the client's connection to the server. It has to be done this way because the remote agent can't allow a client to connect until it has received confirmation that the server's agent received the client join message. Otherwise, if the remote agent's node parts, there will be no way to know the client has parted.

Messages

There are a fair number of messages that fly around, collated here to help put the event descriptions in context.

The server agent will send:
  - roll call request to remote agents
  - roll call acknowledgement to remote agent
  - client join acknowledgement to remote agent
  - activation message to server

The server agent will receive:
  - initial node list from cluster manager
  - node part message from cluster manager
  - snapshot client connection message from server
  - roll call response from remote agent
  - client join message from remote agent
  - client part message from remote agent
  - connection from local client
  - disconnection from local client

A remote agent will receive:
  - roll call request from server agent
  - roll call acknowledgement from server agent
  - connection from local client
  - disconnection from local client
  - client join acknowledgement from server agent

A remote agent will send:
  - roll call response to server agent
  - client join message to server agent
  - client part message to server agent

A server will receive:
  - client connection
  - activation message from server agent
  - client part from server agent

A server will send:
  - client connection message to server agent

Note: the server does not tell its agent about disconnections; instead, the remote agent tells the server agent about the client part.

Note: each agent keeps a list of local client connections. It only actually has to forward clients that were connected to the failed server, but it is less bookkeeping, and does no harm, to just forward all of them.
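To make the client join handshake described above concrete, here is a rough Python model of the remote agent's side of it. This is only a sketch under my own naming (RemoteAgent, pending, connected, and the message tuples are all invented for illustration, not taken from the actual csnap code); real message plumbing is stubbed out as a callable.

```python
class RemoteAgent:
    """Toy model of a remote agent's client join handshake.

    A local client may not connect to the server until the server agent
    has acknowledged the client join message; otherwise, if this node
    parts in the meantime, the server agent would never learn that the
    client existed, and so could never report its part.
    """

    def __init__(self, send):
        self.send = send          # stand-in for messaging to the server agent
        self.pending = set()      # clients waiting for a join acknowledgement
        self.connected = set()    # clients cleared to connect to the server

    def local_client_connect(self, client):
        # Hold the client back; just announce it to the server agent.
        self.pending.add(client)
        self.send(("client_join", client))

    def join_ack(self, client):
        # Server agent confirmed the join; now the connection may proceed.
        self.pending.discard(client)
        self.connected.add(client)
```

A short run: after local_client_connect("c1") the client sits in pending with a client join message sent; only join_ack("c1") moves it to connected.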
Agent Data

  - list of local snapshot client connections
  - roll call list of nodes
      - used only for server failover
  - client list of node:client pairs
      - used for server failover
      - used for snapshot client permanent disconnection
      - each client has state "failover" or "normal"
  - a failover count
      - how many snapshot client connections are still expected

Event handling

Server Agent

Acquires the server lock (failover):
  - get the current list of nodes from the cluster manager
      - this can be received any time before activating the new server
      - some recently joined nodes may never have been connected to the
        old server, which doesn't matter
      - don't add nodes that join after this point to the roll call
        list; they can't possibly have been connected to the old server
  - set the failover count to zero
  - send a roll call request to the agent on every node in the list

Receives a roll call response:
  - remove the responding node from the roll call list
      - if the node wasn't on the roll call list, what?
  - each response carries a list of snapshot clients; for each one:
      - add it to the client list
      - set its state to "failover"
      - increase the failover count
  - send a roll call acknowledgement to the remote agent
  - if the roll call list is empty and the failover count is zero,
    activate the server

Receives a node part event from the cluster manager:
  - remove the node from the roll call list, if present
  - remove each matching client from the client list
      - if the client state was "normal", send a client part to the server
      - if the client state was "failover", decrease the failover count
  - if the roll call list is empty and the failover count is zero,
    activate the server

Receives a snapshot client connection message from the server:
  - if the client wasn't on the client list, what?
  - if the client state was "failover":
      - decrease the failover count
      - set the client state to "normal"
  - if the roll call list is empty and the failover count is zero,
    activate the server

Receives a client join from a remote agent:
  - add the client to the client list in state "normal"
  - send a client join acknowledgement to the remote agent

Receives a client part from a remote agent:
  - if the client wasn't on the client list, what?
  - remove the client from the client list
      - if the client state was "normal", send a client part to the server
      - if the client state was "failover", decrease the failover count
  - if the roll call list is empty and the failover count is zero,
    activate the server

Receives a snapshot client connection from the device:
  - send a client join message to self

Receives a snapshot client disconnection from the device:
  - send a client part message to self

Remote Agent

Receives a snapshot client connection from the device:
  - if the roll call response has already been sent, send a client join
    message to the server agent
  - otherwise, just add the client to the local snapshot client list

Receives a snapshot client disconnection from the device:
  - if the roll call response has already been sent, send a client part
    message to the server agent
  - otherwise, just remove the client from the local snapshot client list

Receives a roll call request:
  - send the entire list of local snapshot clients to the server agent

Receives a roll call acknowledgement from the server agent:
  - connect the entire list of local snapshot clients to the server

Receives a client join acknowledgement from the server agent:
  - connect the client to the server

Server

Receives a client connection:
  - send a client connection message to the server agent

Easy, huh?

Daniel
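P.S. To convince myself that the activation condition above hangs together, I wrote a toy Python model of the server agent's failover bookkeeping. All the names here (ServerAgent, maybe_activate, the message tuples) are my own invention for illustration, not the real code, and messaging is stubbed out as a recorded event list.

```python
class ServerAgent:
    """Toy model of the server agent's rule (1) bookkeeping.

    The server may activate only when every node polled in the roll call
    has answered or parted, and every snapshot client reported in the
    responses has either reconnected to the new server or parted.
    """

    def __init__(self, nodes):
        # Membership snapshot taken on winning the server lock; nodes
        # joining later can't have been connected to the old server.
        self.roll_call = set(nodes)
        self.clients = {}        # (node, client) -> "failover" or "normal"
        self.failover_count = 0  # failover-state clients still expected
        self.server_active = False
        self.sent = []           # stand-in for real messaging

    def maybe_activate(self):
        if not self.roll_call and self.failover_count == 0:
            self.server_active = True
            self.sent.append(("activate",))

    def roll_call_response(self, node, client_ids):
        self.roll_call.discard(node)
        for c in client_ids:
            self.clients[(node, c)] = "failover"
            self.failover_count += 1
        self.sent.append(("roll_call_ack", node))
        self.maybe_activate()

    def node_part(self, node):
        self.roll_call.discard(node)
        for key in [k for k in self.clients if k[0] == node]:
            if self.clients.pop(key) == "failover":
                self.failover_count -= 1
            else:
                self.sent.append(("client_part", key))
        self.maybe_activate()

    def client_connected(self, node, client):
        # The server reported that this client has (re)connected.
        if self.clients.get((node, client)) == "failover":
            self.failover_count -= 1
            self.clients[(node, client)] = "normal"
        self.maybe_activate()
```

Running a small scenario (two nodes; one reports a client that then reconnects, the other parts) ends with the server activated and the failover count at zero, as the rules require.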