On Wed, Oct 20, 2004 at 12:17:56PM -0400, Daniel Phillips wrote:
> Hi Ben,
>
> The next bit of cluster infrastructure glue that we need is the
> interface between csnap server startup and cluster membership. The
> problem we need to solve is: a new server must be sure that every
> snapshot client (as opposed to origin client) of the former server has
> either reconnected or left the cluster. This is because snapshot
> clients take read locks on chunks that they read from the origin. If a
> new server ignores those locks it could allow an origin client to
> overwrite a chunk that a snapshot client is reading.
>
> I was originally thinking about replicating the list of snapshot clients
> across all the agents using a protocol between the server, clients and
> agents, so that a new server always has the current list available.
> But this is stupid, because cman already keeps the list of cluster
> nodes on every node, and there is a csnap agent on every node that
> knows about all the snapshot clients connected to itself. So a direct
> approach is possible, as follows:

o.k. just to clarify things, because I haven't looked completely through
agent.c and your changes to the csnap server code yet: On each machine,
there is one agent per csnap device. right?

> - Once an agent succeeds in getting the exclusive on the snapshot
>   store lock, it sends a "new server" message to the csnap agent
>   on every node (alternatively, to every member of the "csnap"
>   service group, see below).

How does the agent know the ip addresses of all client nodes if not
through cman? Even through cman, is there an easy way to get the ip
address from the cman information? Or were the agents going to use
cluster sockets? Do all the agents wait on a specific port for these
external connections?

If there is only one service group for all csnap devices, but a different
csnap agent per client, what happens if a node doesn't use all the csnap
devices?
It would seem in this case that according to the service group, it would
need to respond, but there wouldn't be an agent to contact. correct? To
avoid this you might need to have one service group per csnap device, or
one agent that handles all csnap devices on a node.

> - Each agent responds by sending the (possibly empty) list of snapshot
>   client ids back to the new server's agent.
>
> - The new server's agent must keep track of membership events to know
>   if the number of replies it is expecting is reduced (we don't care
>   about any new node because it could not possibly have been connected
>   to the old server).
>
> - When all nodes have replied, the new server's agent forwards the
>   combined list of client+node ids to its local csnap server and
>   activates it.
>
> - The new csnap server must receive connections from each of the
>   snapshot clients before it will service any origin writes (might as
>   well not service anything until ready, it's simple).
>
> - If any snapshot client goes away (closes its control connection
>   with the agent) the local agent will know, and must connect on
>   behalf of the departed client, and immediately disconnect. It is
>   perfectly reasonable for a client to disappear in this way: it
>   corresponds to a user unmounting a snapshot from that node.
>
> Now, this relies on the certainty that there is a csnap agent on every
> cluster node. If there is not, then some nodes will never answer and
> the algorithm will never terminate. The question is, do we require the
> cluster device to be configured the same way on every node? For
> example, you could export a snapshot device via gnbd to nodes that do
> not require csnap devices. If we allow such things (I think we should)
> then we probably want csnap agents to form a service group. Instead of
> messaging all the nodes in the cluster, we message the members of the
> service group.
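
The reply tracking in the quoted steps above could look roughly like the
sketch below. It is only an illustration in Python, not csnap code: the
class name, the method names and the node/client ids are all made up, and
how messages and membership events actually arrive (cman, service group,
sockets) is left out. Dropping the clients of a node that replies and
then leaves is an assumption, not something the proposal states.

```python
class ReplyCollector:
    """Hypothetical bookkeeping for the new server's agent."""

    def __init__(self, members):
        # Nodes in the group when the "new server" message was sent.
        self.pending = set(members)
        # Combined (node_id, client_id) pairs reported so far.
        self.clients = []

    def on_reply(self, node_id, client_ids):
        # Ignore duplicates and nodes we are not waiting on (for example
        # nodes that joined after the message went out).
        if node_id not in self.pending:
            return
        self.pending.discard(node_id)
        self.clients.extend((node_id, cid) for cid in client_ids)

    def on_node_left(self, node_id):
        # Membership event: one fewer reply to expect. Assumption: clients
        # already reported by that node are dropped too, since they have
        # left the cluster along with it.
        self.pending.discard(node_id)
        self.clients = [(n, c) for (n, c) in self.clients if n != node_id]

    def on_node_joined(self, node_id):
        # A new node cannot have been connected to the old server.
        pass

    def done(self):
        # True once every remaining member has replied.
        return not self.pending


collector = ReplyCollector(["node1", "node2", "node3"])
collector.on_reply("node1", [7, 9])
collector.on_node_left("node3")   # never replied, no longer expected
collector.on_node_joined("node4") # ignored
collector.on_reply("node2", [])   # empty list is a valid reply
```

Once done() is true, the agent would hand collector.clients to its local
csnap server and activate it.
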
>
> Note: we can broadcast the csnap server address along with the "new
> server" message instead of fiddling with the lvb, so the snapshot store
> lock goes back to being just that instead of trying to be a messaging
> system as well.
>
> This algorithm ought to run in a few hundredths of a second even for
> large clusters, which will be the server failover time. The new server
> can be initializing itself in parallel, i.e., reading the superblock
> and recovering the journal. So this should be pretty fast.
>
> Would you like to take a run at implementing this? As far as I can see,
> cman userspace interface documentation consists of the source code, and
> some working code in clvmd and magma, so there is some digging to do.
>
> Regards,
>
> Daniel
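
On the server side, the rule from the quoted proposal that nothing is
serviced until every listed snapshot client has connected can be sketched
as a simple gate. Again this is a hypothetical Python illustration, not
the csnap server: StartupGate and its methods are invented names, and a
connection made by an agent on behalf of a departed client is treated
exactly like a real reconnect, on the assumption that the server does not
need to tell them apart.

```python
class StartupGate:
    """Hypothetical readiness check for a newly started csnap server."""

    def __init__(self, expected):
        # (node_id, client_id) pairs forwarded by the local agent.
        self.waiting = set(expected)

    def on_connect(self, node_id, client_id):
        # Called for a real reconnect, or when an agent connects on behalf
        # of a client that went away and then disconnects immediately.
        self.waiting.discard((node_id, client_id))

    def ready(self):
        # Origin writes (and everything else) are held until this is true.
        return not self.waiting


gate = StartupGate([("node1", 7), ("node1", 9)])
gate.on_connect("node1", 7)       # client 7 reconnected
still_blocked = not gate.ready()  # client 9 has not been heard from yet
gate.on_connect("node1", 9)       # agent stands in for departed client 9
```

With an empty list from the agent the gate is ready immediately, which
matches the case where the old server had no snapshot clients.
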