On Thursday 07 October 2004 23:56, David Teigland wrote:
> On Thu, Oct 07, 2004 at 03:35:47PM -0400, Daniel Phillips wrote:
> > The executive summary of your post is "my pristine, perfect service
> > manager is for symmetric systems only and keep yer steenking
> > client-server mitts away from it."
>
> Cute characterization, but false.  To quote the relevant point:
>
>   "- I think it's possible that a client-server-based csnap system
>      could be managed by SM (directly) if made to look and operate
>      more symmetrically.  This would eliminate RM from the picture."
>
> I reiterated this in the next point and have said it before.  In
> fact, I think this sort of design, if done properly, could be quite
> nice.  I'm not lobbying for one particular way of solving this
> problem, though.

If you think only of csnap agents and forget for the moment about
device mapper targets and servers, the agents seem to match the
service group model quite well.  There is one per node, and each
provides the service "able to launch a csnap server".  The recovery
framework seems useful for ensuring that a server is never launched
on a node that has left the cluster.

How to choose a good candidate node is still an open question, but
for starters, Lon's "cute" proposal to use gdlm both to choose a
candidate and to ensure that the server is unique will certainly get
something working.  In the long run, taking an EX lock on the
snapshot store seems like a very good thing for a server to do.  This
gets the resource manager off the critical (development) path.

Besides the server instantiation question, there is another problem
that needs solving: when the snapshot server fails over, the new
server must be sure that every client that was connected to the old
server has either reconnected to the new server or left the cluster.
Csnap clients don't map directly onto nodes, so cnxman can't track
the csnap client list directly; it can, however, provide membership
change events that the server (or, alternatively, the agents) can use
to maintain the list of currently connected clients.  (The server
doesn't need help adding new clients to the list, but it needs to be
told when a node has left the cluster, so that it can strike the
clients belonging to that node off the list, and disconnect them for
good measure.  It could also refuse connections from clients that are
not on cluster nodes.)

Since the list of clients isn't large and doesn't change very fast,
the server can reasonably require every csnap agent to replicate it.
So when a server fails over, the new server can retrieve the list
from the first agent that reconnects, and thus knows when it is safe
to resume servicing requests.

Regards,

Daniel
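
P.S.  To make the lock-based instantiation concrete, here is a rough
userspace sketch of what each agent might do.  All of the names are
invented for illustration; this is not the real gdlm interface, and
the declared functions are only assumptions about what such an
interface would provide.

    /* Assumed lock manager interface, declared here only so the
     * sketch is self-contained.  acquire_lock_ex() blocks until the
     * named lock is granted in exclusive mode; the lock manager is
     * assumed to recover the lock if the holding node leaves the
     * cluster. */
    typedef struct lock_handle lock_handle_t;
    extern lock_handle_t *acquire_lock_ex(const char *name);
    extern void release_lock(lock_handle_t *lock);

    /* Hypothetical helper: run a csnap server against the snapshot
     * store, returning when the server exits. */
    extern int run_csnap_server(const char *snapshot_store);

    /* Every agent runs this.  Exactly one agent wins the EX lock
     * and launches the server; the rest block in acquire_lock_ex()
     * until the winner releases the lock or its node dies.  Server
     * uniqueness falls out of lock uniqueness, and a server can
     * never be launched on a node that has left the cluster. */
    void be_server_candidate(const char *snapshot_store)
    {
        for (;;) {
            lock_handle_t *lock = acquire_lock_ex(snapshot_store);
            run_csnap_server(snapshot_store);
            release_lock(lock);
        }
    }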
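
P.P.S.  And a similarly rough sketch of the client list bookkeeping
after failover, again with invented names.  The point is that the
only thing the new server needs from cnxman is the node-departure
event; the rest is its own bookkeeping over the list replicated by
the agents.

    #include <stdlib.h>

    struct csnap_client {
        struct csnap_client *next;
        int nodeid;         /* cluster node the client runs on */
        int reconnected;    /* has it found the new server yet? */
    };

    static struct csnap_client *clients;

    /* Seed the list from the replica offered by the first agent to
     * reconnect after failover. */
    void load_client_list(struct csnap_client *replica)
    {
        clients = replica;
    }

    /* A client has found the new server: mark it present.  (How the
     * incoming connection is matched to a list entry isn't shown.) */
    void client_reconnected(struct csnap_client *client)
    {
        client->reconnected = 1;
    }

    /* Membership event from cnxman: a node left, so strike its
     * clients off the list (disconnecting them for good measure is
     * not shown). */
    void node_left(int nodeid)
    {
        struct csnap_client **p = &clients;

        while (*p) {
            struct csnap_client *c = *p;
            if (c->nodeid == nodeid) {
                *p = c->next;
                free(c);
            } else
                p = &c->next;
        }
    }

    /* Safe to resume servicing requests only when every client on
     * the list has either reconnected or been struck off above. */
    int may_service_requests(void)
    {
        struct csnap_client *c;

        for (c = clients; c; c = c->next)
            if (!c->reconnected)
                return 0;
        return 1;
    }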