On Thursday 07 October 2004 23:56, David Teigland wrote:
> On Thu, Oct 07, 2004 at 03:35:47PM -0400, Daniel Phillips wrote:
> > The executive summary of your post is "my pristine, perfect service
> > manager is for symmetric systems only and keep yer steenking
> > client-server mitts away from it."
>
> Cute characterization, but false.  To quote the relevant point:
>
>   "- I think it's possible that a client-server-based csnap system
>      could be managed by SM (directly) if made to look and operate
>      more symmetrically.  This would eliminate RM from the picture."
>
> I reiterated this in the next point and have said it before.  In
> fact, I think this sort of design, if done properly, could be quite
> nice.  I'm not lobbying for one particular way of solving this
> problem, though.

If you think only of csnap agents and forget for the moment about
device mapper targets and servers, the agents seem to match the
service group model quite well.  There is one per node, and each
provides the service "able to launch a csnap server".  The recovery
framework seems useful for ensuring that a server is never launched
on a node that has left the cluster.

How to choose a good candidate node is still an open question, but
for starters, Lon's "cute" proposal to use gdlm both to choose a
candidate and to ensure that the server is unique will certainly get
something working.  In the long run, taking an EX lock on the
snapshot store seems like a very good thing for a server to do.  This
gets the resource manager off the critical (development) path.

Besides the server instantiation question, there is another problem
that needs solving: when the snapshot server fails over, the new
server must be sure that every client that was connected to the old
server has either reconnected to the new server or left the cluster.
Csnap clients don't map directly onto nodes, so cnxman can't track
the csnap client list directly; it can, however, provide membership
change events that the server (or, alternatively, the agents) can use
to maintain the list of currently connected clients.  (The server
doesn't need help adding new clients to the list, but it needs to be
told when a node has left the cluster, so that it can strike the
clients belonging to that node off the list, and disconnect them for
good measure.  It could also refuse connections from clients that are
not on cluster nodes.)

Since the list of clients isn't large and doesn't change very fast,
the server can reasonably require every csnap agent to replicate it.
So when a server fails over, the new server can retrieve the list
from the first agent that reconnects, and thus knows when it is safe
to resume servicing requests.

Regards,

Daniel
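
P.S.  To make the lock-based instantiation concrete, here is a rough
userspace sketch of what each agent might do.  All of the names are
invented for illustration; this is not the real gdlm interface, and
the declared functions are only assumptions about what such an
interface would provide.

    /* Assumed lock manager interface, declared here only so the
     * sketch is self-contained.  acquire_lock_ex() blocks until the
     * named lock is granted in exclusive mode; the lock manager is
     * assumed to recover the lock if the holding node leaves the
     * cluster. */
    typedef struct lock_handle lock_handle_t;
    extern lock_handle_t *acquire_lock_ex(const char *name);
    extern void release_lock(lock_handle_t *lock);

    /* Hypothetical helper: run a csnap server against the snapshot
     * store, returning when the server exits. */
    extern int run_csnap_server(const char *snapshot_store);

    /* Every agent runs this.  Exactly one agent wins the EX lock
     * and launches the server; the rest block in acquire_lock_ex()
     * until the winner releases the lock or its node dies.  Server
     * uniqueness falls out of lock uniqueness, and a server can
     * never be launched on a node that has left the cluster. */
    void be_server_candidate(const char *snapshot_store)
    {
        for (;;) {
            lock_handle_t *lock = acquire_lock_ex(snapshot_store);
            run_csnap_server(snapshot_store);
            release_lock(lock);
        }
    }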
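
P.P.S.  And a similarly rough sketch of the client list bookkeeping
after failover, again with invented names.  The point is that the
only thing the new server needs from cnxman is the node-departure
event; the rest is its own bookkeeping over the list replicated by
the agents.

    #include <stdlib.h>

    struct csnap_client {
        struct csnap_client *next;
        int nodeid;         /* cluster node the client runs on */
        int reconnected;    /* has it found the new server yet? */
    };

    static struct csnap_client *clients;

    /* Seed the list from the replica offered by the first agent to
     * reconnect after failover. */
    void load_client_list(struct csnap_client *replica)
    {
        clients = replica;
    }

    /* A client has found the new server: mark it present.  (How the
     * incoming connection is matched to a list entry isn't shown.) */
    void client_reconnected(struct csnap_client *client)
    {
        client->reconnected = 1;
    }

    /* Membership event from cnxman: a node left, so strike its
     * clients off the list (disconnecting them for good measure is
     * not shown). */
    void node_left(int nodeid)
    {
        struct csnap_client **p = &clients;

        while (*p) {
            struct csnap_client *c = *p;
            if (c->nodeid == nodeid) {
                *p = c->next;
                free(c);
            } else
                p = &c->next;
        }
    }

    /* Safe to resume servicing requests only when every client on
     * the list has either reconnected or been struck off above. */
    int may_service_requests(void)
    {
        struct csnap_client *c;

        for (c = clients; c; c = c->next)
            if (!c->reconnected)
                return 0;
        return 1;
    }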