On Tue, Oct 19, 2004 at 01:42:56AM -0400, Daniel Phillips wrote:
>
> Instantiation and failover now seem to be under control, with the
> caveats above.

Daniel,

I think this is a perfectly reasonable method of failing the server
over (I'd like the clients not to be dependent on a userspace process
for reconnection, but that's another issue).  Only, unless I am
misunderstanding something, it seems to go directly against one of your
earlier requirements.

The issue is failure detection.  Previously, you indicated that you
were in favor of failure detection by the client or, at the very least,
by some outside agent.  As far as I understand the method you are
implementing, as long as the server doesn't give up the lock, it will
be treated as healthy.  Is there some method for the lock to be
revoked, or some sort of heartbeating, or have you just relaxed that
requirement?

O.k., stupid question time: if a userspace process grabs this exclusive
lock and then dies unexpectedly, does the lock automatically get freed?
If not, who is freeing the lock?  I'm probably missing something here,
but I don't quite understand how server failure detection will work.

-Ben

> The last bit of cluster infrastructure work needed is
> to teach the standby servers how to know when all the snapshot clients
> of a defunct server have either reconnected or left the cluster.  Until
> recently I'd been planning to distribute the connection list to all the
> standby servers, but that is stupid: the local cluster manager already
> knows about the connections and the agent on every node is perfectly
> capable of keeping track of them on behalf of its standby server.
>
> Regards,
>
> Daniel
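
[For readers outside the thread: the failover-by-exclusive-lock pattern
Ben is asking about can be illustrated with a minimal sketch.  This
uses POSIX flock() as a stand-in for the cluster lock manager; the real
system uses a cluster-wide DLM lock, and the lock-file path and
"become master" step here are illustrative assumptions, not the
actual implementation.  For the local flock() case at least, the
answer to Ben's question is yes: the kernel drops the lock
automatically when the holding process dies.]

/* Sketch: standby servers block on an exclusive lock; whichever
 * process holds the lock is the active server.  flock() stands in
 * for the cluster lock manager here.
 */
#include <stdio.h>
#include <fcntl.h>
#include <unistd.h>
#include <sys/file.h>

int main(void)
{
	/* All candidate servers open the same lock file
	 * (path is hypothetical). */
	int fd = open("/var/run/snapshot-server.lock",
		      O_CREAT | O_RDWR, 0644);
	if (fd < 0) {
		perror("open");
		return 1;
	}

	/* Block until the exclusive lock is granted.  While the
	 * current master is alive and holds the lock, this call
	 * sleeps.  If the master exits or is killed, the kernel
	 * releases its lock when the descriptor is closed, one
	 * waiter is granted the lock, and that waiter becomes the
	 * new master.  No explicit userspace cleanup is needed. */
	if (flock(fd, LOCK_EX) < 0) {
		perror("flock");
		return 1;
	}

	printf("lock acquired: acting as snapshot server\n");

	/* ... serve snapshot clients until exit; the lock is freed
	 * automatically when this process dies. */
	pause();
	return 0;
}

[With a cluster DLM the analogous cleanup is done by the lock manager
when the node or process is declared dead, which is presumably the
heartbeating/revocation machinery Ben's question is probing.]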