On Tue, Oct 19, 2004 at 01:42:56AM -0400, Daniel Phillips wrote:
>
> Instantiation and failover now seem to be under control, with the
> caveats above.

Daniel,

I think this is a perfectly reasonable method of failing the server
over (I'd like the clients not to be dependent on a userspace process
for reconnection, but that's another issue).  Only, unless I am
misunderstanding something, it seems to go directly against one of your
earlier requirements.

The issue is failure detection.  Previously, you indicated that you
were in favor of failure detection by the client or, at the very least,
by some outside agent.  As far as I understand the method you are
implementing, as long as the server doesn't give up the lock, it will
be treated as healthy.  Is there some method for the lock to be
revoked, or some sort of heartbeating, or have you just relaxed that
requirement?

O.k., stupid question time: if a userspace process grabs this exclusive
lock and then dies unexpectedly, does the lock automatically get freed?
If not, who is freeing the lock?  I'm probably missing something here,
but I don't quite understand how server failure detection will work.

-Ben

> The last bit of cluster infrastructure work needed is
> to teach the standby servers how to know when all the snapshot clients
> of a defunct server have either reconnected or left the cluster.  Until
> recently I'd been planning to distribute the connection list to all the
> standby servers, but that is stupid: the local cluster manager already
> knows about the connections and the agent on every node is perfectly
> capable of keeping track of them on behalf of its standby server.
>
> Regards,
>
> Daniel
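
[For readers outside the thread: the failover-by-exclusive-lock pattern
Ben is asking about can be illustrated with a minimal sketch.  This
uses POSIX flock() as a stand-in for the cluster lock manager; the real
system uses a cluster-wide DLM lock, and the lock-file path and
"become master" step here are illustrative assumptions, not the
actual implementation.  For the local flock() case at least, the
answer to Ben's question is yes: the kernel drops the lock
automatically when the holding process dies.]

/* Sketch: standby servers block on an exclusive lock; whichever
 * process holds the lock is the active server.  flock() stands in
 * for the cluster lock manager here.
 */
#include <stdio.h>
#include <fcntl.h>
#include <unistd.h>
#include <sys/file.h>

int main(void)
{
	/* All candidate servers open the same lock file
	 * (path is hypothetical). */
	int fd = open("/var/run/snapshot-server.lock",
		      O_CREAT | O_RDWR, 0644);
	if (fd < 0) {
		perror("open");
		return 1;
	}

	/* Block until the exclusive lock is granted.  While the
	 * current master is alive and holds the lock, this call
	 * sleeps.  If the master exits or is killed, the kernel
	 * releases its lock when the descriptor is closed, one
	 * waiter is granted the lock, and that waiter becomes the
	 * new master.  No explicit userspace cleanup is needed. */
	if (flock(fd, LOCK_EX) < 0) {
		perror("flock");
		return 1;
	}

	printf("lock acquired: acting as snapshot server\n");

	/* ... serve snapshot clients until exit; the lock is freed
	 * automatically when this process dies. */
	pause();
	return 0;
}

[With a cluster DLM the analogous cleanup is done by the lock manager
when the node or process is declared dead, which is presumably the
heartbeating/revocation machinery Ben's question is probing.]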