On Tuesday 19 October 2004 11:20, Benjamin Marzinski wrote:
> Daniel, I think this is a perfectly reasonable method of failing the
> server over (I'd like the clients not to be dependent on a userspace
> process for reconnection, but that's another issue). Only, unless I
> am misunderstanding something, it seems to go directly against one of
> your earlier requirements.
>
> The issue is failure detection. Previously, you indicated that you
> were in favor of failure detection by the client or, at the very
> least, some outside agent. As far as I understand the method you are
> implementing, as long as the server doesn't give up the lock, it will
> be treated as healthy.

As always, the client detects a broken server connection and asks for a
new connection.

> Is there some method for the lock to be revoked,

Killing the agent that holds it should do the job, which would be part
of stomith. There also has to be a way of giving up the lock gracefully
when a node exits the cluster voluntarily. I neglected to mention
"graceful node exit and cleanup" as another bit of infrastructure glue
still needed.

> or some sort of heart-beating, or have you just relaxed that
> requirement.

I speculated that eventually the kernel client might heartbeat its
connection; I think I used the term "ping". That still seems like a
good idea: the client can then detect connection failure when it
occurs, not just when it next attempts to service a request over the
connection. Nothing else changes.

There is also nothing preventing heartbeating at the node level as
well. The csnap bits do not have to participate in it. If the
node-level heartbeat fails, the node will be ejected and a membership
event will be delivered, which the csnap agent will pick up when that
part is implemented.

> O.k., stupid question time: If a userspace process grabs this
> exclusive lock, and then dies unexpectedly, does the lock
> automatically get freed?

Yes. Though I haven't looked closely at this, it seems the locks are
cleaned up when the fd that libdlm creates to pass lock completions to
userspace is closed. So the cleanup is not strictly coupled to process
exit; it is actually more sensible than that.

> If not, who is freeing the lock? I'm
> probably missing something here, but I don't quite understand how
> server failure detection will work.

The lock is really there primarily to enforce exclusive ownership of
the snapshot store device. If the client says the connection is bad,
the agent will believe the client and initiate recovery using the
algorithm above, which is more or less functional, but it is never
going to be entirely satisfactory until it incorporates membership
events explicitly.

Regards,

Daniel
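
PS, for the curious: here is roughly what grabbing the exclusive lock
looks like from the agent's side. This is only a sketch against the
libdlm userspace API using the synchronous dlm_lock_wait() convenience
call; the resource name and error handling are invented for
illustration, not lifted from the agent code (link with -ldlm):

    #include <stdio.h>
    #include <stdlib.h>
    #include <string.h>
    #include <libdlm.h>

    int main(void)
    {
            struct dlm_lksb lksb;
            int status;

            memset(&lksb, 0, sizeof(lksb));

            /*
             * Ask for an exclusive lock on the resource that guards
             * the snapshot store.  LKF_NOQUEUE means fail immediately
             * if another node already owns it, instead of waiting in
             * the grant queue.
             */
            status = dlm_lock_wait(LKM_EXMODE, &lksb, LKF_NOQUEUE,
                                   "csnap-store", strlen("csnap-store"),
                                   0, NULL, NULL, NULL);
            if (status || lksb.sb_status) {
                    fprintf(stderr, "lock not granted (%d/%d)\n",
                            status, lksb.sb_status);
                    exit(1);
            }

            /* ...run the snapshot server while holding the lock... */

            /*
             * Note the absence of dlm_unlock_wait() on the failure
             * paths: when the process dies, the fd that libdlm opened
             * to deliver lock completions is closed and the lock goes
             * away with it.
             */
            return 0;
    }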
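
The nice property of tying cleanup to that fd rather than to process
exit as such is that a crash, a kill by stomith and a graceful exit all
look the same to the DLM: the fd closes, the lock is released, and
exclusive ownership of the snapshot store can pass to whoever takes
over during recovery.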