On Thursday 21 October 2004 17:56, Benjamin Marzinski wrote:
> Um.. I just realized that there's a problem here.
> If the agent dies but the server doesn't, the lock will get revoked.
> While this won't interfere with the clients currently connected to
> the server, any new client (or client that gets disconnected) will
> think that there is no server, and promote its server to master....
> and data corruption will follow.
>
> As far as I can tell, the way to ensure that this doesn't happen is
> to have the server process take out the lock. That way the lock won't
> be freed unless the server process dies. Agreed?

No, the way to ensure this is to have the server die if its control
socket goes away.

However, you have pointed out why it's bad for the new server to rely
only on the lock to decide when it's safe to start processing requests,
or even to recover the journal: there may still be writes in flight
from the old server. If a server dies but its node is still in the
cluster, the new server's agent has to regard that as a valid reason
for fencing the node. This can only be handled properly at the
membership level, not at the lock level.

> If that's the case, should the server also be responsible for
> contacting the agents in the appropriate service group and getting
> the client information?

It's not the case, so we don't have to worry about it. The only
interesting argument I know of for moving infrastructure details into
the server is to get rid of one daemon, but daemons are cheap,
particularly if they sleep nearly all the time like the agent does.
It's better to keep the agent and server separate and specialized for
the time being.

Regards,

Daniel
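
For illustration only, here is a rough sketch of the "server dies if its control socket goes away" rule from the reply above. It is not the actual server code: the function name, the two-socket layout and the `handle` callback are all assumptions made for the example, and a real server would exit the process rather than return.

```python
import select
import socket


def serve_until_control_closes(control, requests, handle):
    """Handle requests until the agent's control socket closes.

    `control` is the (hypothetical) socket connecting the server to its
    agent; `requests` stands in for a client connection. Returns the
    number of requests handled. A real server would call sys.exit()
    at the point where this sketch returns.
    """
    handled = 0
    watched = [control, requests]
    while True:
        readable, _, _ = select.select(watched, [], [])
        if control in readable:
            try:
                alive = control.recv(4096) != b""
            except ConnectionError:
                alive = False
            if not alive:
                # EOF or reset on the control socket: the agent is gone,
                # so stop serving instead of running on without it.
                return handled
        if requests in readable:
            data = requests.recv(4096)
            if data:
                handle(data)
                handled += 1
            else:
                # Client hung up; stop watching it to avoid spinning.
                watched.remove(requests)
```

The control socket is checked before any client data on each pass, so the server stops taking new work as soon as it notices the agent has gone. This only covers the agent-dies case; writes already in flight from an old server still have to be dealt with by fencing at the membership level, as described above.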