On Thursday 07 October 2004 18:57, Daniel McNeil wrote:
> Daniel,
>
> Maybe you should describe what kind of help you are looking for
> from the infrastructure?

Sure, there are two separate problems:

1) Resource management

  - The resource to be instantiated is the csnap server.

  - There may never be more than one, or the snapshot metadata will
    be corrupted (this sounds like a good job for gdlm: let the
    server take an exclusive lock on the snapshot store).

  - Server instance requests come from csnap agents, one per node.
    The reply to an instance request is always a server address and
    port, whether the server had to be instantiated or was already
    running.

  - If the resource manager determines no server is running, then it
    must instantiate one, by picking one of the cluster nodes,
    finding the csnap agent on it, and requesting that the agent
    start a server.

  - When the server is instantiated in a failover path, the local
    part of that path must restrict itself to bounded memory use.
    Only a limited set of syscalls may be used in the entire failover
    path, and all must be known. Accessing a host filesystem is
    pretty much out of the question, as is on-demand library or
    plugin loading. If anything like this is required, it must be
    done at initialization time, not during failover.

2) Membership

  - If a snapshot client disconnects, the server needs to know if it
    is coming back or has left the cluster, so that it can decide
    whether to release the client's read locks.

  - If a server fails over, the new incarnation needs to know that
    all snapshot clients of the former incarnation have either
    reconnected or left the cluster.

  - There exists a snapshot client protocol variation that adds an
    additional message (confirmation of read lock release) and allows
    the snapshot server to ignore cluster membership entirely. This
    is a way of wimping out instead of dealing with interface issues.

  - Origin clients don't present a problem, because they don't hold
    locks.

Regards,

Daniel
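
For illustration only, here is a minimal sketch of the instance-request
flow described under (1), written in Python for brevity. The names
ResourceManager, request_instance, pick_node and server_failed are
hypothetical, the lock_holder field merely stands in for a real exclusive
lock on the snapshot store (such as one taken through gdlm), and the
"first node in sorted order" policy is an assumption. None of this is
csnap code.

```python
from dataclasses import dataclass
from typing import Callable, Dict, Optional, Tuple

Address = Tuple[str, int]  # (host, port) returned to every requesting agent


@dataclass
class ResourceManager:
    # Hypothetical: node name -> callable that asks that node's csnap agent
    # to start a server and returns the new server's address.
    agents: Dict[str, Callable[[], Address]]
    # Stand-in for an exclusive lock on the snapshot store.
    lock_holder: Optional[str] = None
    server: Optional[Address] = None

    def request_instance(self) -> Address:
        """Handle an instance request from any csnap agent.

        The reply is always an address and port, whether or not a
        server had to be started.
        """
        if self.server is None:
            node = self.pick_node()
            # Refuse to start a second server while the lock is held.
            if self.lock_holder is not None:
                raise RuntimeError("snapshot store already locked")
            self.lock_holder = node
            self.server = self.agents[node]()
        return self.server

    def pick_node(self) -> str:
        # Assumed policy; the description above does not name one.
        return sorted(self.agents)[0]

    def server_failed(self) -> None:
        """Forget a dead server so the next request starts a new one."""
        self.server = None
        self.lock_holder = None
```

Repeated calls to request_instance() return the same address until
server_failed() is called, after which the next request instantiates a
fresh server. A real failover path would also have to honor the bounded
memory and limited syscall constraints above, which this sketch ignores.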