Good morning,

A prototype is now checked in that implements simple instantiation and
failover for csnap servers.  The csnap server was modified to connect to the
agent and wait for an activation command before loading anything from the
snapshot store.  All nodes will now be running snapshot servers, but only one
of them will be active at a time.  This is to avoid any filesystem accesses
in the server failover path, and to keep the number of syscalls used to a
minimum so that they can all be audited for bounded memory use.

Each node runs a single csnap agent that provides userspace services to all
snapshot or origin clients of one csnap device.  On receiving a connection
from a server, the agent tries to make that server the master for the
cluster, or learn the address of an existing master, using the following
algorithm:

Repeat as necessary:

  - Try to grab Protected Write without waiting
  - If we got it, write the server address to the lvb, start the server, done
  - Otherwise, convert to Concurrent Read without waiting
  - If there's a server address in the lvb, use it, done

The agent will also attempt to do this any time a client requests a server
connection, which it will do if its original server connection breaks.

This algorithm looks simple, but it is racy:

  - Other nodes may read the lvb before the new master writes it, getting a
    stale address, particularly in the thundering rush to acquire the lock
    immediately after a server failure.

  - Other nodes may be using a stale address written by a previous master.

However, only one server can actually own the lock, and other servers will
refuse (or discard) connections from clients that have stale addresses.  So
the race doesn't seem to hurt much, and this algorithm will do for the time
being.

Eventually this needs to be tightened up.  I suspect that using the dlm to
distribute addresses is fundamentally the wrong approach, and that a
verifiable algorithm must be based directly on membership events.  The dlm
should really be doing only the job that it does well: enforcing the
exclusive lock on the snapshot store.

That said, I have an alternate dlm-based algorithm in mind that uses blocking
notifications:

Every node does:

  - Grab exclusive (blocking asts are sent out)
  - If we got it, write the lvb and demote to PW

Every node that gets a blocking ast does:

  - Demote to null, unblocking the exclusive above
  - Get CR; if the lvb has a server address, we're done
  - Otherwise, try to grab the exclusive again

This not only closes the race between writing and reading the lvb, it sends
notifications to all nodes that have stale server addresses (held in CR
mode).  So it's a little better, but it is still possible to get stale
addresses if things die at just the wrong time.  (Rough sketches of both
algorithms appear below.)

Both algorithms are potentially unbounded recursions.  The csnap agents have
no way of knowing whether somebody out there is just slow, or whether
somebody is erroneously sitting on the exclusive lock without actually
instantiating a server.  So after some number of attempts, a human operator
has to be notified, and the agent will just keep trying.  I don't like this
much at all, and it's one reason why I want to get away from lvb games,
eventually.  (I have a nagging feeling that lvb-based algorithms are only
ever thought to be reliable when they get so complex that nobody understands
them.)

Currently, this is only implemented for gdlm.  Gulm does not have PW or CR
locks, but equivalent algorithms can be devised using more than one lock.
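For concreteness, here is a rough C sketch of the first algorithm.  The lock
helpers and the start_local_server()/connect_clients_to() hooks are
placeholders standing in for the corresponding gdlm calls and agent
internals, not the code that is actually checked in:

    /*
     * Sketch only: the helpers below are assumed wrappers around the usual
     * gdlm calls (no-queue requests and conversions with a value block) and
     * around the agent's own plumbing; error handling is omitted.
     */
    #include <libdlm.h>

    struct server_address { char text[32]; }; /* assumed to fit in the lvb */

    int grab_pw_noqueue(struct dlm_lksb *lksb);     /* 0 if granted */
    int convert_cr_noqueue(struct dlm_lksb *lksb);  /* 0 if granted */
    int read_lvb(struct dlm_lksb *lksb, struct server_address *addr);
    void write_lvb(struct dlm_lksb *lksb, struct server_address *addr);
    void start_local_server(void);
    void connect_clients_to(struct server_address *addr);

    void elect_server(struct dlm_lksb *lksb, struct server_address *me)
    {
        struct server_address addr;

        for (;;) {
            if (!grab_pw_noqueue(lksb)) {
                /* We won: publish our address, activate the local server */
                write_lvb(lksb, me);
                start_local_server();
                return;
            }
            if (!convert_cr_noqueue(lksb) && read_lvb(lksb, &addr)) {
                /* Somebody else is (or was) master: give the address
                   to the waiting clients */
                connect_clients_to(&addr);
                return;
            }
            /* Lost the race but the lvb is not written yet, or it is
               stale: go around again */
        }
    }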
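And the blocking-ast variant in the same style, reusing the types and helpers
from the sketch above.  Here grab_ex_wait(), demote(), convert() and
my_address() are also assumed wrappers; a real agent would schedule the retry
for exclusive from its main loop rather than blocking inside the ast:

    /*
     * Sketch only: grab_ex_wait() is assumed to queue for exclusive and to
     * arrange for blocking_ast() to fire on nodes that are in the way;
     * demote()/convert() wrap the corresponding lock conversions.
     */
    void blocking_ast(struct dlm_lksb *lksb);

    int grab_ex_wait(struct dlm_lksb *lksb, void (*bast)(struct dlm_lksb *));
    void demote(struct dlm_lksb *lksb, int mode);
    void convert(struct dlm_lksb *lksb, int mode);
    struct server_address *my_address(void);

    void take_mastership(struct dlm_lksb *lksb, struct server_address *me)
    {
        /* Queue for exclusive; every current holder gets a blocking ast
           and is expected to demote */
        grab_ex_wait(lksb, blocking_ast);
        write_lvb(lksb, me);
        demote(lksb, LKM_PWMODE);  /* hold PW for as long as we are master */
        start_local_server();
    }

    void blocking_ast(struct dlm_lksb *lksb)
    {
        struct server_address addr;

        demote(lksb, LKM_NLMODE);  /* unblock the node going for exclusive */
        convert(lksb, LKM_CRMODE);
        if (read_lvb(lksb, &addr))
            connect_clients_to(&addr);
        else
            take_mastership(lksb, my_address()); /* no master: try for it */
    }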
If gulm is supported, there will be a separate gulm-csnap-agent vs.
gdlm-csnap-agent, and a plain csnap-agent as well, for running snapshots on a
single node without any locking libraries installed.

The current prototype only supports IPv4; however, IPv6 support only requires
changes to the user space components.

An agent must be running before a csnap server can be started or a csnap
device can be created.  Cman must be running before gdlm can be started, and
ccsd must be running before cman will start.  So a test run looks something
like this:

   ccsd
   cman_tool join
   csnap-agent @test-control
   csnap-server /dev/test-origin /dev/test-snapstore 8080 @test-control
   echo 0 497976 csnapshot 0 /dev/test-origin /dev/test-snapstore \
       @test-control | /sbin/dmsetup create testdev

For what it's worth, the server and clients can be started in any order.  The
three bits of the csnap device are bound together by the @test-control named
socket, which is fairly nice: it's hard to get this wrong.

It's a little annoying that the device names have to be stated in two places;
"you should never have to tell the computer something it already knows".
It's tempting to make them optional on the server command line: the server
can learn the device names from the device mapper target, or they can be
given on the command line to run stand-alone.

The device mapper device size (497976) in the table above is also redundant:
the size is also encoded in the snapshot store metadata.  It would be better
for the device mapper target to be told the size once it connects to a
server, but that is not the way device mapper works at present.

Instantiation and failover now seem to be under control, with the caveats
above.  The last bit of cluster infrastructure work needed is to teach the
standby servers how to know when all the snapshot clients of a defunct server
have either reconnected or left the cluster.  Until recently I'd been
planning to distribute the connection list to all the standby servers, but
that is stupid: the local cluster manager already knows about the
connections, and the agent on every node is perfectly capable of keeping
track of them on behalf of its standby server.

Regards,

Daniel