During the Bangalore "architects' summit" a couple of weeks ago, there was a discussion about making most functions of glusterd into Somebody Else's Problem. Examples include cluster membership, storage of volume configuration, and responding to changes in volume configuration. For those who haven't looked at it, glusterd is a bit of a maintenance and scalability problem, with three kinds of RPC (client to glusterd, glusterd to glusterd, glusterd to glusterfsd), its own ad-hoc transaction engine, and so on. The need for some change here is keenly felt right now as we struggle to fix all of the race conditions that have resulted from the hasty addition of synctasks to make up for poor performance elsewhere in those 44K lines of C. Delegating as much as possible of this functionality to mature code that is mostly maintained elsewhere would be very beneficial. I've done some research since those meetings, and here are some results.

The most basic idea here is to use an existing coordination service to store cluster configuration and state. That service would then take responsibility for maintaining availability and consistency of the data under its care. The best known example of such a coordination service is Apache's ZooKeeper[1], but there are others that don't have the noxious Java dependency - e.g. doozer[2] written in Go, Arakoon[3] written in OCaml, ConCoord[4] written in Python. These all provide a tightly consistent, generally hierarchical namespace for relatively small amounts of data. In addition, there are two other features that might be useful:

* Watches: register for notification of changes to an object (or directory/container), without having to poll.

* Ephemerals: certain objects go away when the client that created them drops its connection to the server(s).

Here's a rough sketch of how we'd use such a service:

* Membership: a certain small set of servers (three or more) would be manually set up as coordination-service masters (e.g. via "peer probe xxx as master"). Other servers would connect to these masters, which would use ephemerals to update a "cluster map" object. Both clients and servers could set up watches on the cluster-map object to be notified of servers joining and leaving.

* Configuration: the information we currently store in each volume's "info" file as the basis for generating volfiles (and perhaps the volfiles themselves) would be stored in the coordination service. Again, servers and clients could set watches on these objects to be notified of changes and do the appropriate graph switches, reconfigures, quorum actions, etc.

* Maintenance operations: these would still run in glusterd (which isn't going away). They would use the coordination service for leader election, to make sure the same activity isn't started twice, and to keep status updated in a way that allows other nodes to watch for changes.

* Status queries: these would be handled entirely by querying objects within the coordination service.

Of the alternatives available to us, only ZooKeeper directly supports all of the functionality we'd want. However, the Java dependency is decidedly unpleasant for us and would be totally unacceptable to some of our users. Doozer seems the closest of the remainder; it supports watches but not ephemerals, so we'd either have to synthesize those on top of doozer itself or find another way to handle membership (the only place where we use that functionality) based on the features it does have.
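To make the ephemeral question concrete, here's a minimal sketch of how we might fake them on top of doozer, using its Go client (github.com/ha/doozer). I'm going from my reading of that client's documentation, so treat the exact calls, the /gluster/members layout, the address, and the interval as assumptions rather than a design: each server keeps re-setting a presence file on a short interval, and anything that stops being refreshed gets reaped.

    // Hypothetical heartbeat loop, one per server. Paths, addresses,
    // and intervals are all made up for illustration.
    package main

    import (
        "log"
        "time"

        "github.com/ha/doozer"
    )

    const beat = 5 * time.Second

    func main() {
        conn, err := doozer.Dial("master1:8046") // one of the coordination masters
        if err != nil {
            log.Fatal(err)
        }

        path := "/gluster/members/server1"
        var rev int64 // Set() is compare-and-set on this; 0 should mean "not there yet"
        for {
            stamp := []byte(time.Now().UTC().Format(time.RFC3339))
            rev, err = conn.Set(path, rev, stamp)
            if err != nil {
                // A CAS failure here would mean somebody else wrote our
                // entry -- probably worth treating as a fencing event.
                log.Fatal(err)
            }
            time.Sleep(beat)
        }
    }

A reaper on the current leader would periodically scan /gluster/members/ and Del() any entry whose timestamp is more than a few beats old; that deletion is what everyone else's watches would report as "server left". Not as clean as a real session-based ephemeral, but it keeps the hard part - consistent storage of the map - somebody else's problem.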
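The watch side, by contrast, needs no synthesis at all; doozer's Wait call covers it. Here's an equally hypothetical sketch of a glusterd (or client) loop watching the volume-config subtree and reacting to changes - again, the glob syntax and Event fields are from my reading of the client docs, not tested code.

    // Hypothetical config watcher. Wait() blocks until something
    // matching the glob changes at a revision after the one given.
    package main

    import (
        "log"

        "github.com/ha/doozer"
    )

    func main() {
        conn, err := doozer.Dial("master1:8046")
        if err != nil {
            log.Fatal(err)
        }

        rev, err := conn.Rev() // current revision of the whole store
        if err != nil {
            log.Fatal(err)
        }

        for {
            ev, err := conn.Wait("/gluster/volumes/**", rev)
            if err != nil {
                log.Fatal(err)
            }
            rev = ev.Rev + 1 // don't see the same event twice
            log.Printf("config change at %s (rev %d)", ev.Path, ev.Rev)
            // ...regenerate volfiles, graph switch, quorum checks, etc...
        }
    }

The same loop pointed at /gluster/members/ is how clients and servers would track servers joining and leaving.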
The project also seems reasonably mature and active, though we'd probably still have to devote some time to developing our own local doozer expertise.

In a similar vein, another possibility would be to use *ourselves* as the coordination service, via a hand-configured AFR volume. This is actually an approach Kaleb and I were seriously considering for HekaFS at the time of the acquisition, and it's not without its benefits. Using libgfapi, we could keep this special volume from having to be mounted, and we already know how to secure its communication paths (something that would require additional work with the other solutions). On the other hand, it would probably require additional translators to provide both ephemerals and watches, and it might require its own non-glusterd solution to issues like failure detection and self-heal, so it doesn't exactly meet the "make it somebody else's problem" criterion.

In conclusion, I think our best (long-term) way forward would be to prototype a doozer-based version of glusterd. I could possibly be persuaded to try a "gluster on gluster" approach instead, but at this moment it wouldn't be my first choice. Are there any other suggestions or objections before I forge ahead?

[1] http://zookeeper.apache.org/
[2] https://github.com/ha/doozerd
[3] http://arakoon.org/
[4] http://openreplica.org/doc/