On Fri, Mar 22, 2013 at 10:09:44AM -0400, Jeff Darcy wrote:
> During the Bangalore "architects' summit" a couple of weeks ago, there
> was a discussion about making most functions of glusterd into Somebody
> Else's Problem. Examples include cluster membership, storage of volume
> configuration, and responding to changes in volume configuration.

Have you looked at what GFS2 does for comparison?

--b.

> For those who haven't looked at it, glusterd is a bit of a maintenance
> and scalability problem, with three kinds of RPC (client to glusterd,
> glusterd to glusterd, glusterd to glusterfsd) and its own ad-hoc
> transaction engine, etc. The need for change here is keenly felt right
> now, as we struggle to fix all of the race conditions that have
> resulted from the hasty addition of synctasks to make up for poor
> performance elsewhere in those 44K lines of C. Delegating as much of
> this functionality as possible to mature code that is mostly
> maintained elsewhere would be very beneficial. I've done some research
> since those meetings, and here are some results.
>
> The most basic idea is to use an existing coordination service to
> store cluster configuration and state. That service would then take
> responsibility for maintaining the availability and consistency of the
> data under its care. The best-known example of such a coordination
> service is Apache's ZooKeeper [1], but there are others that don't
> have the noxious Java dependency - e.g. doozer [2] written in Go,
> Arakoon [3] written in OCaml, and ConCoord [4] written in Python.
> These all provide a tightly consistent, generally hierarchical
> namespace for relatively small amounts of data. In addition, there are
> two other features that might be useful:
>
> * Watches: register for notification of changes to an object (or
>   directory/container), without having to poll.
>
> * Ephemerals: certain objects go away when the client that created
>   them drops its connection to the server(s).
>
> Here's a rough sketch of how we'd use such a service.
>
> * Membership: a small set of servers (three or more) would be manually
>   set up as coordination-service masters (e.g. via "peer probe xxx as
>   master"). Other servers would connect to these masters, which would
>   use ephemerals to update a "cluster map" object. Both clients and
>   servers could set watches on the cluster map object to be notified
>   of servers joining and leaving.
>
> * Configuration: the information we currently store in each volume's
>   "info" file as the basis for generating volfiles (and perhaps the
>   volfiles themselves) would be stored in the coordination service.
>   Again, servers and clients could set watches on these objects to be
>   notified of changes and do the appropriate graph switches,
>   reconfigures, quorum actions, etc.
>
> * Maintenance operations: these would still run in glusterd (which
>   isn't going away). They would use the coordination service for
>   leader election, to make sure the same activity isn't started twice,
>   and to keep status updated in a way that allows other nodes to watch
>   for changes.
>
> * Status queries: these would be handled entirely by querying objects
>   within the coordination service.
>
> Of the alternatives available to us, only ZooKeeper directly supports
> all of the functionality we'd want. However, the Java dependency is
> decidedly unpleasant for us, and would be totally unacceptable to some
> of our users.
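>
> To make the membership scheme above concrete, here is a minimal sketch
> against ZooKeeper's C client, since that's the one service that has
> both features today. The znode paths and server names are
> hypothetical, the parent nodes are assumed to already exist, and a
> real implementation would also re-arm the watch after each
> notification and handle session expiry:
>
>     #include <stdio.h>
>     #include <unistd.h>
>     #include <zookeeper/zookeeper.h>
>
>     /* Fires for session events and for the watch set below. */
>     static void cluster_watcher(zhandle_t *zh, int type, int state,
>                                 const char *path, void *ctx)
>     {
>         if (type == ZOO_CHANGED_EVENT)
>             printf("cluster map changed: %s\n", path);
>     }
>
>     int main(void)
>     {
>         char buf[1024];
>         int buflen = sizeof(buf);
>
>         /* Connect to the coordination-service masters. */
>         zhandle_t *zh = zookeeper_init(
>             "master1:2181,master2:2181,master3:2181",
>             cluster_watcher, 30000, NULL, NULL, 0);
>         if (!zh)
>             return 1;
>
>         /* Ephemeral member object: vanishes automatically if this
>            glusterd loses its session, so the cluster map stays
>            honest without explicit failure detection on our part. */
>         zoo_create(zh, "/gluster/members/server1", "up", 2,
>                    &ZOO_OPEN_ACL_UNSAFE, ZOO_EPHEMERAL, NULL, 0);
>
>         /* One-shot watch on the cluster map; cluster_watcher() is
>            called when it changes. */
>         zoo_wget(zh, "/gluster/cluster-map", cluster_watcher, NULL,
>                  buf, &buflen, NULL);
>
>         pause();  /* wait for watch callbacks */
>         zookeeper_close(zh);
>         return 0;
>     }
>
> Whatever service we pick, this pair of primitives - the ephemeral
> create and the watch - is what the membership and configuration pieces
> above boil down to.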
> Doozer seems the closest of the remainder; it supports watches but not
> ephemerals, so we'd either have to synthesize those on top of doozer
> itself or find another way to handle membership (the only place where
> we use that functionality) based on the features it does have. The
> project also seems reasonably mature and active, though we'd probably
> still have to devote some time to developing our own local doozer
> expertise.
>
> In a similar vein, another possibility would be to use *ourselves* as
> the coordination service, via a hand-configured AFR volume. This is
> actually an approach Kaleb and I were seriously considering for HekaFS
> at the time of the acquisition, and it's not without its benefits.
> Using libgfapi, we can avoid having to mount this special volume (see
> the sketch at the end of this message), and we already know how to
> secure its communication paths - something that would require
> additional work with the other solutions. On the other hand, it would
> probably require additional translators to provide both ephemerals and
> watches, and might require its own non-glusterd solution to issues
> like failure detection and self-heal, so it doesn't exactly meet the
> "make it somebody else's problem" criterion.
>
> In conclusion, I think our best long-term way forward would be to
> prototype a doozer-based version of glusterd. I could possibly be
> persuaded to try a "gluster on gluster" approach instead, but at this
> moment it wouldn't be my first choice. Are there any other suggestions
> or objections before I forge ahead?
>
> [1] http://zookeeper.apache.org/
> [2] https://github.com/ha/doozerd
> [3] http://arakoon.org/
> [4] http://openreplica.org/doc/
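>
> For the record, here's roughly what the gluster-on-gluster variant's
> access path would look like through libgfapi. This is a sketch only,
> with a hypothetical "coord" volume and object path, and with none of
> the watch/ephemeral machinery that would still have to be built on
> top:
>
>     #include <string.h>
>     #include <fcntl.h>
>     #include <glusterfs/api/glfs.h>
>
>     int main(void)
>     {
>         /* Attach to the hand-configured AFR volume without
>            mounting it. */
>         glfs_t *fs = glfs_new("coord");
>         glfs_set_volfile_server(fs, "tcp", "master1", 24007);
>         if (glfs_init(fs) != 0)
>             return 1;
>
>         /* Store a volume's configuration as an object, much as
>            glusterd writes the "info" file today. */
>         const char *info = "type=replicate\nreplica-count=2\n";
>         glfs_fd_t *fd = glfs_creat(fs, "/volumes/myvol/info",
>                                    O_WRONLY | O_TRUNC, 0600);
>         if (fd) {
>             glfs_write(fd, info, strlen(info), 0);
>             glfs_close(fd);
>         }
>
>         glfs_fini(fs);
>         return 0;
>     }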