On Fri, Mar 22, 2013 at 10:09:44AM -0400, Jeff Darcy wrote:
> During the Bangalore "architects' summit" a couple of weeks ago, there
> was a discussion about making most functions of glusterd into Somebody
> Else's Problem. Examples include cluster membership, storage of volume
> configuration, and responding to changes in volume configuration.

Have you looked at what GFS2 does for comparison?

--b.

> For those who haven't looked at it, glusterd is a bit of a maintenance
> and scalability problem, with three kinds of RPC (client to glusterd,
> glusterd to glusterd, glusterd to glusterfsd) and its own ad-hoc
> transaction engine, etc. The need for change here is keenly felt right
> now, as we struggle to fix all of the race conditions that have
> resulted from the hasty addition of synctasks to make up for poor
> performance elsewhere in those 44K lines of C. Delegating as much of
> this functionality as possible to mature code that is mostly
> maintained elsewhere would be very beneficial. I've done some research
> since those meetings, and here are some results.
>
> The most basic idea is to use an existing coordination service to
> store cluster configuration and state. That service would then take
> responsibility for maintaining the availability and consistency of the
> data under its care. The best-known example of such a coordination
> service is Apache's ZooKeeper [1], but there are others that don't
> have the noxious Java dependency - e.g. doozer [2] written in Go,
> Arakoon [3] written in OCaml, and ConCoord [4] written in Python.
> These all provide a tightly consistent, generally hierarchical
> namespace for relatively small amounts of data. In addition, there are
> two other features that might be useful:
>
> * Watches: register for notification of changes to an object (or
>   directory/container), without having to poll.
>
> * Ephemerals: certain objects go away when the client that created
>   them drops its connection to the server(s).
>
> Here's a rough sketch of how we'd use such a service.
>
> * Membership: a small set of servers (three or more) would be manually
>   set up as coordination-service masters (e.g. via "peer probe xxx as
>   master"). Other servers would connect to these masters, which would
>   use ephemerals to update a "cluster map" object. Both clients and
>   servers could set watches on the cluster map object to be notified
>   of servers joining and leaving.
>
> * Configuration: the information we currently store in each volume's
>   "info" file as the basis for generating volfiles (and perhaps the
>   volfiles themselves) would be stored in the coordination service.
>   Again, servers and clients could set watches on these objects to be
>   notified of changes and do the appropriate graph switches,
>   reconfigures, quorum actions, etc.
>
> * Maintenance operations: these would still run in glusterd (which
>   isn't going away). They would use the coordination service for
>   leader election, to make sure the same activity isn't started twice,
>   and to keep status updated in a way that allows other nodes to watch
>   for changes.
>
> * Status queries: these would be handled entirely by querying objects
>   within the coordination service.
>
> Of the alternatives available to us, only ZooKeeper directly supports
> all of the functionality we'd want. However, the Java dependency is
> decidedly unpleasant for us, and would be totally unacceptable to some
> of our users.
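>
> To make the membership scheme above concrete, here is a minimal sketch
> against ZooKeeper's C client, since that's the one service that has
> both features today. The znode paths and server names are
> hypothetical, the parent nodes are assumed to already exist, and a
> real implementation would also re-arm the watch after each
> notification and handle session expiry:
>
>     #include <stdio.h>
>     #include <unistd.h>
>     #include <zookeeper/zookeeper.h>
>
>     /* Fires for session events and for the watch set below. */
>     static void cluster_watcher(zhandle_t *zh, int type, int state,
>                                 const char *path, void *ctx)
>     {
>         if (type == ZOO_CHANGED_EVENT)
>             printf("cluster map changed: %s\n", path);
>     }
>
>     int main(void)
>     {
>         char buf[1024];
>         int buflen = sizeof(buf);
>
>         /* Connect to the coordination-service masters. */
>         zhandle_t *zh = zookeeper_init(
>             "master1:2181,master2:2181,master3:2181",
>             cluster_watcher, 30000, NULL, NULL, 0);
>         if (!zh)
>             return 1;
>
>         /* Ephemeral member object: vanishes automatically if this
>            glusterd loses its session, so the cluster map stays
>            honest without explicit failure detection on our part. */
>         zoo_create(zh, "/gluster/members/server1", "up", 2,
>                    &ZOO_OPEN_ACL_UNSAFE, ZOO_EPHEMERAL, NULL, 0);
>
>         /* One-shot watch on the cluster map; cluster_watcher() is
>            called when it changes. */
>         zoo_wget(zh, "/gluster/cluster-map", cluster_watcher, NULL,
>                  buf, &buflen, NULL);
>
>         pause();  /* wait for watch callbacks */
>         zookeeper_close(zh);
>         return 0;
>     }
>
> Whatever service we pick, this pair of primitives - the ephemeral
> create and the watch - is what the membership and configuration pieces
> above boil down to.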
> Doozer seems the closest of the remainder; it supports watches but not
> ephemerals, so we'd either have to synthesize those on top of doozer
> itself or find another way to handle membership (the only place where
> we use that functionality) based on the features it does have. The
> project also seems reasonably mature and active, though we'd probably
> still have to devote some time to developing our own local doozer
> expertise.
>
> In a similar vein, another possibility would be to use *ourselves* as
> the coordination service, via a hand-configured AFR volume. This is
> actually an approach Kaleb and I were seriously considering for HekaFS
> at the time of the acquisition, and it's not without its benefits.
> Using libgfapi, we can avoid having to mount this special volume (see
> the sketch at the end of this message), and we already know how to
> secure its communication paths - something that would require
> additional work with the other solutions. On the other hand, it would
> probably require additional translators to provide both ephemerals and
> watches, and might require its own non-glusterd solution to issues
> like failure detection and self-heal, so it doesn't exactly meet the
> "make it somebody else's problem" criterion.
>
> In conclusion, I think our best long-term way forward would be to
> prototype a doozer-based version of glusterd. I could possibly be
> persuaded to try a "gluster on gluster" approach instead, but at this
> moment it wouldn't be my first choice. Are there any other suggestions
> or objections before I forge ahead?
>
> [1] http://zookeeper.apache.org/
> [2] https://github.com/ha/doozerd
> [3] http://arakoon.org/
> [4] http://openreplica.org/doc/
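>
> For the record, here's roughly what the gluster-on-gluster variant's
> access path would look like through libgfapi. This is a sketch only,
> with a hypothetical "coord" volume and object path, and with none of
> the watch/ephemeral machinery that would still have to be built on
> top:
>
>     #include <string.h>
>     #include <fcntl.h>
>     #include <glusterfs/api/glfs.h>
>
>     int main(void)
>     {
>         /* Attach to the hand-configured AFR volume without
>            mounting it. */
>         glfs_t *fs = glfs_new("coord");
>         glfs_set_volfile_server(fs, "tcp", "master1", 24007);
>         if (glfs_init(fs) != 0)
>             return 1;
>
>         /* Store a volume's configuration as an object, much as
>            glusterd writes the "info" file today. */
>         const char *info = "type=replicate\nreplica-count=2\n";
>         glfs_fd_t *fd = glfs_creat(fs, "/volumes/myvol/info",
>                                    O_WRONLY | O_TRUNC, 0600);
>         if (fd) {
>             glfs_write(fd, info, strlen(info), 0);
>             glfs_close(fd);
>         }
>
>         glfs_fini(fs);
>         return 0;
>     }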