On Tue, 25 Oct 2011 19:01:33 -0700
Harry Mangalam <harry.mangalam at uci.edu> wrote:

> We're considering implementing gluster for a genomics cluster, and it
> seems to have some theoretical advantages that so far seem to have
> been borne out in some limited testing, mod some odd problems with an
> inability to delete dir trees. I'm about to test with the latest
> beta that was promised to clear up these bugs, but as I'm doing that,
> answers to these Qs would be appreciated...
>
> - what happens in a distributed system if a node goes down? Does the
> rest of the system keep working with the files on that brick
> unavailable until it comes back or is the filesystem corrupted? In
> my testing, it seemed that the system indeed kept working and added
> files to the remaining systems, but that files that were hashed to
> the failed volume were unavailable (of course).

Yes, this is what I would expect (and have always observed) when using
just distribution without replication. Not only are existing files on
the failed brick unavailable, but IMX attempts to create new files
which would hash to that brick (effectively a random 1/N) also fail.
That part, at least, is fixable. With replication, the single-brick
failure would effectively be invisible to the distribution layer, so
even this glitch wouldn't occur.

> - is there a head node? the system is distributed but you're
> mounting a specific node for the glusterfs mount - if that node goes
> down, is the whole filesystem hosed or is that node reference really
> a group reference and the gluster filesystem continues with the loss
> of that node's files? ie can any gluster node replace a mountpoint
> node and does that happen transparently? (I haven't tested this).

The node that you specify for the mount is really only used to fetch
the volfile, which contains the names of all bricks that are involved
in providing service for that volume. The mount node might not even be
one of those nodes itself (e.g. mount from gluster1 while the bricks
are actually on gluster2 and gluster3). Once the connections have been
made to each brick, they're all equal, and the failure of one will
have only a partial (if any) effect. (See the mount sketch at the end
of this message.)

> - can you intermix distributed and mirrored volumes? This is of
> particular interest since some of our users want to have replicated
> data and some don't care.

Every volume is inherently distributed (even if there's only one
brick), and can optionally be striped and/or replicated as well,
independently of what's being done for other volumes.
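
To make that concrete, here's a rough sketch of how you might carve
out a replicated volume for the users who want mirroring alongside a
plain distributed volume for those who don't, on the same pair of
servers. The server names, brick paths, and volume names below are
just placeholders, and the exact create syntax can vary a bit between
GlusterFS releases:

    # replicated volume (replica pairs; add more brick pairs later to
    # get distribute-over-replicate)
    gluster volume create mirrored-vol replica 2 transport tcp \
        gluster2:/data/brick1 gluster3:/data/brick1
    gluster volume start mirrored-vol

    # plain distributed volume on the same servers, no replication
    gluster volume create scratch-vol transport tcp \
        gluster2:/data/brick2 gluster3:/data/brick2
    gluster volume start scratch-vol

On the replicated volume a single brick going down should be largely
invisible to clients; on the plain distributed one, files hashed to
the dead brick simply become unavailable until it comes back.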
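
And to illustrate the point about the mount server only being used to
fetch the volfile, a sketch of a client mount, again with placeholder
names. Some mount.glusterfs versions also accept a backup volfile
server option so the initial volfile fetch isn't itself a single point
of failure; check whether your release supports it:

    # gluster1 is only contacted to fetch the volfile; the actual I/O
    # goes straight to the bricks on gluster2 and gluster3
    mount -t glusterfs gluster1:/mirrored-vol /mnt/mirrored-vol

    # if your version supports it, name a fallback volfile server too
    mount -t glusterfs -o backupvolfile-server=gluster2 \
        gluster1:/mirrored-vol /mnt/mirrored-vol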