Questions from an Ignoramus

jdarcy at redhat.com (Jeff Darcy) · Thu, 10 Feb 2011 10:04:17 -0500

On 02/09/2011 08:36 PM, Doug Schouten wrote:
> Also, how robust is GlusterFS? We probably want to stripe the data to 
> improve performance, but if a server dies, does the file catalogue go 
> with it, resulting in total data loss? Or does the meta-data get 
> replicated somehow so that one can recover the partial files?

There is no central catalog or metadata server.  All GlusterFS data and
metadata is pretty directly reflected in data and metadata on the
servers' local filesystems, so *in general* you could take GlusterFS
entirely out of the picture and trivially reconstruct a unified view
just by copying all of those local filesystems into one place.  There
are two notable exceptions, though:

* If you use DHT/distribute, each file will exist as a complete local
file on one "brick" but there might also be "linkfiles" (zero length,
sticky bit set, distinctive xattrs) in the same place on other bricks.
If you were to attempt "all into one" recovery as described above, you'd
have to exclude the linkfiles or else they might overwrite (truncate)
the real files.

* If you use N-way striping, each file will exist as N files on N
bricks.  Each of these files will be non-zero-length but will contain
the data for only 1/N of the file's blocks; the rest will be "holes"
that read as zero but are actually unallocated space.  There
*is* information attached to each file (as xattrs) that identifies which
stripe component it is.  Recovery in this case would require reading
that information and using "dd" or similar to reassemble the N files
back into one, so it's a little more tedious than the non-striping case
but not prohibitively difficult.
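
A minimal sketch of the two recovery cases above, in Python.  Everything
here is an assumption for illustration, not GlusterFS tooling: the
function names are hypothetical, the linkfile test checks only the
zero-length and sticky-bit traits (a real script should also confirm the
distinctive xattrs), and the reassembly assumes round-robin block
assignment with each component keeping its blocks at their original
offsets.  The stripe order and block size are passed in by hand; in
practice they would come from the xattrs mentioned above.

```python
import os
import stat


def looks_like_linkfile(mode, size):
    """Heuristic only: regular file, zero length, sticky bit set."""
    return stat.S_ISREG(mode) and size == 0 and bool(mode & stat.S_ISVTX)


def reassemble_stripes(component_paths, block_size, out_path):
    """Merge N sparse stripe components into one file.

    component_paths[i] must be the component with stripe index i.
    Assumes block b lives in component (b % N) at offset b * block_size.
    """
    n = len(component_paths)
    # The component holding the last block extends to the real file size.
    total = max(os.path.getsize(p) for p in component_paths)
    sources = [open(p, "rb") for p in component_paths]
    try:
        with open(out_path, "wb") as out:
            offset = 0
            while offset < total:
                src = sources[(offset // block_size) % n]
                src.seek(offset)
                want = min(block_size, total - offset)
                chunk = src.read(want)
                # A short read means a trailing hole; it reads as zeros.
                out.write(chunk.ljust(want, b"\0"))
                offset += block_size
    finally:
        for f in sources:
            f.close()
```

For the plain DHT case, a copy script could call
looks_like_linkfile(st.st_mode, st.st_size) on the result of os.lstat()
and skip any path for which it returns True.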

As you can see, recovery is pretty simple, but it's also important to
keep in mind what happens between the time a server dies and when you
recover.  If you're using simple DHT and relying on RAID (instead of
GlusterFS replication) for data protection, you're still vulnerable to
failure of non-storage components on a server.  If such a failure were
to happen, then 1/N of your files - for all practical purposes at random
- would become inaccessible.  I've also seen problems creating new files
whose names hash into the "gap" that the dead server leaves in the hash
space DHT uses to distribute files, though most of these seem to have
been fixed in 3.1 or later.  It's very disconcerting when it happens.  My
recommendation would be to plan for migration to a scheme where each
server exposes two smaller "bricks" (which might still use RAID
internally) with GlusterFS replication between bricks on different
servers to protect fully against this kind of failure.
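
To make the "1/N of your files, effectively at random" point concrete,
here is a toy model in Python.  It is only an illustration under stated
assumptions: MD5 stands in for whatever hash DHT really uses, the hash
space is split into equal ranges, and brick_for and fraction_unavailable
are hypothetical names, not GlusterFS functions.

```python
import hashlib

HASH_SPACE = 2 ** 32


def brick_for(name, num_bricks):
    """Map a file name to a brick by splitting a 32-bit hash space evenly."""
    # Stand-in hash: first 4 bytes of MD5, NOT the real DHT hash function.
    h = int.from_bytes(hashlib.md5(name.encode()).digest()[:4], "big")
    return h * num_bricks // HASH_SPACE


def fraction_unavailable(names, num_bricks, dead_brick):
    """Fraction of names whose hash falls in the dead brick's range."""
    lost = sum(1 for n in names if brick_for(n, num_bricks) == dead_brick)
    return lost / len(names)


names = [f"file-{i}.dat" for i in range(10000)]
# With 4 bricks, roughly a quarter of the names land on any one brick.
print(fraction_unavailable(names, 4, 2))
```

Which names fall in the dead range depends only on their hashes, so the
inaccessible files are scattered across every directory rather than
grouped in any way a user would recognize.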