On 05/11/2011 08:35 AM, Nyamul Hassan wrote:
> 1. Can we mount a GlusterFS on a client and expect it to provide
> sustained throughput near wirespeed? Is there any subjective
> comparison between reading from GlusterFS and reading from local
> drives? Does it put extra pressure on the client?

As others have said, it depends on what you're doing. I've measured
over 900MB/s per server across three servers from one client using
10GbE, but only for very well-behaved workloads - large sequential I/O
with many concurrent threads. Use fewer threads or smaller requests,
throw in more random or synchronous operations, do more with metadata
than data, and your performance will drop from very good to quite
poor.

> 2. Reliability + Scalability means Distributed Replicated volumes.
> Initially this might be enough for our needs, but as our read
> requirements grow, the Striped option looks promising. Is it
> possible to mix Distributed + Replicated + Striped?

It is possible in the code, but there's no configuration support for
it. In other words, you can't do it with "gluster volume create", and
any "gluster volume set" is likely to undo any manual changes you've
made to the volume configuration files.

I've generally found the stripe translator to be of very limited use
anyway. In every test I've done, the overhead from splitting and
reassembling requests, and even from just having another translator in
the stack, has overwhelmed any advantage from splitting an individual
I/O across connections. The only compelling reason I've heard for
using the stripe translator has nothing to do with performance.
Without striping, the maximum size of a file is limited to the maximum
available space on any one brick. You can use striping to distribute
the space used by that file across multiple bricks and thus get beyond
that limit.

> 3. What happens to very large files. Say 100 GB files. Are they
> kept as a single file in every node that has the file? Or is it
> split up and distributed in blocks?

The "distribute" translator (a.k.a. DHT) will place the entire
contents of a file onto one of its component subvolumes - a single
brick, or a replica set if you're doing replication as well. Without
striping, that's the end of the story. With striping, DHT will place
the entire file onto one stripe set, which will then store the
contents on N bricks or replica sets below that. See above for an
explanation of why this might be useful in some very limited cases.
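
To make these answers more concrete, here are a few rough sketches.
First, the kind of well-behaved workload I mean for the first question
- many threads doing large sequential I/O. The mount point and sizes
here are made up, and a real benchmark tool will give you better
numbers, but something like this gets you into the right ballpark:

    # assumes a GlusterFS volume mounted at /mnt/gluster (hypothetical)
    # eight concurrent writers, each streaming 4GB in 1MB requests
    for i in $(seq 1 8); do
        dd if=/dev/zero of=/mnt/gluster/seqtest.$i bs=1M count=4096 &
    done
    wait
    # read them back the same way to exercise read throughput
    for i in $(seq 1 8); do
        dd if=/mnt/gluster/seqtest.$i of=/dev/null bs=1M &
    done
    wait

Drop to one or two threads, shrink bs to 4k, or make the access
pattern random, and you'll see the drop-off I described.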
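
For point 2, what "gluster volume create" does support is distribute
layered on top of replicate. Server and brick names here are
hypothetical; with four bricks and "replica 2" you get two replica
pairs, and DHT distributes files across those pairs:

    gluster volume create myvol replica 2 \
        server1:/export/brick1 server2:/export/brick1 \
        server3:/export/brick1 server4:/export/brick1

Bricks are grouped into replica sets in the order you list them, so
server1/server2 form one pair and server3/server4 the other. It's
adding "stripe N" into that same mix that lacks the configuration
support I mentioned.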
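
And for point 3, you can see DHT's whole-file placement directly on
the servers. With a volume like the one above and no striping, a 100GB
file written on the client lands intact on exactly one replica pair
(paths hypothetical again):

    # on the client
    dd if=/dev/zero of=/mnt/gluster/bigfile bs=1M count=102400

    # on the servers: the whole file is on one replica pair's bricks,
    # and simply absent from the others
    ls -lh /export/brick1/bigfile

With striping you'd instead find a piece of it on every brick in one
stripe set, each holding an interleaved subset of the blocks.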