small files and cluster/stripe

jonah at eecs.berkeley.edu (Jeff Anderson-Lee) · Thu, 13 May 2010 17:27:45 -0700

On 5/13/2010 5:05 PM, Craig Carl wrote:
> Jeff -
>    Two comments/ideas.
>
> 1. If you are limited to four pieces of hardware, the minimum for 
> stripe, and you want to stripe some of the data and just distribute 
> other files there is a way to do that. Ideally you would use your 
> hardware RAID controllers to create two LUNs on each host, one for 
> distribute, the other for stripe. If you don't have hardware RAID you 
> could use LVM2 or ZFS to achieve the same thing. (or you could use 
> folders)
>        1a. Once you have two file systems created use 
> glusterfs-volgen  to create the vol files for the distribute export 
> just like you normally would.
>        1b. Move the files you just created to the storage servers and 
> clients.
>        1c. Re-run glusterfs-volgen this time for the stripe, adding 
> the -p option and specifying a port. (something above 1024, not 6996).
>        1d. Move the files you just created to the storage servers and 
> clients.
>        1e. Start Gluster twice on all the servers, specifying the 
> different vol files.
>        1f. You now have two GlusterFS exports, one distribute, the 
> other stripe.
>
>        1g. You can mount one inside the other on the client if that 
> makes management easier.
> There are advantages to this model, having two separate Gluster 
> instances significantly improves parallelism on the storage servers. 
> You can manage the two instances as if they are on different iron.
>
>
> 2. The use case for stripe is vanishingly small. If you have very 
> large files (at least 2X the amount of memory in your storage servers 
> and a minimum of 50GB) with very limited writes and simultaneous 
> access from hundreds of clients then maybe stripe might be 
> appropriate. Stripe was designed for a specific type of HPC problem 
> solving, not general file serving. Our video streaming users don't use 
> stripe, even though that is an obvious use, there are better ways to 
> configure Gluster for that. If you could share the type of 
> content/access methods/iops per sec we could make some specific 
> suggestions.
>

We *are* a quasi-HPC environment.  We have 100+ batch compute servers 
with 500+ cores, all with GbT interfaces, pounding on an old NAS storage 
server.  We are trying to replace the old shared staging area with new 
hardware.  We've been looking at an Isilon solution, which performs well 
for the task but costs 4x to 5x what a Gluster solution would price out 
at for similar-sized hardware/space.

Some of our users have millions of small files, some have thousands of 
large files, some have one or two humongous files.  If all the data were 
just one size or another all would be well.  All files are currently 
stored in the same shared staging area.  Our users are not HPC 
programmers and tend to program in high-level languages such as MATLAB, 
so we try to be as accommodating as possible, rather than force them to 
manage the data distribution.

We'd love a solution that would (a) spread small files over multiple 
volumes as well as (b) spread large files over multiple volumes.   
Cluster/distribute would work for the former and cluster/stripe for the 
latter.  A marriage of the two would be great.
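
To make the "marriage" concrete, here is a rough Python sketch of the 
placement policy I have in mind.  It is only an illustration: the size 
cutoff, the block size, the CRC32 hash and the place() helper are all 
assumptions of mine, not anything Gluster provides.

```python
import zlib

STRIPE_THRESHOLD = 1024 ** 3   # hypothetical cutoff: stripe files of 1 GiB and up
BLOCK = 128 * 1024             # hypothetical stripe block size

def place(name, size, n_volumes):
    """Return {volume_index: bytes_stored} for one file under a hybrid policy."""
    if size < STRIPE_THRESHOLD:
        # Small file: the whole file goes to one volume chosen by a name hash.
        return {zlib.crc32(name.encode()) % n_volumes: size}
    # Large file: deal fixed-size blocks round-robin across all volumes.
    layout = {}
    offset = 0
    index = 0
    while offset < size:
        chunk = min(BLOCK, size - offset)
        vol = index % n_volumes
        layout[vol] = layout.get(vol, 0) + chunk
        offset += chunk
        index += 1
    return layout

small = place("notes.txt", 4096, 4)
big = place("huge.mat", 5 * 1024 ** 3, 4)
print("small file lands on volumes:", sorted(small))
print("large file lands on volumes:", sorted(big))
```

Small files stay whole on a single volume, while large ones are spread 
over every volume, which is the behaviour (a) plus (b) described above.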

Right now I'm trying to patch together a temporary testbed using a bunch 
of old machines with two 143GB drives each.  The problem is that many 
files are multi-GB and unless they are striped they could easily fill up 
a volume with poor hash distributions.  Likewise many small files could 
swamp the low-end disk in a stripe volume.
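
A back-of-the-envelope Python sketch of the first worry follows.  It is 
a toy model under stated assumptions: the MD5-based brick_for() is a 
stand-in and not Gluster's actual hash, the 128 KiB block size and the 
twelve 20 GB files are made up, and four bricks is just the stripe 
minimum mentioned above.

```python
import hashlib

def brick_for(name, n_bricks):
    """Pick a brick from a hash of the file name (stand-in, not Gluster's hash)."""
    digest = hashlib.md5(name.encode()).digest()
    return int.from_bytes(digest[:4], "big") % n_bricks

def distribute_usage(files, n_bricks):
    """Distribute model: each whole file lands on exactly one brick."""
    used = [0] * n_bricks
    for name, size in files:
        used[brick_for(name, n_bricks)] += size
    return used

def stripe_usage(files, n_bricks, block=128 * 1024):
    """Stripe model: each file is cut into blocks dealt round-robin over bricks."""
    used = [0] * n_bricks
    for _, size in files:
        full, rest = divmod(size, block)
        for b in range(n_bricks):
            # Number of full blocks that land on brick b.
            count = full // n_bricks + (1 if b < full % n_bricks else 0)
            used[b] += count * block
        # The trailing partial block goes to the next brick in the rotation.
        used[full % n_bricks] += rest
    return used

GB = 1024 ** 3
files = [(f"run{i}.dat", 20 * GB) for i in range(12)]  # hypothetical workload
d = distribute_usage(files, 4)
s = stripe_usage(files, 4)
print("distribute, GB per brick:", [u // GB for u in d])
print("stripe, GB per brick:    ", [u // GB for u in s])
```

With only a handful of multi-GB files, the distribute numbers depend 
entirely on where the hash happens to send each name, while the stripe 
numbers stay even by construction.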

I suppose we could create two pools and tell the predominantly small 
file users to use one and the predominantly large file users to use the 
other, but somehow I would not hold my breath on it working out.

Jeff