(redirecting to gluster-devel as a more appropriate forum)

On 04/03/2013 05:03 PM, Jay Vyas wrote:
> Suppose I was going to serve a petabyte of data sharded over 10 files
> (1, 2, 3, ..., 10) over GlusterFS, on 3 servers (call them Server1,
> Server2, and Server3).
>
> The 3 servers would need access to the files such that:
>
> Server 1 will usually only access file 1.
> Server 2 will usually only access file 2.
> Server 3 will access all ten files (the whole data set).
>
> Is there a way to get Gluster to rebalance bricks over time based on
> access patterns ... or otherwise, what is the best way to increase the
> average locality of access to files in the cluster?

The flippant answer would be to move the computation to the data instead of vice versa, like Hadoop is designed to do. ;)

The less flippant answer is going to get a bit more complicated. There are three ways you can control the placement of a file, but none are really supported and all could get you into trouble.

The first method is to create the file (or a copy) with a special name of the form file@dht:subvol, where the parts have the following meanings:

* file = the file name you really want
* dht = the name of the DHT translator in your client-side volfile
* subvol = the name (from the same volfile) of the DHT subvolume where you want the file to go

This is reasonably safe, because it's part of how rebalance works.

To get even fancier than that, you need to know something about how the DHT translator uses "layouts" on directories to place files. There's a description here:

http://hekafs.org/index.php/2012/03/glusterfs-algorithms-distribution/

The problem is that the user has very little control over how these layouts are generated. One thing you can do that's fairly easy is swap the layout xattrs on two bricks, which (after a rebalance) will swap which files they contain.
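As a sketch of that first (special-name) method: all the names below are hypothetical — for a volume called "myvol", the client volfile's DHT translator is typically named "myvol-dht" and its subvolumes "myvol-client-0", "myvol-client-1", and so on, but check your actual client volfile (usually under /var/lib/glusterd/vols/<vol>/) for the real names.

```shell
# On a real cluster this would be your GlusterFS mount point; a scratch
# directory is used here only so the command can run anywhere.
MOUNT=$(mktemp -d)

# On a GlusterFS mount, creating this special name tells DHT to place
# "file1" on the subvolume "myvol-client-0" -- the same mechanism
# rebalance uses internally. (On a plain local directory, as here, it
# just creates a file with an odd name.)
touch "$MOUNT/file1@myvol-dht:myvol-client-0"
```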
For example, if your file is on brick2 and you want it to be on brick1, you swap the xattr values for that directory within brick1 and brick2.

The ultimate level of control is to calculate your own layouts. For this to be useful in a scenario like yours, you'd need to copy or reverse-engineer the code in the DHT translator that calculates the hash for a file. Knowing that, you could do something like this:

* assign a range for brick1 that contains the hash for file1
* assign a range for brick2 that contains the hash for file2
* assign the remaining range to brick3

I'm working on some mechanisms, and accompanying management/interface models, to provide this sort of control in a less hacker-ish form. Unfortunately, I'm tied down with about ten higher priorities, so I don't have any idea when that will be ready. In the meantime, please try these techniques *only with test data*, and caveat emptor.