On Fri, 14 Aug 2009, Anand Avati wrote:
I second that question.
Extended attributes are pretty much critical for Disco. It uses them to
decide where to execute tasks, to optimize data locality:
http://github.com/tuulos/disco/blob/c1d4ffadeba40af8a8547dd6afce562d267e464e/pydisco/disco/dfs/gluster.py#L36
If the extended attributes are really removed (I haven't upgraded yet to
2.0.6), what's the official way of finding out where files are physically
stored?
We removed the listing of Replicate's internal extended attribute records because we found that commands like 'rsync -X' would mess up and overwrite those attributes, taking the filesystem into an inconsistent state.
Ville, thanks for pointing that out. We were not aware that these extended attributes had found a new purpose for themselves this way :-) They were never intended to be used that way. But for the purpose you are describing, we have introduced the virtual extended attribute "trusted.glusterfs.location", which returns the hostname of the storage/posix volume on which the file resides. For now, this feature is available only in mainline.
http://git.gluster.com/?p=glusterfs.git;a=commit;h=5be3c142978257032bd11ad420382859fc204702
Great! I'll update our systems to the latest git snapshot.
In fact the above patch was brought in with the intention of making GlusterFS fit nicely into map/reduce frameworks in the future. Now that you mention that this "feature" was already being used and got broken in 2.0.6 (which we were not aware of), we'll get the "official way" of obtaining the hostname backported into 2.0.7. Note that the new method returns the server's hostname, not a volume name. So the gluster.py in disco.git might have to be modified to look for this "official" xattr first and then fall back to the old style.
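The lookup-with-fallback described above could be sketched roughly as follows. This is only an illustration: the old-style attribute name below is a placeholder, not the actual internal record name that gluster.py parses.

```python
import os

NEW_XATTR = "trusted.glusterfs.location"   # official virtual xattr (mainline patch)
OLD_XATTR = "trusted.afr.placeholder"      # stand-in for the old internal record name

def file_location(path):
    """Return the hostname storing `path`, trying the official xattr first
    and falling back to the old-style attribute; None if neither is readable."""
    for name in (NEW_XATTR, OLD_XATTR):
        try:
            return os.getxattr(path, name).decode()
        except (OSError, AttributeError):
            # attribute missing, xattrs unsupported, or non-Linux Python
            continue
    return None
```

On a plain local filesystem (no GlusterFS underneath) neither attribute exists, so the function simply returns None.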
A hostname is even better for us than the volume name. Currently the user has to provide a separate mapping for Disco that maps volume names to hostnames.
We would also like feedback from you about if/how you want the location of a file that resides on multiple servers (for example, Replicate could return multiple locations, and Stripe has the content distributed across servers, possibly replicated as well). How, and to what extent, do map/reduce frameworks make use of such information? Does record-level location make sense at all?
Yes, we need locations of all replicas for each file. The current
mechanism lists all replicas for each input, so Disco can resort to
replicas if the master copy fails.
It would be great if trusted.glusterfs.location could return a list of hostnames. The list should be ordered according to Gluster's preference for accessing the file, i.e. the second item should be the one that Gluster uses in case the master copy fails, and so on. This ensures that Disco can preserve data locality even if individual volumes fail.
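If the xattr were extended this way, one simple encoding would be a comma-separated, preference-ordered string, which the Disco side could consume as below. The encoding is purely an assumption here, not something GlusterFS implements.

```python
def replica_hosts(raw):
    """Split a hypothetical comma-separated, preference-ordered xattr value
    into a list of hostnames, dropping empty entries."""
    return [h.strip() for h in raw.decode().split(",") if h.strip()]

def pick_host(raw, alive):
    """Return the most-preferred hostname that is still reachable, else None."""
    for host in replica_hosts(raw):
        if host in alive:
            return host
    return None
```

With this, falling back from the master copy to the next replica is just a matter of iterating the list in order.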
Striped files are not supported by Disco directly, so it doesn't do
anything clever with them (yet). In general being able to query as much
information as possible about files is beneficial.
It has been a deliberate choice to keep the storage layer separate from
Disco. An upside of this design decision is that you're free to choose the
best storage layer for your problem domain. For instance, I'm positive
that Gluster is a good match for many ad hoc data analysis tasks and rapid
development in general. A downside is that coordination between the
storage layer and the computation layer isn't always optimal.
I became interested in Gluster because a custom translator seemed like a
reasonable way to bridge this gap. I was happy to notice that 95% of the
benefits could be achieved with default translators, without the burden of
maintaining a custom one.
I'm sure it would benefit everybody if Gluster could continue supporting systems built on top of it with minimal hassle by exposing ways to interact(*) with the system other than custom translators. In this respect, extended attributes and things like libglusterfs are really welcome features.
(*) In addition to querying the status of glusterfs (e.g. using extended attributes), it would be useful to _give_ information to Gluster as well. For instance, I now have to run two GlusterFS instances in parallel (inputfs and resultsfs in http://discoproject.org/doc/start/dfs.html), since only some directories need to be replicated (input data) whereas others are used over NUFA without replication (intermediate results). Disco could tag the latter temporary files with a special extended attribute, or by making a call to libglusterfs, so that Gluster would know that replication is not needed.
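From the Disco side, such tagging could be as small as setting one user xattr on each intermediate file. Everything below is hypothetical: no such attribute exists in GlusterFS, and a translator honoring it is only the suggestion being made here.

```python
import os

# Hypothetical tag name; a replication translator would have to be taught
# to skip files carrying it. Purely an illustration of the proposal.
NO_REPLICATE = "user.disco.no-replicate"

def tag_temporary(path):
    """Mark `path` as a temporary file that needs no replication.
    Returns True on success, False if xattrs are unavailable here."""
    try:
        os.setxattr(path, NO_REPLICATE, b"1")
        return True
    except (OSError, AttributeError):
        return False
```

Using the user.* namespace keeps the tag writable by unprivileged clients, unlike the trusted.* attributes discussed earlier in the thread.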
Ville