Hi All,

We are using glusterfs on our lab cluster as shared storage for a large number of image files, about 30 million at the moment. We use Hadoop for distributed computing, but we are reluctant to store small files in HDFS because of its low throughput on small files and its non-standard filesystem interface (e.g. we could not simply run convert on each image to produce a thumbnail if the files were stored in HDFS).

What we do now is store a list of paths to all the images in Hadoop and use Hadoop streaming to pipe the paths to a script, which then reads the images from the glusterfs filesystem and does the processing (a rough sketch of the script is in the P.S. below). This has been working for a while, as long as glusterfs doesn't hang, but the problem is that we basically lose all data locality: we have 66 nodes, so the chance that a needed file is on the local disk is only 1/66, and 65/66 of the file I/O has to go over the network, which makes me very uncomfortable.

I'm wondering if there is a better way of making glusterfs and Hadoop work together to take advantage of data locality. I know there is a nufa translator which gives high preference to the local drive. That is good enough if the assignment of files to nodes is fixed, but if we want to assign the processing of each file to the node that physically holds it, what interface should we use to find out the physical location of a file?

I appreciate all your suggestions.

- Wei Dong
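
P.S. In case it clarifies the setup, here is a rough sketch of what our streaming mapper does. The mount points, thumbnail size, and job options are made up for illustration; the real script does more than thumbnails.

    #!/usr/bin/env python
    # Hadoop streaming mapper: reads one image path per line from stdin
    # (the job input is the list of paths we keep in Hadoop), opens the
    # image on the glusterfs mount, and shells out to ImageMagick's
    # convert to write a thumbnail back onto glusterfs.
    #
    # Launched with something like:
    #   hadoop jar $STREAMING_JAR -D mapred.reduce.tasks=0 \
    #       -input image-paths -output thumb-logs \
    #       -mapper thumbnail_mapper.py -file thumbnail_mapper.py
    import os
    import subprocess
    import sys

    GLUSTER_MOUNT = "/mnt/gluster"          # glusterfs mount point on every node
    THUMB_DIR = "/mnt/gluster/thumbnails"   # output directory, also on glusterfs

    for line in sys.stdin:
        # each input line is a path relative to the glusterfs mount
        path = line.strip()
        if not path:
            continue
        src = os.path.join(GLUSTER_MOUNT, path)
        dst = os.path.join(THUMB_DIR, os.path.basename(path))
        # convert pulls the whole image over the network unless the file
        # happens to sit on the local brick -- this is the locality problem
        ret = subprocess.call(["convert", src, "-resize", "128x128", dst])
        # emit path<TAB>exit-code so failures show up in the job output
        print("%s\t%d" % (path, ret))

As you can see, the mapper has no idea where a file physically lives, so the read is remote 65 times out of 66.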