Hi All,

We are using glusterfs on our lab cluster as shared storage for a large number of image files, about 30 million at the moment. We use Hadoop for distributed computing, but we are reluctant to store small files in HDFS because of its low throughput on small files and its non-standard filesystem interface (e.g. we could not simply run convert on each image to produce a thumbnail if the files were stored in HDFS).

What we do now is store a list of paths to all the images in Hadoop and use Hadoop streaming to pipe the paths to a script, which then reads the images from the glusterfs filesystem and does the processing (a rough sketch of the script is in the P.S. below). This has been working for a while, as long as glusterfs doesn't hang, but the problem is that we basically lose all data locality: we have 66 nodes, so the chance that a needed file is on the local disk is only 1/66, and 65/66 of the file I/O has to go over the network, which makes me very uncomfortable.

I'm wondering if there is a better way of making glusterfs and Hadoop work together to take advantage of data locality. I know there is a nufa translator which gives high preference to the local drive. That is good enough if the assignment of files to nodes is fixed, but if we want to assign the processing of each file to the node that physically holds it, what interface should we use to find out the physical location of a file?

I appreciate all your suggestions.

- Wei Dong
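
P.S. In case it clarifies the setup, here is a rough sketch of what our streaming mapper does. The mount points, thumbnail size, and job options are made up for illustration; the real script does more than thumbnails.

    #!/usr/bin/env python
    # Hadoop streaming mapper: reads one image path per line from stdin
    # (the job input is the list of paths we keep in Hadoop), opens the
    # image on the glusterfs mount, and shells out to ImageMagick's
    # convert to write a thumbnail back onto glusterfs.
    #
    # Launched with something like:
    #   hadoop jar $STREAMING_JAR -D mapred.reduce.tasks=0 \
    #       -input image-paths -output thumb-logs \
    #       -mapper thumbnail_mapper.py -file thumbnail_mapper.py
    import os
    import subprocess
    import sys

    GLUSTER_MOUNT = "/mnt/gluster"          # glusterfs mount point on every node
    THUMB_DIR = "/mnt/gluster/thumbnails"   # output directory, also on glusterfs

    for line in sys.stdin:
        # each input line is a path relative to the glusterfs mount
        path = line.strip()
        if not path:
            continue
        src = os.path.join(GLUSTER_MOUNT, path)
        dst = os.path.join(THUMB_DIR, os.path.basename(path))
        # convert pulls the whole image over the network unless the file
        # happens to sit on the local brick -- this is the locality problem
        ret = subprocess.call(["convert", src, "-resize", "128x128", dst])
        # emit path<TAB>exit-code so failures show up in the job output
        print("%s\t%d" % (path, ret))

As you can see, the mapper has no idea where a file physically lives, so the read is remote 65 times out of 66.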