Re: Improving real world performance by moving files closer to their target workloads

Hi Luke,
 It's good to hear that your university is looking into GlusterFS. A few tips inline.

On Thu, May 15, 2008 at 2:25 PM, Luke McGregor <luke@xxxxxxxxxxxxxxx> wrote:

> Hi
>
> I'm Luke McGregor, and I'm working on a project at the University of
> Waikato Computer Science Department to make some improvements to
> GlusterFS to improve performance for our specific application.

Understanding the I/O pattern of the application will generally help you tune the
filesystem for very good performance. It is worth looking into.


> We are
> implementing a fairly small cluster (90 machines currently) to use for
> large-scale computing projects. The cluster is being built using
> commodity hardware and connected to a gigabit Ethernet backbone with
> 10G uplinks between switches. Each node in the cluster will be
> responsible for both storage and workload processing. This is to be
> achieved with single SATA disks in the machines.
>
You can run both the server and the client in a single process to save the
overhead of context switching.
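
For example, a single spec file along these lines (hostnames, paths and volume
names are only illustrative, and the option names should be checked against the
release you are running) can be loaded by one glusterfs process so that it both
exports the local disk and connects to the other nodes, with no separate server
daemon:

volume brick                        # local storage exported by this node
  type storage/posix
  option directory /data/export     # example path
end-volume

volume server                       # makes the local brick visible to the other nodes
  type protocol/server
  option transport-type tcp/server
  option auth.ip.brick.allow *      # example only; restrict in a real deployment
  subvolumes brick
end-volume

volume remote1                      # client connection to one remote node (repeat per node)
  type protocol/client
  option transport-type tcp/client
  option remote-host 10.0.0.2       # example address
  option remote-subvolume brick
end-volume

The local brick and the remote client volumes can then be combined by a cluster
translator such as unify, as in the NUFA example further down.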


>
> We are currently experimenting with running GlusterFS over the nodes in
> the cluster to produce a single large filesystem. For my Honours
> research project I've been asked to look into making some improvements
> to GlusterFS to try to improve performance by moving the files within
> the GlusterFS volume closer to the node which is accessing the file.
>
You may look at the NUFA scheduler. We are thinking of a way to reduce the
overhead of spec file management for NUFA, which may come soon.
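
As a rough sketch (the volume names follow the example above, "ns" and "remote2"
are placeholders, and the exact NUFA option names should be checked against your
release), a unify volume using the NUFA scheduler would look something like this,
so that new files are created on the node's own brick first:

volume unify0
  type cluster/unify
  option namespace ns                     # dedicated namespace volume, defined elsewhere in the spec
  option scheduler nufa
  option nufa.local-volume-name brick     # prefer this node's own brick for new files
  subvolumes brick remote1 remote2
end-volume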


>
> What I was wondering is basically how hard it would be to write code
> to modify the metadata so that when a file is accessed it is then
> moved to the node from which it is accessed, and its location is
> updated in the metadata.
>
There is no metadata stored about the location of a file. I am also not sure why
you would want to keep moving files :O If a file is moved to another node when it
is accessed, what guarantees that it is not being accessed by two nodes at the
same time (which would mean two copies, and could lead to I/O errors from
GlusterFS)? You would also incur a lot of overhead in doing that. You may instead
think of using io-cache, or implementing HSM.
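
If the goal is mainly to serve repeated reads of the same files faster, a rough
sketch of stacking io-cache on top of the unified volume (volume names carried
over from the examples above; the cache size is only an illustration) would be:

volume iocache
  type performance/io-cache
  option cache-size 64MB          # example size; tune to the RAM available per node
  subvolumes unify0
end-volume

This keeps hot file data in memory on the accessing node without ever relocating
the file itself.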

-Amar

-- 
Amar Tumballi
Gluster/GlusterFS Hacker
[bulde on #gluster/irc.gnu.org]
http://www.zresearch.com - Commoditizing Super Storage!

