On Sunday, 06 February 2011 at 16:35:45, Claudio Baeza Retamal wrote:

Hi.

> Dear friends,
>
> I have several stability and reliability problems in a small-to-medium
> sized cluster. My configuration is the following:
>
> 66 compute nodes (IBM iDataplex, X5550, 24 GB RAM)
> 1 access node (front end)
> 1 master node (queue manager and monitoring)
> 2 I/O servers with GlusterFS configured in distributed mode (4 TB in
> total)
>
> All computers have a dual-port Mellanox ConnectX QDR (40 Gbps) HCA
> 1 QLogic 12800-180 switch, with 7 leaf modules of 24 ports each and
> two double spines with QSFP plugs
>
> CentOS 5.5, with xCAT as cluster manager
> OFED 1.5.1
> Gluster 3.1.1 over InfiniBand

I have a smaller but fairly similar setup, and am facing the same issues
as Claudio:

- 1 frontend node (2 Intel Xeon 5420, 16 GB DDR2 ECC RAM, 4 TB of raw
  disk space) with 2 "Mellanox Technologies MT26418 [ConnectX VPI
  PCIe 2.0 5GT/s - IB DDR]"
- 1 storage node (2 Intel Xeon 5420, 24 GB DDR2 ECC RAM, 8 TB of raw
  disk space) with 2 "Mellanox Technologies MT26418 [ConnectX VPI
  PCIe 2.0 5GT/s - IB DDR]"
- 22 compute nodes (2 Intel Xeon 5420, 16 GB DDR2 ECC RAM, 750 GB of raw
  disk space) with 1 "InfiniBand: Mellanox Technologies MT25204
  [InfiniHost III Lx HCA]"

Each compute node has a /glstfs partition of 615 GB; together these
serve a Gluster volume of ~3.1 TB, mounted as /scratch on all nodes and
the frontend, using the stock GlusterFS 3.0.5 packages from Debian
Squeeze 6.0.

> When the cluster is fully loaded with applications that use MPI
> heavily, in combination with other applications that do a lot of file
> system I/O, GlusterFS stops working.
> Also, when GenDB runs the InterProScan bioinformatics application with
> 128 or more jobs, GlusterFS dies or disconnects clients randomly, so
> some applications shut down because they can no longer see the file
> system.
>
> This does not happen with Gluster over TCP (1 Gbps Ethernet), and it
> does not happen with Lustre 1.8.5 over InfiniBand either; under the
> same conditions Lustre works fine.
>
> My question is: does any documentation exist with more specific
> information on tuning GlusterFS?
>
> I have only found basic information on configuring Gluster, and
> nothing deeper (i.e. for experts). I think there must be some option
> for handling this situation in GlusterFS; moreover, other people
> should be having the same problems, since we replicated the
> configuration at another site with the same results.
> Perhaps the question is really about Gluster scalability: how many
> clients are recommended per Gluster server when using RDMA over an
> InfiniBand fabric at 40 Gbps?
>
> I would appreciate any help. I want to use Gluster, but stability and
> reliability are very important for us.

Perhaps I have "solved" it, by taking the first node listed in the
client file '/etc/glusterfs/glusterfs.vol' out of the executing queue
(a sketch of how follows below). This is what I *think* is the reason it
worked: I can't find it now, but I saw in the 3.0 docs that "... the
first hostname found in the client config file acts as a lock server
for the whole volume ...". In other words, the first hostname found in
the client config coordinates the locking/unlocking of files in the
whole volume. Taken out of the queue, that node does not accept any
jobs, and can dedicate its processing power solely to being a 'lock
server'.

It may well be the case that Gluster is not yet as optimized for
InfiniBand as it is for Ethernet, too; I just can't say. I am also
unable to find out how to specify something like "node n is a lock
server for nodes a, b, c, d" in the Gluster config.
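To be concrete about the workaround: the exact commands depend on your
batch system, but with a Torque/PBS-style scheduler (and 'node01' as a
made-up name for the first host in the vol file) draining the node would
look roughly like this:

    # mark the node offline so the scheduler stops placing jobs on it
    pbsnodes -o node01
    # list down/offline nodes to verify its new state
    pbsnodes -l
    # clear the offline mark when you want it back in service
    pbsnodes -c node01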
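And for reference, a minimal sketch of the kind of client vol file I
mean, in the 3.0 format (the host names 'stor01' and 'stor02' and the
exported subvolume name 'brick' are invented; a real file has one
protocol/client volume per server). As far as I can tell, the only
handle on which host plays lock server is the ordering: per the doc
snippet above, the first protocol/client volume listed takes that role.

    # /etc/glusterfs/glusterfs.vol (client side, GlusterFS 3.0.x)
    volume client1                    # listed first => acts as lock server
      type protocol/client
      option transport-type ib-verbs  # RDMA transport; 'tcp' for Ethernet
      option remote-host stor01
      option remote-subvolume brick
    end-volume

    volume client2
      type protocol/client
      option transport-type ib-verbs
      option remote-host stor02
      option remote-subvolume brick
    end-volume

    volume scratch                    # distribute files across both bricks
      type cluster/distribute
      subvolumes client1 client2
    end-volume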
Does anybody know if this is possible? I hope this helps you somehow,
and helps improve Gluster performance over IB/RDMA.
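P.S. Claudio, since you are on 3.1.1: I cannot point you to expert-level
tuning documentation either, but 3.1 does expose some tunables through
the gluster CLI, so you are not limited to the basic guides. As a hedged
example (I have not tried this on 3.1 myself, and 'scratch' is a
placeholder volume name), raising the ping timeout is one knob people
reach for when clients get dropped under heavy load:

    # show the volume and any options already set
    gluster volume info scratch
    # give busy servers more time before clients declare them dead
    # (the default is 42 seconds)
    gluster volume set scratch network.ping-timeout 60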