There's one thing obviously wrong in the previous email: GlusterFS is
able to use more than one CPU core, because we are running 4x4 I/O
threads on each node. I got the 99% number from ganglia output, which
is not correct. Sorry for that.

For reference, I've appended below the quoted message a rough sketch of
the volume files, the script we use to trigger the heal, and a quick
bandwidth sanity check.

- Wei

Wei Dong wrote:
> Hi All,
>
> I'm experiencing an extremely slow auto-heal rate with glusterfs, and
> I want to hear from you guys whether this seems reasonable or whether
> something's wrong.
>
> I did a reconfiguration of the glusterfs running on our lab cluster.
> Originally we had 25 nodes, each providing one 1.5T SATA disk, with
> the data not replicated. We have now expanded the cluster to 66 nodes,
> and I decided to use all 4 disks of each node and have the data
> replicated 3 times on different machines. What I did was leave the
> original data untouched and reconfigure glusterfs so that the original
> data volumes are paired up with new, empty mirror volumes. On the
> server side, each node exports one volume for each of its 4 disks,
> with an io-threads translator with 4 threads on top of each disk and
> no other performance translators. On the client side, which is mounted
> on each of the 66 nodes, I group all the volumes into mirrors of 3
> volumes each, then aggregate the mirrors with one DHT translator, and
> put a write-behind translator with window-size 1MB on top of that.
>
> The files are all images, roughly 200K each.
>
> To trigger auto-heal, I split a list of all the files (saved before
> the reconfiguration) among the 66 nodes and have 4 threads on each
> node running the stat command on the files. The overall rate is about
> 50 files per second, which I think is very low. And while the
> auto-heal is running, operations like cd and ls on the glusterfs
> client become extremely slow, each taking about a minute to finish.
>
> On
> http://www.gluster.org/docs/index.php/GlusterFS_2.0_I/O_Benchmark_Results,
> which uses 5 servers with one RAID0 disk each and 5 threads running on
> 3 clients, about 61K 1MB files can be created within 10 minutes, an
> average rate of about 130 files/second. On our cluster, the glusterfsd
> processes are each taking about 99% CPU time (we have 8 cores per
> node, but it seems glusterfsd is able to use only one).
>
> Their only advantage is that they use RAID0 disks and that their
> clients and servers are separate machines. Other than that, we have
> more servers/clients, more disks per node, and I have configured
> io-threads and write-behind (which I don't think helps auto-heal), and
> our files are only about 1/5 their size. Even counting each file as 3
> for replication, I'm only achieving throughput similar to a much
> smaller cluster. I don't know why this is, and the following are my
> hypotheses:
>
> 1. The overall throughput of glusterfs doesn't scale up to this many
> nodes;
> 2. The auto-heal operation is slower than creating the files from
> scratch;
> 3. Glusterfs is CPU bound -- it seems to be the case, but I'm
> unwilling to accept that a filesystem is CPU bound;
> 4. The network bandwidth is saturated. We have 1-gigabit ethernet on
> the nodes, which sit on two racks connected by a Foundry switch with
> 4 gigabit of aggregate inter-rack bandwidth;
> 5. Replication is slow.
>
> Finally, the good news is that all the server and client processes
> have been running for about 2 hours under stress, which I didn't
> expect in the beginning. Good job, you glusterfs people!
>
> We are desperate for shared storage, and I'm eager to hear any
> suggestions you have to make glusterfs perform better.
>
> - Wei
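Here is roughly what the volume files look like. Hostnames, paths, and
volume names below are illustrative rather than copied from our actual
configs, so treat this as a sketch of the stack described above, not
the exact files.

Server side (the posix/io-threads pair is repeated for each of the 4
disks on every node):

  volume disk1
    type storage/posix
    option directory /export/disk1
  end-volume

  volume disk1-io
    type performance/io-threads
    option thread-count 4
    subvolumes disk1
  end-volume

  # ... same pair for disk2, disk3, disk4 ...

  volume server
    type protocol/server
    option transport-type tcp
    option auth.addr.disk1-io.allow *
    # ... one auth line per exported disk ...
    subvolumes disk1-io disk2-io disk3-io disk4-io
  end-volume

Client side (66 nodes x 4 disks = 264 bricks, grouped into 88 three-way
mirrors):

  volume node01-disk1
    type protocol/client
    option transport-type tcp
    option remote-host node01
    option remote-subvolume disk1-io
  end-volume

  # ... one protocol/client volume per exported disk ...

  volume mirror-0
    type cluster/replicate
    subvolumes node01-disk1 node02-disk1 node03-disk1
  end-volume

  # ... mirror-1 through mirror-87 ...

  volume dht
    type cluster/distribute
    subvolumes mirror-0 mirror-1 mirror-2  # ... through mirror-87
  end-volume

  volume wb
    type performance/write-behind
    option window-size 1MB
    subvolumes dht
  end-volume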
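The heal trigger on each node is nothing fancier than this (chunk.1
through chunk.4 are made-up names for the four shares of that node's
slice of the saved file list):

  #!/bin/sh
  # stat-ing a file through the glusterfs mount is what triggers
  # replicate's self-heal on it, so we just walk the list with four
  # parallel workers.
  for i in 1 2 3 4; do
      while read f; do
          stat "$f" > /dev/null
      done < chunk.$i &
  done
  wait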
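One back-of-the-envelope check on hypothesis 4: at the observed rate,
healing moves roughly 50 files/s x 200 KB x 3 (one read of the good
copy plus writes of the two new copies) ~= 30 MB/s ~= 240 Mbit/s over
the whole cluster. Spread across 66 nodes, that is only a few Mbit/s
per gigabit NIC, and even if every byte crossed between the racks it
would still be well under the 4-gigabit trunk, so plain bandwidth
saturation looks unlikely unless the heal traffic is very unevenly
distributed.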