There's one thing obviously wrong in the previous email: GlusterFS is
able to use more than one CPU core, because we are running 4x4 I/O
threads on each node. I got the 99% number from ganglia output, which
is not correct. Sorry for that.

For reference, I've appended below the quoted message a rough sketch of
the volume files, the script we use to trigger the heal, and a quick
bandwidth sanity check.

- Wei

Wei Dong wrote:
> Hi All,
>
> I'm experiencing an extremely slow auto-heal rate with glusterfs, and
> I want to hear from you guys whether this seems reasonable or whether
> something's wrong.
>
> I did a reconfiguration of the glusterfs running on our lab cluster.
> Originally we had 25 nodes, each providing one 1.5T SATA disk, with
> the data not replicated. We have now expanded the cluster to 66 nodes,
> and I decided to use all 4 disks of each node and have the data
> replicated 3 times on different machines. What I did was leave the
> original data untouched and reconfigure glusterfs so that the original
> data volumes are paired up with new, empty mirror volumes. On the
> server side, each node exports one volume for each of its 4 disks,
> with an io-threads translator with 4 threads on top of each disk and
> no other performance translators. On the client side, which is mounted
> on each of the 66 nodes, I group all the volumes into mirrors of 3
> volumes each, then aggregate the mirrors with one DHT translator, and
> put a write-behind translator with window-size 1MB on top of that.
>
> The files are all images, roughly 200K each.
>
> To trigger auto-heal, I split a list of all the files (saved before
> the reconfiguration) among the 66 nodes and have 4 threads on each
> node running the stat command on the files. The overall rate is about
> 50 files per second, which I think is very low. And while the
> auto-heal is running, operations like cd and ls on the glusterfs
> client become extremely slow, each taking about a minute to finish.
>
> On
> http://www.gluster.org/docs/index.php/GlusterFS_2.0_I/O_Benchmark_Results,
> which uses 5 servers with one RAID0 disk each and 5 threads running on
> 3 clients, about 61K 1MB files can be created within 10 minutes, an
> average rate of about 130 files/second. On our cluster, the glusterfsd
> processes are each taking about 99% CPU time (we have 8 cores per
> node, but it seems glusterfsd is able to use only one).
>
> Their only advantage is that they use RAID0 disks and that their
> clients and servers are separate machines. Other than that, we have
> more servers/clients, more disks per node, and I have configured
> io-threads and write-behind (which I don't think helps auto-heal), and
> our files are only about 1/5 their size. Even counting each file as 3
> for replication, I'm only achieving throughput similar to a much
> smaller cluster. I don't know why this is, and the following are my
> hypotheses:
>
> 1. The overall throughput of glusterfs doesn't scale up to this many
> nodes;
> 2. The auto-heal operation is slower than creating the files from
> scratch;
> 3. Glusterfs is CPU bound -- it seems to be the case, but I'm
> unwilling to accept that a filesystem is CPU bound;
> 4. The network bandwidth is saturated. We have 1-gigabit ethernet on
> the nodes, which sit on two racks connected by a Foundry switch with
> 4 gigabit of aggregate inter-rack bandwidth;
> 5. Replication is slow.
>
> Finally, the good news is that all the server and client processes
> have been running for about 2 hours under stress, which I didn't
> expect in the beginning. Good job, you glusterfs people!
>
> We are desperate for shared storage, and I'm eager to hear any
> suggestions you have to make glusterfs perform better.
>
> - Wei
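Here is roughly what the volume files look like. Hostnames, paths, and
volume names below are illustrative rather than copied from our actual
configs, so treat this as a sketch of the stack described above, not
the exact files.

Server side (the posix/io-threads pair is repeated for each of the 4
disks on every node):

  volume disk1
    type storage/posix
    option directory /export/disk1
  end-volume

  volume disk1-io
    type performance/io-threads
    option thread-count 4
    subvolumes disk1
  end-volume

  # ... same pair for disk2, disk3, disk4 ...

  volume server
    type protocol/server
    option transport-type tcp
    option auth.addr.disk1-io.allow *
    # ... one auth line per exported disk ...
    subvolumes disk1-io disk2-io disk3-io disk4-io
  end-volume

Client side (66 nodes x 4 disks = 264 bricks, grouped into 88 three-way
mirrors):

  volume node01-disk1
    type protocol/client
    option transport-type tcp
    option remote-host node01
    option remote-subvolume disk1-io
  end-volume

  # ... one protocol/client volume per exported disk ...

  volume mirror-0
    type cluster/replicate
    subvolumes node01-disk1 node02-disk1 node03-disk1
  end-volume

  # ... mirror-1 through mirror-87 ...

  volume dht
    type cluster/distribute
    subvolumes mirror-0 mirror-1 mirror-2  # ... through mirror-87
  end-volume

  volume wb
    type performance/write-behind
    option window-size 1MB
    subvolumes dht
  end-volume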
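The heal trigger on each node is nothing fancier than this (chunk.1
through chunk.4 are made-up names for the four shares of that node's
slice of the saved file list):

  #!/bin/sh
  # stat-ing a file through the glusterfs mount is what triggers
  # replicate's self-heal on it, so we just walk the list with four
  # parallel workers.
  for i in 1 2 3 4; do
      while read f; do
          stat "$f" > /dev/null
      done < chunk.$i &
  done
  wait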
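One back-of-the-envelope check on hypothesis 4: at the observed rate,
healing moves roughly 50 files/s x 200 KB x 3 (one read of the good
copy plus writes of the two new copies) ~= 30 MB/s ~= 240 Mbit/s over
the whole cluster. Spread across 66 nodes, that is only a few Mbit/s
per gigabit NIC, and even if every byte crossed between the racks it
would still be well under the 4-gigabit trunk, so plain bandwidth
saturation looks unlikely unless the heal traffic is very unevenly
distributed.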