When GlusterFS mounts via FUSE, it uses the max_read=128KB mount option, so any larger request is split into 128 KB pieces. Raising this option makes large reads and writes faster, but it does not help with small files.
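To see why the request size matters, here is a rough back-of-the-envelope check (just a sketch, reusing the 128 KB per request and ~420 MB/s numbers from Xavi's mail below):

/* Per-request time budget at 128 KB (max_read) per FUSE request,
 * using the ~420 MB/s single-write figure reported below. */
#include <stdio.h>

int main(void)
{
    double throughput = 420.0 * 1024 * 1024; /* bytes/s (~420 MB/s)      */
    double request_sz = 128.0 * 1024;        /* bytes per kernel request */

    double reqs_per_sec = throughput / request_sz; /* ~3360 requests/s   */
    double budget_us    = 1e6 / reqs_per_sec;      /* ~300 us/request    */

    printf("requests/s: %.0f, time budget per request: %.0f us\n",
           reqs_per_sec, budget_us);
    return 0;
}

With a single reader thread, anything that pushes per-request processing past that budget caps the throughput, no matter how fast the network or disks are.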
At 2013-03-11 18:49:47, "Xavier Hernandez" <xhernandez@xxxxxxxxxx> wrote:
>Hello,
>
>I've recently performed some tests with Gluster on a fast network (IP
>over InfiniBand) and got some unexpected results. It seems that
>mount/fuse becomes a bottleneck when the network and disk are very fast.
>
>I started with a simple distributed volume with 2 bricks mounted on a
>ramdisk to avoid possible disk bottlenecks (however, I repeated the tests
>with an SSD and, later, with a normal hard disk, and the results were the
>same, probably thanks to the performance translators). With this
>configuration, a single write reached a throughput of ~420 MB/s. That is
>well below the network limit, but for a single write it is quite
>acceptable. However, with two concurrent writes (carefully chosen so that
>each one goes to a different brick), the throughput was ~200 MB/s for
>each transfer. That was totally unexpected. As there was plenty of
>bandwidth available and no I/O limitation, I was expecting something near
>800 MB/s.
>
>In fact, any combination of concurrent writes always led to the same
>combined throughput of ~400 MB/s.
>
>Trying to determine the cause of this odd behavior, I noticed that
>mount/fuse uses a single thread to serve kernel requests: once a request
>is received, it is sent down the xlator stack, and additional requests
>are only read once the stack returns. This means that to reach 420 MB/s
>using 128 KB per request (the current maximum block size), it needs to
>serve at least 3360 requests per second, i.e. it has at most ~300 us to
>process each request. Given that every translator allocates memory and
>makes some system calls, it is quite possible that it really does take
>~300 us to serve each request.
>
>To test this, I added performance/io-threads just below mount/fuse. This
>queues each request to a different thread, freeing the reader thread to
>fetch the next request well before those 300 us have elapsed, which
>should improve the concurrent-write case.
>
>The results are good. With this simple modification, 2 concurrent writes
>performed at ~300 MB/s each. However, the throughput of a single write
>dropped to ~250 MB/s. In any case, this solution is not valid because
>this configuration has some incompatibilities and some things do not
>work well (for example, a simple 'ls' does not show all the files).
>
>I then modified the mount/fuse xlator to start several threads to serve
>kernel requests. With this modification everything seems to work as
>expected and the throughput is considerably better: a single write still
>performs at 420 MB/s, and 2 concurrent writes reach 330 MB/s. In fact,
>any combination of 2 or more concurrent writes has a combined throughput
>of ~650 MB/s.
>
>However, a replicated volume does not improve at all, and I'm not sure
>why. There seems to be some kind of serialization point in cluster/afr:
>a single write has a throughput of ~175 MB/s, and 2 concurrent writes
>~85 MB/s. I'll have to investigate this further.
>
>Does all this make sense?
>
>Is this something worth investing more time in?
>
>Regards,
>
>Xavi
>
>_______________________________________________
>Gluster-devel mailing list
>Gluster-devel@xxxxxxxxxx
>https://lists.nongnu.org/mailman/listinfo/gluster-devel
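For reference, here is a minimal pthread sketch of the multi-threaded reader idea Xavi describes. It is not the actual mount/fuse patch; read_one_request() and process_request() are hypothetical stand-ins for reading from /dev/fuse and walking the xlator stack. The point is only that with several readers, a ~300 us request stalls one thread instead of the whole request loop.

/* Sketch: several reader threads pulling and processing requests.
 * Build with: gcc -pthread sketch.c */
#include <pthread.h>
#include <stdio.h>
#include <unistd.h>

#define NUM_READERS 4

static int read_one_request(int *req)
{
    /* Stand-in for reading from /dev/fuse: hand out a few request ids. */
    static int next = 0;
    static pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;
    pthread_mutex_lock(&lock);
    *req = next < 16 ? next++ : -1;
    pthread_mutex_unlock(&lock);
    return *req >= 0 ? 0 : -1;
}

static void process_request(int req)
{
    /* Stand-in for the ~300 us spent walking the xlator stack. */
    usleep(300);
    printf("thread %lu handled request %d\n",
           (unsigned long)pthread_self(), req);
}

static void *reader_loop(void *arg)
{
    (void)arg;
    int req;
    while (read_one_request(&req) == 0)
        process_request(req);
    return NULL;
}

int main(void)
{
    pthread_t readers[NUM_READERS];

    for (int i = 0; i < NUM_READERS; i++)
        pthread_create(&readers[i], NULL, reader_loop, NULL);
    for (int i = 0; i < NUM_READERS; i++)
        pthread_join(readers[i], NULL);
    return 0;
}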