When glusterfs mounts through fuse, it uses the max_read=128KB option, so
any big request gets split into 128KB pieces. Tuning that option can make
big reads and writes faster, but it does not help with small files.
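
As a rough illustration only (this is not glusterfs code; the constant
and helper name are made up for the example), a tiny C sketch of how a
request bigger than max_read ends up as several kernel requests:

    /* Rough sketch, not glusterfs code: a request larger than max_read
     * (128 KiB here, matching the option above) is broken into several
     * smaller kernel requests before the client ever sees it. */
    #include <stdio.h>
    #include <stddef.h>

    #define MAX_READ (128 * 1024)     /* bytes per FUSE request */

    /* made-up helper: how many requests a read of 'total' bytes needs */
    static size_t requests_needed(size_t total)
    {
        return (total + MAX_READ - 1) / MAX_READ;
    }

    int main(void)
    {
        size_t one_mib = 1024 * 1024;
        printf("a 1 MiB read becomes %zu requests of up to 128 KiB\n",
               requests_needed(one_mib));   /* prints 8 */
        return 0;
    }
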
At 2013-03-11 18:49:47, "Xavier Hernandez" <xhernandez@xxxxxxxxxx> wrote:
>Hello,
>
>I've recently performed some tests with gluster on a fast network (IP
>over infiniband) and got some unexpected results. It seems that
>mount/fuse is becoming a bottleneck when the network and disk are very fast.
>
>I started with a simple distributed volume with 2 bricks mounted on a
>ramdisk to avoid possible disk bottlenecks (however I repeated the tests
>with an SSD and, later, with a normal hard disk and the results were the
>same, probably due to the good work of performance translators). With
>this configuration, a single write reached a throughput of ~420 MB/s.
>It's way below the maximum network limit, but for a single write it's
>quite acceptable. However with two concurrent writes (carefully chosen
>so that each one goes to a different brick), the throughput was ~200
>MB/s (for each transfer). That was totally unexpected. As there was
>plenty of bandwidth available and no I/O limitation, I was expecting
>something near 800 MB/s.
>
>In fact, any combination of concurrent writes always led to the same
>combined throughput of ~400 MB/s.
>
>Trying to determine the cause of this odd behavior, I noticed that
>mount/fuse uses a single thread to serve kernel requests, and once a
>request is received, it is sent down the xlator stack to process it,
>only reading additional requests once the stack returns. This means that
>to reach a 420 MB/s throughput using 128KB per request (the current
>maximum block size), it needs to serve at least 3360 requests per
>second. In other words, it has at most ~300 us to process each
>request. If we take
>into account that every translator will allocate memory, and do some
>system calls, it's quite possible that it really takes 300 us to serve
>each request.
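
Just to make that arithmetic explicit, a few lines of C that reproduce
the numbers quoted above (420 MB/s is taken as 420 MiB/s, which is what
gives the 3360 requests/s figure):

    /* Trivial check of the figures quoted above. */
    #include <stdio.h>

    int main(void)
    {
        double throughput = 420.0 * 1024 * 1024;  /* 420 MB/s, read as MiB/s  */
        double request_sz = 128.0 * 1024;         /* 128 KiB per FUSE request */

        double reqs_per_sec = throughput / request_sz;  /* 3360 requests/s    */
        double usec_per_req = 1e6 / reqs_per_sec;       /* ~298 us, i.e. ~300 */

        printf("%.0f requests/s -> %.0f us per request\n",
               reqs_per_sec, usec_per_req);
        return 0;
    }
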
>
>To see if this is the case, I added the performance/io-threads just
>below the mount/fuse. This would queue each request to a different
>thread, freeing the current one to read the next request well before
>300 us have passed. This should improve the concurrent writes case.
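
To make the idea concrete, here is a rough standalone C sketch of that
kind of handoff. It is not the io-threads xlator itself, just a plain
pthread queue with made-up names and sizes, and a usleep() standing in
for the trip down the xlator stack:

    /* Conceptual sketch only: a reader hands each request to a bounded
     * queue and immediately goes back to reading, while worker threads
     * pop requests and "process" them. */
    #include <pthread.h>
    #include <stdio.h>
    #include <unistd.h>

    #define QUEUE_SZ 64
    #define WORKERS   4
    #define REQUESTS 16

    static int queue[QUEUE_SZ];
    static int head, tail, count, done;
    static pthread_mutex_t lock      = PTHREAD_MUTEX_INITIALIZER;
    static pthread_cond_t  not_empty = PTHREAD_COND_INITIALIZER;
    static pthread_cond_t  not_full  = PTHREAD_COND_INITIALIZER;

    static void enqueue(int req)
    {
        pthread_mutex_lock(&lock);
        while (count == QUEUE_SZ)
            pthread_cond_wait(&not_full, &lock);
        queue[tail] = req;
        tail = (tail + 1) % QUEUE_SZ;
        count++;
        pthread_cond_signal(&not_empty);
        pthread_mutex_unlock(&lock);
    }

    static void *worker(void *arg)
    {
        (void)arg;
        for (;;) {
            pthread_mutex_lock(&lock);
            while (count == 0 && !done)
                pthread_cond_wait(&not_empty, &lock);
            if (count == 0 && done) {
                pthread_mutex_unlock(&lock);
                return NULL;
            }
            int req = queue[head];
            head = (head + 1) % QUEUE_SZ;
            count--;
            pthread_cond_signal(&not_full);
            pthread_mutex_unlock(&lock);

            usleep(300);   /* stand-in for the xlator stack, ~300 us */
            printf("worker served request %d\n", req);
        }
    }

    int main(void)
    {
        pthread_t tids[WORKERS];
        for (int i = 0; i < WORKERS; i++)
            pthread_create(&tids[i], NULL, worker, NULL);

        /* the "fuse reader": hand off and immediately read the next one */
        for (int req = 0; req < REQUESTS; req++)
            enqueue(req);

        pthread_mutex_lock(&lock);
        done = 1;
        pthread_cond_broadcast(&not_empty);
        pthread_mutex_unlock(&lock);

        for (int i = 0; i < WORKERS; i++)
            pthread_join(tids[i], NULL);
        return 0;
    }
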
>
>The results are good. Using this simple modification, 2 concurrent
>writes performed at ~300 MB/s each. However, the throughput for a
>single write dropped to ~250 MB/s. In any case, this solution is not
>valid because this configuration causes some incompatibility and some
>things do not work well (for example, a simple 'ls' does not show all
>the files).
>
>Then I modified the mount/fuse xlator to start some threads to serve
>kernel requests. With this modification all seems to work as expected
>and throughput is quite better: a single write still performs at 420
>MB/s, and 2 concurrent writes reach 330 MB/s. In fact, any combination
>of 2 or more concurrent writes has a combined throughput of ~650 MB/s.
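
Again, purely as an illustration of the approach (not the actual
mount/fuse patch), a simplified C sketch in which several threads run
the same read-and-dispatch loop; read_fuse_request() and dispatch()
are placeholders, not real glusterfs or fuse calls:

    /* Conceptual sketch: N threads each read kernel requests and send
     * them down the stack, so one slow request no longer blocks the
     * reading of the next one. */
    #include <pthread.h>
    #include <stdio.h>
    #include <unistd.h>

    #define READER_THREADS 4

    static int read_fuse_request(void)   /* placeholder for /dev/fuse read */
    {
        static int next;
        static pthread_mutex_t m = PTHREAD_MUTEX_INITIALIZER;
        pthread_mutex_lock(&m);
        int req = next < 32 ? next++ : -1;   /* -1 means "no more requests" */
        pthread_mutex_unlock(&m);
        return req;
    }

    static void dispatch(int req)         /* placeholder for the xlator stack */
    {
        usleep(300);                      /* pretend each request takes ~300 us */
        printf("served request %d on thread %lu\n",
               req, (unsigned long)pthread_self());
    }

    static void *fuse_reader(void *arg)
    {
        (void)arg;
        int req;
        while ((req = read_fuse_request()) != -1)
            dispatch(req);
        return NULL;
    }

    int main(void)
    {
        pthread_t tids[READER_THREADS];
        for (int i = 0; i < READER_THREADS; i++)
            pthread_create(&tids[i], NULL, fuse_reader, NULL);
        for (int i = 0; i < READER_THREADS; i++)
            pthread_join(tids[i], NULL);
        return 0;
    }
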
>
>However, a replicate volume does not improve at all. I'm not sure why.
>It seems that there must be some kind of serialization point in
>cluster/afr. A single write has a throughput of ~175 MB/s, and 2
>concurrent writes ~85 MB/s. I'll have to investigate this further.
>
>Does all this make sense?
>
>Is this something that would be worth investing more time in?
>
>Regards,
>
>Xavi
>
>_______________________________________________
>Gluster-devel mailing list
>Gluster-devel@xxxxxxxxxx
>https://lists.nongnu.org/mailman/listinfo/gluster-devel