When GlusterFS mounts via FUSE, it uses the max_read=128KB mount option, so any larger request is split into 128 KB pieces. Raising this option makes large reads and writes faster, but it does not help with small files.
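To see why the request size matters, here is a rough back-of-the-envelope check (just a sketch, reusing the 128 KB per request and ~420 MB/s numbers from Xavi's mail below):

/* Per-request time budget at 128 KB (max_read) per FUSE request,
 * using the ~420 MB/s single-write figure reported below. */
#include <stdio.h>

int main(void)
{
    double throughput = 420.0 * 1024 * 1024; /* bytes/s (~420 MB/s)      */
    double request_sz = 128.0 * 1024;        /* bytes per kernel request */

    double reqs_per_sec = throughput / request_sz; /* ~3360 requests/s   */
    double budget_us    = 1e6 / reqs_per_sec;      /* ~300 us/request    */

    printf("requests/s: %.0f, time budget per request: %.0f us\n",
           reqs_per_sec, budget_us);
    return 0;
}

With a single reader thread, anything that pushes per-request processing past that budget caps the throughput, no matter how fast the network or disks are.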
At 2013-03-11 18:49:47, "Xavier Hernandez" <xhernandez@xxxxxxxxxx> wrote:
>Hello,
>
>I've recently performed some tests with Gluster on a fast network (IP
>over InfiniBand) and got some unexpected results. It seems that
>mount/fuse becomes a bottleneck when the network and disk are very fast.
>
>I started with a simple distributed volume with 2 bricks mounted on a
>ramdisk to avoid possible disk bottlenecks (however, I repeated the tests
>with an SSD and, later, with a normal hard disk, and the results were the
>same, probably thanks to the performance translators). With this
>configuration, a single write reached a throughput of ~420 MB/s. That is
>well below the network limit, but for a single write it is quite
>acceptable. However, with two concurrent writes (carefully chosen so that
>each one goes to a different brick), the throughput was ~200 MB/s for
>each transfer. That was totally unexpected. As there was plenty of
>bandwidth available and no I/O limitation, I was expecting something near
>800 MB/s.
>
>In fact, any combination of concurrent writes always led to the same
>combined throughput of ~400 MB/s.
>
>Trying to determine the cause of this odd behavior, I noticed that
>mount/fuse uses a single thread to serve kernel requests: once a request
>is received, it is sent down the xlator stack, and additional requests
>are only read once the stack returns. This means that to reach 420 MB/s
>using 128 KB per request (the current maximum block size), it needs to
>serve at least 3360 requests per second, i.e. it has at most ~300 us to
>process each request. Given that every translator allocates memory and
>makes some system calls, it is quite possible that it really does take
>~300 us to serve each request.
>
>To test this, I added performance/io-threads just below mount/fuse. This
>queues each request to a different thread, freeing the reader thread to
>fetch the next request well before those 300 us have elapsed, which
>should improve the concurrent-write case.
>
>The results are good. With this simple modification, 2 concurrent writes
>performed at ~300 MB/s each. However, the throughput of a single write
>dropped to ~250 MB/s. In any case, this solution is not valid because
>this configuration has some incompatibilities and some things do not
>work well (for example, a simple 'ls' does not show all the files).
>
>I then modified the mount/fuse xlator to start several threads to serve
>kernel requests. With this modification everything seems to work as
>expected and the throughput is considerably better: a single write still
>performs at 420 MB/s, and 2 concurrent writes reach 330 MB/s. In fact,
>any combination of 2 or more concurrent writes has a combined throughput
>of ~650 MB/s.
>
>However, a replicated volume does not improve at all, and I'm not sure
>why. There seems to be some kind of serialization point in cluster/afr:
>a single write has a throughput of ~175 MB/s, and 2 concurrent writes
>~85 MB/s. I'll have to investigate this further.
>
>Does all this make sense?
>
>Is this something worth investing more time in?
>
>Regards,
>
>Xavi
>
>_______________________________________________
>Gluster-devel mailing list
>Gluster-devel@xxxxxxxxxx
>https://lists.nongnu.org/mailman/listinfo/gluster-devel
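For reference, here is a minimal pthread sketch of the multi-threaded reader idea Xavi describes. It is not the actual mount/fuse patch; read_one_request() and process_request() are hypothetical stand-ins for reading from /dev/fuse and walking the xlator stack. The point is only that with several readers, a ~300 us request stalls one thread instead of the whole request loop.

/* Sketch: several reader threads pulling and processing requests.
 * Build with: gcc -pthread sketch.c */
#include <pthread.h>
#include <stdio.h>
#include <unistd.h>

#define NUM_READERS 4

static int read_one_request(int *req)
{
    /* Stand-in for reading from /dev/fuse: hand out a few request ids. */
    static int next = 0;
    static pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;
    pthread_mutex_lock(&lock);
    *req = next < 16 ? next++ : -1;
    pthread_mutex_unlock(&lock);
    return *req >= 0 ? 0 : -1;
}

static void process_request(int req)
{
    /* Stand-in for the ~300 us spent walking the xlator stack. */
    usleep(300);
    printf("thread %lu handled request %d\n",
           (unsigned long)pthread_self(), req);
}

static void *reader_loop(void *arg)
{
    (void)arg;
    int req;
    while (read_one_request(&req) == 0)
        process_request(req);
    return NULL;
}

int main(void)
{
    pthread_t readers[NUM_READERS];

    for (int i = 0; i < NUM_READERS; i++)
        pthread_create(&readers[i], NULL, reader_loop, NULL);
    for (int i = 0; i < NUM_READERS; i++)
        pthread_join(readers[i], NULL);
    return 0;
}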