I am afraid I am not experienced enough to be much more useful.
My guess is that, since the client writes synchronously to all replica nodes (to keep data coherent), it can only go as fast as the slowest brick.
Small files are also often slow because the TCP window doesn't have time to grow.
That's why I gave you some kernel tuning to help the TCP window grow faster.
Are you using the latest version (3.7.1)?
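For what it's worth, both points are quick to check on every client and server; just a sketch, assuming the gluster CLI is installed and using the sysctl keys from the tuning quoted further down:

# Installed GlusterFS version (should match across clients and servers)
gluster --version | head -n 1
# Verify the TCP buffer tuning actually took effect
sysctl net.core.rmem_max net.core.wmem_max net.ipv4.tcp_rmem net.ipv4.tcp_wmem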
2015-06-20 11:01 GMT+02:00 Geoffrey Letessier <geoffrey.letessier@xxxxxxx>:
Hello Mathieu,

Thanks for replying.

Previously I had never noticed such throughput figures (around 1 GB/s for one big file), but... The situation with a "big" set of small files was never amazing, but not as bad as today. The problem seems to depend exclusively on the size of each file.

"Proof":

[root@node056 tmp]# dd if=/dev/zero of=masterfile bs=1M count=1000
1000+0 records in
1000+0 records out
1048576000 bytes (1.0 GB) copied, 2.09139 s, 501 MB/s
[root@node056 tmp]# time split -b 1000000 -a 12 masterfile     # 1MB per file
real    0m42.841s
user    0m0.004s
sys     0m1.416s
[root@node056 tmp]# rm -f xaaaaaaaaa* && sync
[root@node056 tmp]# time split -b 5000000 -a 12 masterfile     # 5MB per file
real    0m17.801s
user    0m0.008s
sys     0m1.396s
[root@node056 tmp]# rm -f xaaaaaaaaa* && sync
[root@node056 tmp]# time split -b 10000000 -a 12 masterfile    # 10MB per file
real    0m9.686s
user    0m0.008s
sys     0m1.451s
[root@node056 tmp]# rm -f xaaaaaaaaa* && sync
[root@node056 tmp]# time split -b 20000000 -a 12 masterfile    # 20MB per file
real    0m9.717s
user    0m0.003s
sys     0m1.399s
[root@node056 tmp]# rm -f xaaaaaaaaa* && sync
[root@node056 tmp]# time split -b 1000000 -a 12 masterfile     # 1MB per file (again)
real    0m40.283s
user    0m0.007s
sys     0m1.390s
[root@node056 tmp]# rm -f xaaaaaaaaa* && sync

The larger the generated files, the better the performance (I/O throughput and run time)... The ifstat output is consistent on both the client/node and the server side.

A new test:

[root@node056 tmp]# dd if=/dev/zero of=masterfile bs=1M count=10000
10000+0 records in
10000+0 records out
10485760000 bytes (10 GB) copied, 23.0044 s, 456 MB/s
[root@node056 tmp]# rm -f xaaaaaaaaa* && sync
[root@node056 tmp]# time split -b 10000000 -a 12 masterfile    # 10MB per file
real    1m43.216s
user    0m0.038s
sys     0m13.407s

So the per-file performance is the same (despite there being 10x more files).

So I don't understand why, to get the best performance, I need to create files with a size of 10MB or more.

Here are my volume's reconfigured options:

performance.cache-max-file-size: 64MB
performance.read-ahead: on
performance.write-behind: on
features.quota-deem-statfs: on
performance.stat-prefetch: on
performance.flush-behind: on
features.default-soft-limit: 90%
features.quota: on
diagnostics.brick-log-level: CRITICAL
auth.allow: localhost,127.0.0.1,10.*
nfs.disable: on
performance.cache-size: 1GB
performance.write-behind-window-size: 4MB
performance.quick-read: on
performance.io-cache: on
performance.io-thread-count: 64
nfs.enable-ino32: off

It's not a local cache problem because:
1- it's disabled in my mount command:
   mount -t glusterfs -o transport=rdma,direct-io-mode=disable,enable-ino32 ib-storage1:vol_home /home
2- I also ran my tests while playing with /proc/sys/vm/drop_caches
3- I see the same ifstat output on both the client and server side, which is consistent with the computed bandwidth (file sizes / time, taking the replication into account).

I don't think it's an InfiniBand network problem either, but here are my [non-default] settings: connected mode with MTU set to 65520.

Do you confirm my feeling? If yes, do you have any other idea?

Thanks again and thanks in advance,
Geoffrey
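PS: for reference, the whole sweep above can be scripted in one go; this is just a sketch that mirrors the commands already shown (the chunk sizes and working directory are the ones used above):

#!/bin/bash
# Re-run the split benchmark for several chunk sizes on the gluster mount.
# Mirrors the manual commands above; run it from the same test directory.
dd if=/dev/zero of=masterfile bs=1M count=1000
for size in 1000000 5000000 10000000 20000000; do
    echo "--- chunk size: ${size} bytes ---"
    time split -b ${size} -a 12 masterfile
    rm -f xaaaaaaaaa* && sync
done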
-----------------------------------------------
Geoffrey Letessier
IT manager & systems engineer
CNRS - UPR 9080 - Laboratoire de Biochimie Théorique
Institut de Biologie Physico-Chimique
13, rue Pierre et Marie Curie - 75005 Paris
Tel: 01 58 41 50 93 - eMail: geoffrey.letessier@xxxxxxx

On 20 June 2015 at 09:12, Mathieu Chateau <mathieu.chateau@xxxxxxx> wrote:

Hello,

For the replicated volume, is this a new issue or did you just not notice it before? Is the baseline the same as before?

I also see slowness with small files / many files. For now I could only improve things with:

At the gluster level:
gluster volume set myvolume performance.io-thread-count 16
gluster volume set myvolume performance.cache-size 1GB
gluster volume set myvolume nfs.disable on
gluster volume set myvolume readdir-ahead enable
gluster volume set myvolume read-ahead disable

At the network level (client and server; I don't have InfiniBand):
sysctl -w vm.swappiness=0
sysctl -w net.core.rmem_max=67108864
sysctl -w net.core.wmem_max=67108864
# increase Linux autotuning TCP buffer limit to 32MB
sysctl -w net.ipv4.tcp_rmem="4096 87380 33554432"
sysctl -w net.ipv4.tcp_wmem="4096 65536 33554432"
# increase the length of the processor input queue
sysctl -w net.core.netdev_max_backlog=30000
# recommended default congestion control is htcp
sysctl -w net.ipv4.tcp_congestion_control=htcp

But it's still really slow, even if it is better.

2015-06-20 2:34 GMT+02:00 Geoffrey Letessier <geoffrey.letessier@xxxxxxx>:

Re,

For comparison, here is the output of the same script run on a distributed-only volume (2 of the 4 servers previously described, 2 bricks each):

########## UNTAR time consumed ##########
real    1m44.698s
user    0m8.891s
sys     0m8.353s

########## DU time consumed ##########
554M    linux-4.1-rc6
real    0m21.062s
user    0m0.100s
sys     0m1.040s

########## FIND time consumed ##########
52663
real    0m21.325s
user    0m0.104s
sys     0m1.054s

########## GREP time consumed ##########
7952
real    0m43.618s
user    0m0.922s
sys     0m3.626s

########## TAR time consumed ##########
real    0m50.577s
user    0m29.745s
sys     0m4.086s

########## RM time consumed ##########
real    0m41.133s
user    0m0.171s
sys     0m2.522s

The performances are amazingly different!

Geoffrey
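For reference, gluster's built-in profiling (the same mechanism behind the profile.txt attached further down the thread) can be wrapped around this bench on each volume to see where the time goes; the volume name below is just illustrative:

gluster volume profile vol_home start
# ... run the untar/du/find/grep/tar/rm bench on the mount ...
gluster volume profile vol_home info > profile_vol_home.txt
gluster volume profile vol_home stop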
-----------------------------------------------
Geoffrey Letessier
IT manager & systems engineer
CNRS - UPR 9080 - Laboratoire de Biochimie Théorique
Institut de Biologie Physico-Chimique
13, rue Pierre et Marie Curie - 75005 Paris
Tel: 01 58 41 50 93 - eMail: geoffrey.letessier@xxxxxxx

On 20 June 2015 at 02:12, Geoffrey Letessier <geoffrey.letessier@xxxxxxx> wrote:

<benches.txt>

Dear all,

I just noticed that on the main volume of my HPC cluster my I/O performance has become impressively poor. Doing some file operations on a compressed Linux kernel source archive (roughly 80MB, with 52,000 files inside), the untar operation alone can take more than half an hour, as you can read below:

########## UNTAR time consumed ##########
real    32m42.967s
user    0m11.783s
sys     0m15.050s

########## DU time consumed ##########
557M    linux-4.1-rc6
real    0m25.060s
user    0m0.068s
sys     0m0.344s

########## FIND time consumed ##########
52663
real    0m25.687s
user    0m0.084s
sys     0m0.387s

########## GREP time consumed ##########
7952
real    2m15.890s
user    0m0.887s
sys     0m2.777s

########## TAR time consumed ##########
real    1m5.551s
user    0m26.536s
sys     0m2.609s

########## RM time consumed ##########
real    2m51.485s
user    0m0.167s
sys     0m1.663s

For information, this volume is a distributed-replicated one, composed of 4 servers with 2 bricks each. Each brick is a 12-drive RAID6 vdisk with good native performance (around 1.2 GB/s).

In comparison, when I use dd to generate a 100GB file on the same volume, my write throughput is around 1 GB/s (client side) and 500 MB/s (server side) because of the replication:

Client side:
[root@node056 ~]# ifstat -i ib0
       ib0
 KB/s in  KB/s out
 3251.45  1.09e+06
 3139.80  1.05e+06
 3185.29  1.06e+06
 3293.84  1.09e+06
...

Server side:
[root@lucifer ~]# ifstat -i ib0
       ib0
 KB/s in  KB/s out
561818.1   1746.42
560020.3   1737.92
526337.1   1648.20
513972.7   1613.69
...

dd command:
[root@node056 ~]# dd if=/dev/zero of=/home/root/test.dd bs=1M count=100000
100000+0 records in
100000+0 records out
104857600000 bytes (105 GB) copied, 202.99 s, 517 MB/s

So this issue does not seem to come from the network (which is InfiniBand in this case).

You can find a set of files in the attachments:
- mybench.sh: the bench script
- benches.txt: output of my "bench"
- profile.txt: gluster volume profile during the "bench"
- vol_status.txt: gluster volume status
- vol_info.txt: gluster volume info

Can someone help me fix it? It's very critical because this volume is on an HPC cluster in production.

Thanks in advance,
Geoffrey
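A side note on the dd figure above: to be sure the page cache is not inflating it, the write can be forced to disk and the cache dropped between runs; just a sketch, where conv=fsync and the drop_caches step are the usual precautions and the path is the same test file as above:

# Account for the final flush to the bricks in the measured time
dd if=/dev/zero of=/home/root/test.dd bs=1M count=100000 conv=fsync
# Drop the client page cache before a follow-up test
sync && echo 3 > /proc/sys/vm/drop_caches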
-----------------------------------------------
Geoffrey Letessier
IT manager & systems engineer
CNRS - UPR 9080 - Laboratoire de Biochimie Théorique
Institut de Biologie Physico-Chimique
13, rue Pierre et Marie Curie - 75005 Paris
Tel: 01 58 41 50 93 - eMail: geoffrey.letessier@xxxxxxx

<mybench.sh> <profile.txt> <vol_info.txt> <vol_status.txt>
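The attached mybench.sh is not reproduced in this archive; a minimal sketch of the kind of timing loop described above (untar, du, find, grep, tar, rm on the kernel tree) could look like the following, where the tarball name and the grep pattern are assumptions:

#!/bin/bash
# Small-file bench: time the same operations reported above, run from a
# directory on the gluster mount containing the kernel source tarball.
set -e
echo "### UNTAR ###"; time tar xf linux-4.1-rc6.tar.xz
echo "### DU    ###"; time du -sh linux-4.1-rc6
echo "### FIND  ###"; time find linux-4.1-rc6 | wc -l
echo "### GREP  ###"; time grep -r alloc_page linux-4.1-rc6 | wc -l
echo "### TAR   ###"; time tar cf linux-bench.tar linux-4.1-rc6
echo "### RM    ###"; time rm -rf linux-4.1-rc6 linux-bench.tar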
_______________________________________________
Gluster-users mailing list
Gluster-users@xxxxxxxxxxx
http://www.gluster.org/mailman/listinfo/gluster-users