I am afraid I am not experienced enough to be much more useful.
My guess is that, since the client writes synchronously to all replica nodes (to keep data coherent), it can only go as fast as the slowest brick.
Small files are also often slow because the TCP window doesn't have time to grow.
That's why I gave you some kernel tuning to help the TCP window grow faster.
Are you using the latest version (3.7.1)?
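For what it's worth, both points are quick to check on every client and server; just a sketch, assuming the gluster CLI is installed and using the sysctl keys from the tuning quoted further down:

# Installed GlusterFS version (should match across clients and servers)
gluster --version | head -n 1
# Verify the TCP buffer tuning actually took effect
sysctl net.core.rmem_max net.core.wmem_max net.ipv4.tcp_rmem net.ipv4.tcp_wmem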
2015-06-20 11:01 GMT+02:00 Geoffrey Letessier <geoffrey.letessier@xxxxxxx>:
Hello Mathieu,

Thanks for replying.

Previously I had never noticed such throughput figures (around 1 GB/s for one big file), but... The situation with a "big" set of small files was never amazing, but not as bad as today. The problem seems to depend exclusively on the size of each file.

"Proof":

[root@node056 tmp]# dd if=/dev/zero of=masterfile bs=1M count=1000
1000+0 records in
1000+0 records out
1048576000 bytes (1.0 GB) copied, 2.09139 s, 501 MB/s
[root@node056 tmp]# time split -b 1000000 -a 12 masterfile     # 1MB per file
real    0m42.841s
user    0m0.004s
sys     0m1.416s
[root@node056 tmp]# rm -f xaaaaaaaaa* && sync
[root@node056 tmp]# time split -b 5000000 -a 12 masterfile     # 5MB per file
real    0m17.801s
user    0m0.008s
sys     0m1.396s
[root@node056 tmp]# rm -f xaaaaaaaaa* && sync
[root@node056 tmp]# time split -b 10000000 -a 12 masterfile    # 10MB per file
real    0m9.686s
user    0m0.008s
sys     0m1.451s
[root@node056 tmp]# rm -f xaaaaaaaaa* && sync
[root@node056 tmp]# time split -b 20000000 -a 12 masterfile    # 20MB per file
real    0m9.717s
user    0m0.003s
sys     0m1.399s
[root@node056 tmp]# rm -f xaaaaaaaaa* && sync
[root@node056 tmp]# time split -b 1000000 -a 12 masterfile     # 1MB per file (again)
real    0m40.283s
user    0m0.007s
sys     0m1.390s
[root@node056 tmp]# rm -f xaaaaaaaaa* && sync

The larger the generated files, the better the performance (I/O throughput and run time)... The ifstat output is consistent on both the client/node and the server side.

A new test:

[root@node056 tmp]# dd if=/dev/zero of=masterfile bs=1M count=10000
10000+0 records in
10000+0 records out
10485760000 bytes (10 GB) copied, 23.0044 s, 456 MB/s
[root@node056 tmp]# rm -f xaaaaaaaaa* && sync
[root@node056 tmp]# time split -b 10000000 -a 12 masterfile    # 10MB per file
real    1m43.216s
user    0m0.038s
sys     0m13.407s

So the per-file performance is the same (despite there being 10x more files).

So I don't understand why, to get the best performance, I need to create files with a size of 10MB or more.

Here are my volume's reconfigured options:

performance.cache-max-file-size: 64MB
performance.read-ahead: on
performance.write-behind: on
features.quota-deem-statfs: on
performance.stat-prefetch: on
performance.flush-behind: on
features.default-soft-limit: 90%
features.quota: on
diagnostics.brick-log-level: CRITICAL
auth.allow: localhost,127.0.0.1,10.*
nfs.disable: on
performance.cache-size: 1GB
performance.write-behind-window-size: 4MB
performance.quick-read: on
performance.io-cache: on
performance.io-thread-count: 64
nfs.enable-ino32: off

It's not a local cache problem because:
1- it's disabled in my mount command:
   mount -t glusterfs -o transport=rdma,direct-io-mode=disable,enable-ino32 ib-storage1:vol_home /home
2- I also ran my tests while playing with /proc/sys/vm/drop_caches
3- I see the same ifstat output on both the client and server side, which is consistent with the computed bandwidth (file sizes / time, taking the replication into account).

I don't think it's an InfiniBand network problem either, but here are my [non-default] settings: connected mode with MTU set to 65520.

Do you confirm my feeling? If yes, do you have any other idea?

Thanks again and thanks in advance,
Geoffrey
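PS: for reference, the whole sweep above can be scripted in one go; this is just a sketch that mirrors the commands already shown (the chunk sizes and working directory are the ones used above):

#!/bin/bash
# Re-run the split benchmark for several chunk sizes on the gluster mount.
# Mirrors the manual commands above; run it from the same test directory.
dd if=/dev/zero of=masterfile bs=1M count=1000
for size in 1000000 5000000 10000000 20000000; do
    echo "--- chunk size: ${size} bytes ---"
    time split -b ${size} -a 12 masterfile
    rm -f xaaaaaaaaa* && sync
done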
-----------------------------------------------
Geoffrey Letessier
IT manager & systems engineer
CNRS - UPR 9080 - Laboratoire de Biochimie Théorique
Institut de Biologie Physico-Chimique
13, rue Pierre et Marie Curie - 75005 Paris
Tel: 01 58 41 50 93 - eMail: geoffrey.letessier@xxxxxxx

On 20 June 2015 at 09:12, Mathieu Chateau <mathieu.chateau@xxxxxxx> wrote:

Hello,

For the replicated volume, is this a new issue or did you just not notice it before? Is the baseline the same as before?

I also see slowness with small files / many files. For now I could only improve things with:

At the gluster level:
gluster volume set myvolume performance.io-thread-count 16
gluster volume set myvolume performance.cache-size 1GB
gluster volume set myvolume nfs.disable on
gluster volume set myvolume readdir-ahead enable
gluster volume set myvolume read-ahead disable

At the network level (client and server; I don't have InfiniBand):
sysctl -w vm.swappiness=0
sysctl -w net.core.rmem_max=67108864
sysctl -w net.core.wmem_max=67108864
# increase Linux autotuning TCP buffer limit to 32MB
sysctl -w net.ipv4.tcp_rmem="4096 87380 33554432"
sysctl -w net.ipv4.tcp_wmem="4096 65536 33554432"
# increase the length of the processor input queue
sysctl -w net.core.netdev_max_backlog=30000
# recommended default congestion control is htcp
sysctl -w net.ipv4.tcp_congestion_control=htcp

But it's still really slow, even if it is better.

2015-06-20 2:34 GMT+02:00 Geoffrey Letessier <geoffrey.letessier@xxxxxxx>:

Re,

For comparison, here is the output of the same script run on a distributed-only volume (2 of the 4 servers previously described, 2 bricks each):

########## UNTAR time consumed ##########
real    1m44.698s
user    0m8.891s
sys     0m8.353s

########## DU time consumed ##########
554M    linux-4.1-rc6
real    0m21.062s
user    0m0.100s
sys     0m1.040s

########## FIND time consumed ##########
52663
real    0m21.325s
user    0m0.104s
sys     0m1.054s

########## GREP time consumed ##########
7952
real    0m43.618s
user    0m0.922s
sys     0m3.626s

########## TAR time consumed ##########
real    0m50.577s
user    0m29.745s
sys     0m4.086s

########## RM time consumed ##########
real    0m41.133s
user    0m0.171s
sys     0m2.522s

The performances are amazingly different!

Geoffrey
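For reference, gluster's built-in profiling (the same mechanism behind the profile.txt attached further down the thread) can be wrapped around this bench on each volume to see where the time goes; the volume name below is just illustrative:

gluster volume profile vol_home start
# ... run the untar/du/find/grep/tar/rm bench on the mount ...
gluster volume profile vol_home info > profile_vol_home.txt
gluster volume profile vol_home stop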
-----------------------------------------------
Geoffrey Letessier
IT manager & systems engineer
CNRS - UPR 9080 - Laboratoire de Biochimie Théorique
Institut de Biologie Physico-Chimique
13, rue Pierre et Marie Curie - 75005 Paris
Tel: 01 58 41 50 93 - eMail: geoffrey.letessier@xxxxxxx

On 20 June 2015 at 02:12, Geoffrey Letessier <geoffrey.letessier@xxxxxxx> wrote:

<benches.txt>

Dear all,

I just noticed that on the main volume of my HPC cluster my I/O performance has become impressively poor. Doing some file operations on a compressed Linux kernel source archive (roughly 80MB, with 52,000 files inside), the untar operation alone can take more than half an hour, as you can read below:

########## UNTAR time consumed ##########
real    32m42.967s
user    0m11.783s
sys     0m15.050s

########## DU time consumed ##########
557M    linux-4.1-rc6
real    0m25.060s
user    0m0.068s
sys     0m0.344s

########## FIND time consumed ##########
52663
real    0m25.687s
user    0m0.084s
sys     0m0.387s

########## GREP time consumed ##########
7952
real    2m15.890s
user    0m0.887s
sys     0m2.777s

########## TAR time consumed ##########
real    1m5.551s
user    0m26.536s
sys     0m2.609s

########## RM time consumed ##########
real    2m51.485s
user    0m0.167s
sys     0m1.663s

For information, this volume is a distributed-replicated one, composed of 4 servers with 2 bricks each. Each brick is a 12-drive RAID6 vdisk with good native performance (around 1.2 GB/s).

In comparison, when I use dd to generate a 100GB file on the same volume, my write throughput is around 1 GB/s (client side) and 500 MB/s (server side) because of the replication:

Client side:
[root@node056 ~]# ifstat -i ib0
       ib0
 KB/s in  KB/s out
 3251.45  1.09e+06
 3139.80  1.05e+06
 3185.29  1.06e+06
 3293.84  1.09e+06
...

Server side:
[root@lucifer ~]# ifstat -i ib0
       ib0
 KB/s in  KB/s out
561818.1   1746.42
560020.3   1737.92
526337.1   1648.20
513972.7   1613.69
...

dd command:
[root@node056 ~]# dd if=/dev/zero of=/home/root/test.dd bs=1M count=100000
100000+0 records in
100000+0 records out
104857600000 bytes (105 GB) copied, 202.99 s, 517 MB/s

So this issue does not seem to come from the network (which is InfiniBand in this case).

You can find a set of files in the attachments:
- mybench.sh: the bench script
- benches.txt: output of my "bench"
- profile.txt: gluster volume profile during the "bench"
- vol_status.txt: gluster volume status
- vol_info.txt: gluster volume info

Can someone help me fix it? It's very critical because this volume is on an HPC cluster in production.

Thanks in advance,
Geoffrey
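A side note on the dd figure above: to be sure the page cache is not inflating it, the write can be forced to disk and the cache dropped between runs; just a sketch, where conv=fsync and the drop_caches step are the usual precautions and the path is the same test file as above:

# Account for the final flush to the bricks in the measured time
dd if=/dev/zero of=/home/root/test.dd bs=1M count=100000 conv=fsync
# Drop the client page cache before a follow-up test
sync && echo 3 > /proc/sys/vm/drop_caches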
-----------------------------------------------
Geoffrey Letessier
IT manager & systems engineer
CNRS - UPR 9080 - Laboratoire de Biochimie Théorique
Institut de Biologie Physico-Chimique
13, rue Pierre et Marie Curie - 75005 Paris
Tel: 01 58 41 50 93 - eMail: geoffrey.letessier@xxxxxxx

<mybench.sh> <profile.txt> <vol_info.txt> <vol_status.txt>
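The attached mybench.sh is not reproduced in this archive; a minimal sketch of the kind of timing loop described above (untar, du, find, grep, tar, rm on the kernel tree) could look like the following, where the tarball name and the grep pattern are assumptions:

#!/bin/bash
# Small-file bench: time the same operations reported above, run from a
# directory on the gluster mount containing the kernel source tarball.
set -e
echo "### UNTAR ###"; time tar xf linux-4.1-rc6.tar.xz
echo "### DU    ###"; time du -sh linux-4.1-rc6
echo "### FIND  ###"; time find linux-4.1-rc6 | wc -l
echo "### GREP  ###"; time grep -r alloc_page linux-4.1-rc6 | wc -l
echo "### TAR   ###"; time tar cf linux-bench.tar linux-4.1-rc6
echo "### RM    ###"; time rm -rf linux-4.1-rc6 linux-bench.tar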
_______________________________________________
Gluster-users mailing list
Gluster-users@xxxxxxxxxxx
http://www.gluster.org/mailman/listinfo/gluster-users