Re: GlusterFS 3.7 - slow/poor performances

Hi Ben

Here is the expected output:
[root@node048 ~]# iperf3 -c 10.0.4.1
Connecting to host 10.0.4.1, port 5201
[  4] local 10.0.5.48 port 44151 connected to 10.0.4.1 port 5201
[ ID] Interval           Transfer     Bandwidth       Retr  Cwnd
[  4]   0.00-1.00   sec  1.86 GBytes  15.9 Gbits/sec    0   8.24 MBytes       
[  4]   1.00-2.00   sec  1.94 GBytes  16.7 Gbits/sec    0   8.24 MBytes       
[  4]   2.00-3.00   sec  1.95 GBytes  16.8 Gbits/sec    0   8.24 MBytes       
[  4]   3.00-4.00   sec  1.86 GBytes  16.0 Gbits/sec    0   8.24 MBytes       
[  4]   4.00-5.00   sec  1.85 GBytes  15.8 Gbits/sec    0   8.24 MBytes       
[  4]   5.00-6.00   sec  1.89 GBytes  16.2 Gbits/sec    0   8.24 MBytes       
[  4]   6.00-7.00   sec  1.90 GBytes  16.3 Gbits/sec    0   8.24 MBytes       
[  4]   7.00-8.00   sec  1.88 GBytes  16.1 Gbits/sec    0   8.24 MBytes       
[  4]   8.00-9.00   sec  1.88 GBytes  16.2 Gbits/sec    0   8.24 MBytes       
[  4]   9.00-10.00  sec  1.87 GBytes  16.1 Gbits/sec    0   8.24 MBytes       
- - - - - - - - - - - - - - - - - - - - - - - - -
[ ID] Interval           Transfer     Bandwidth       Retr
[  4]   0.00-10.00  sec  18.9 GBytes  16.2 Gbits/sec    0             sender
[  4]   0.00-10.00  sec  18.9 GBytes  16.2 Gbits/sec                  receiver

iperf Done.
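(For completeness: the other end of this test is just iperf3 running in server mode on 10.0.4.1, listening on the default port 5201.)

# on the server side, assumed from the client output above:
iperf3 -s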

Here are all the shell commands I used to create the volume with the RDMA transport type:
gluster volume create vol_home replica 2 transport rdma,tcp ib-storage1:/export/brick_home/brick1/ ib-storage2:/export/brick_home/brick1/ ib-storage3:/export/brick_home/brick1/ ib-storage4:/export/brick_home/brick1/ ib-storage1:/export/brick_home/brick2/ ib-storage2:/export/brick_home/brick2/ ib-storage3:/export/brick_home/brick2/ ib-storage4:/export/brick_home/brick2/ force

and below is the current volume information:
[root@lucifer ~]# gluster volume info vol_home

 

Volume Name: vol_home
Type: Distributed-Replicate
Volume ID: f6ebcfc1-b735-4a0e-b1d7-47ed2d2e7af6
Status: Started
Number of Bricks: 4 x 2 = 8
Transport-type: tcp,rdma
Bricks:
Brick1: ib-storage1:/export/brick_home/brick1
Brick2: ib-storage2:/export/brick_home/brick1
Brick3: ib-storage3:/export/brick_home/brick1
Brick4: ib-storage4:/export/brick_home/brick1
Brick5: ib-storage1:/export/brick_home/brick2
Brick6: ib-storage2:/export/brick_home/brick2
Brick7: ib-storage3:/export/brick_home/brick2
Brick8: ib-storage4:/export/brick_home/brick2
Options Reconfigured:
performance.stat-prefetch: on
performance.flush-behind: on
features.default-soft-limit: 90%
features.quota: on
diagnostics.brick-log-level: CRITICAL
auth.allow: localhost,127.0.0.1,10.*
nfs.disable: on
performance.cache-size: 64MB
performance.write-behind-window-size: 1MB
performance.quick-read: on
performance.io-cache: on
performance.io-thread-count: 64
nfs.enable-ino32: on
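(For reference, options like these are applied one at a time with gluster volume set; a minimal sketch using the same volume name:)

gluster volume set vol_home performance.io-thread-count 64
gluster volume set vol_home performance.cache-size 64MB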

and below is my mount command:
mount -t glusterfs -o transport=rdma,direct-io-mode=disable,enable-ino32 ib-storage1:vol_home /home

I don't get any errors with the RDMA option, but the transport type silently falls back to TCP.
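For what it's worth, a minimal way to check which transport the client actually negotiated (assuming the default client log naming, where a /home mount logs to /var/log/glusterfs/home.log):

# check the negotiated transport in the client log (log path assumed)
grep -i transport /var/log/glusterfs/home.log | tail
# for a tcp,rdma volume, appending ".rdma" to the volume name is supposed to select the RDMA transport:
mount -t glusterfs -o direct-io-mode=disable,enable-ino32 ib-storage1:/vol_home.rdma /home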

Did I make any mistake in my settings?

Can you tell me more about block size and other tuning I should do on my RDMA volumes?

Thanks in advance,
Geoffrey
------------------------------------------------------
Geoffrey Letessier
IT manager & systems engineer
UPR 9080 - CNRS - Laboratoire de Biochimie Théorique
Institut de Biologie Physico-Chimique
13, rue Pierre et Marie Curie - 75005 Paris
Tel: 01 58 41 50 93 - eMail: geoffrey.letessier@xxxxxxx

On 8 June 2015 at 18:22, Ben Turner <bturner@xxxxxxxxxx> wrote:

----- Original Message -----
From: "Geoffrey Letessier" <geoffrey.letessier@xxxxxxx>
To: "Ben Turner" <bturner@xxxxxxxxxx>
Cc: "Pranith Kumar Karampuri" <pkarampu@xxxxxxxxxx>, gluster-users@xxxxxxxxxxx
Sent: Monday, June 8, 2015 8:37:08 AM
Subject: Re: GlusterFS 3.7 - slow/poor performances

Hello,

Do you know anything more about this?

In addition, do you know how to "activate" RDMA for my volume with
Intel/QLogic QDR? Currently, I mount my volumes with the RDMA transport-type
option (on both the server and client side), but I notice all streams are
using the TCP stack, and my bandwidth never exceeds 2.0-2.5 Gb/s (250-300 MB/s).

That is a little slow for the HW you described.  Can you check what you get with iperf just between the clients and servers (https://iperf.fr/)?  With replica 2 and a 10G NW you should see ~400 MB/sec sequential writes and ~600 MB/sec reads.  Can you send me the output from gluster v info?  You specify RDMA volumes at create time by running gluster v create blah transport rdma; did you specify RDMA when you created the volume?  What block size are you using in your tests?  1024 KB writes perform best with glusterfs, and as the block size gets smaller, perf will drop a little bit.  I wouldn't write in anything under 4k blocks; the sweet spot is between 64k and 1024k.
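For example, a quick way to compare block sizes might be a pair of dd runs like these (paths are placeholders; conv=fdatasync flushes the data before dd reports its rate):

# 1 GiB sequential write in 1024k blocks (the suggested sweet spot)
dd if=/dev/zero of=/mnt/glusterfs/ddfile bs=1024k count=1024 conv=fdatasync
# the same 1 GiB written in 4k blocks, to see the small-block penalty
dd if=/dev/zero of=/mnt/glusterfs/ddfile bs=4k count=262144 conv=fdatasync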

-b


Thanks in advance,
Geoffrey
------------------------------------------------------
Geoffrey Letessier
IT manager & systems engineer
UPR 9080 - CNRS - Laboratoire de Biochimie Théorique
Institut de Biologie Physico-Chimique
13, rue Pierre et Marie Curie - 75005 Paris
Tel: 01 58 41 50 93 - eMail: geoffrey.letessier@xxxxxxx

On 2 June 2015 at 23:45, Geoffrey Letessier <geoffrey.letessier@xxxxxxx> wrote:

Hi Ben,

I just checked my messages log files, both on client and server, and I don't
find any hung tasks like the ones you noticed on yours.

As you can read below, I don't see the performance issue with a simple dd, but
I think my issue concerns sets of small files (tens of thousands, maybe more)…

[root@nisus test]# ddt -t 10g /mnt/test/
Writing to /mnt/test/ddt.8362 ... syncing ... done.
sleeping 10 seconds ... done.
Reading from /mnt/test/ddt.8362 ... done.
10240MiB    KiB/s  CPU%
Write      114770     4
Read        40675     4

for info: /mnt/test is the single (v2) GlusterFS volume

[root@nisus test]# ddt -t 10g /mnt/fhgfs/
Writing to /mnt/fhgfs/ddt.8380 ... syncing ... done.
sleeping 10 seconds ... done.
Reading from /mnt/fhgfs/ddt.8380 ... done.
10240MiB    KiB/s  CPU%
Write      102591     1
Read        98079     2

Do you have an idea how to tune/optimize the performance settings and/or the
TCP settings (MTU, etc.)?

---------------------------------------------------------------
|             |  UNTAR  |   DU   |  FIND   |   TAR   |   RM   |
---------------------------------------------------------------
| single      |  ~3m45s |   ~43s |    ~47s |  ~3m10s | ~3m15s |
---------------------------------------------------------------
| replicated  |  ~5m10s |   ~59s |   ~1m6s |  ~1m19s | ~1m49s |
---------------------------------------------------------------
| distributed |  ~4m18s |   ~41s |    ~57s |  ~2m24s | ~1m38s |
---------------------------------------------------------------
| dist-repl   |  ~8m18s |  ~1m4s |  ~1m11s |  ~1m24s | ~2m40s |
---------------------------------------------------------------
| native FS   |    ~11s |    ~4s |     ~2s |    ~56s |   ~10s |
---------------------------------------------------------------
| BeeGFS      |  ~3m43s |   ~15s |     ~3s |  ~1m33s |   ~46s |
---------------------------------------------------------------
| single (v2) |   ~3m6s |   ~14s |    ~32s |   ~1m2s |   ~44s |
---------------------------------------------------------------
for info:
- BeeGFS is a distributed FS (4 bricks, 2 bricks per server and 2 servers)
- single (v2): simple gluster volume with default settings

I also note that I get the same tar/untar performance issue with FhGFS/BeeGFS,
but the rest (DU, FIND, RM) looks OK.

Thank you very much for your reply and help.
Geoffrey
-----------------------------------------------
Geoffrey Letessier

IT manager & systems engineer
CNRS - UPR 9080 - Laboratoire de Biochimie Théorique
Institut de Biologie Physico-Chimique
13, rue Pierre et Marie Curie - 75005 Paris
Tel: 01 58 41 50 93 - eMail: geoffrey.letessier@xxxxxxx
On 2 June 2015 at 21:53, Ben Turner <bturner@xxxxxxxxxx> wrote:

I am seeing problems on 3.7 as well.  Can you check /var/log/messages on
both the clients and servers for hung tasks like:

Jun  2 15:23:14 gqac006 kernel: "echo 0 >
/proc/sys/kernel/hung_task_timeout_secs" disables this message.
Jun  2 15:23:14 gqac006 kernel: iozone        D 0000000000000001     0
21999      1 0x00000080
Jun  2 15:23:14 gqac006 kernel: ffff880611321cc8 0000000000000082
ffff880611321c18 ffffffffa027236e
Jun  2 15:23:14 gqac006 kernel: ffff880611321c48 ffffffffa0272c10
ffff88052bd1e040 ffff880611321c78
Jun  2 15:23:14 gqac006 kernel: ffff88052bd1e0f0 ffff88062080c7a0
ffff880625addaf8 ffff880611321fd8
Jun  2 15:23:14 gqac006 kernel: Call Trace:
Jun  2 15:23:14 gqac006 kernel: [<ffffffffa027236e>] ?
rpc_make_runnable+0x7e/0x80 [sunrpc]
Jun  2 15:23:14 gqac006 kernel: [<ffffffffa0272c10>] ?
rpc_execute+0x50/0xa0 [sunrpc]
Jun  2 15:23:14 gqac006 kernel: [<ffffffff810aaa21>] ?
ktime_get_ts+0xb1/0xf0
Jun  2 15:23:14 gqac006 kernel: [<ffffffff811242d0>] ? sync_page+0x0/0x50
Jun  2 15:23:14 gqac006 kernel: [<ffffffff8152a1b3>] io_schedule+0x73/0xc0
Jun  2 15:23:14 gqac006 kernel: [<ffffffff8112430d>] sync_page+0x3d/0x50
Jun  2 15:23:14 gqac006 kernel: [<ffffffff8152ac7f>]
__wait_on_bit+0x5f/0x90
Jun  2 15:23:14 gqac006 kernel: [<ffffffff81124543>]
wait_on_page_bit+0x73/0x80
Jun  2 15:23:14 gqac006 kernel: [<ffffffff8109eb80>] ?
wake_bit_function+0x0/0x50
Jun  2 15:23:14 gqac006 kernel: [<ffffffff8113a525>] ?
pagevec_lookup_tag+0x25/0x40
Jun  2 15:23:14 gqac006 kernel: [<ffffffff8112496b>]
wait_on_page_writeback_range+0xfb/0x190
Jun  2 15:23:14 gqac006 kernel: [<ffffffff81124b38>]
filemap_write_and_wait_range+0x78/0x90
Jun  2 15:23:14 gqac006 kernel: [<ffffffff811c07ce>]
vfs_fsync_range+0x7e/0x100
Jun  2 15:23:14 gqac006 kernel: [<ffffffff811c08bd>] vfs_fsync+0x1d/0x20
Jun  2 15:23:14 gqac006 kernel: [<ffffffff811c08fe>] do_fsync+0x3e/0x60
Jun  2 15:23:14 gqac006 kernel: [<ffffffff811c0950>] sys_fsync+0x10/0x20
Jun  2 15:23:14 gqac006 kernel: [<ffffffff8100b072>]
system_call_fastpath+0x16/0x1b

Do you see a perf problem with just a simple DD or do you need a more
complex workload to hit the issue?  I think I saw an issue with metadata
performance that I am trying to run down, let me know if you can see the
problem with simple DD reads / writes or if we need to do some sort of
dir / metadata access as well.
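(If useful, a crude dir/metadata workload, with a hypothetical mount point, could be something like:)

# create and then list many small files to exercise the metadata paths
mkdir -p /mnt/test/smallfiles
time (for i in $(seq 1 10000); do echo x > /mnt/test/smallfiles/f$i; done)
time (find /mnt/test/smallfiles | wc -l)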

-b

----- Original Message -----
From: "Geoffrey Letessier" <geoffrey.letessier@xxxxxxx
<mailto:geoffrey.letessier@xxxxxxx>>
To: "Pranith Kumar Karampuri" <pkarampu@xxxxxxxxxx
<mailto:pkarampu@xxxxxxxxxx>>
Cc: gluster-users@xxxxxxxxxxx <mailto:gluster-users@xxxxxxxxxxx>
Sent: Tuesday, June 2, 2015 8:09:04 AM
Subject: Re: GlusterFS 3.7 - slow/poor performances

Hi Pranith,

I'm sorry, but I cannot give you any comparison, because it would be distorted
by the fact that in my production HPC cluster the network technology is
InfiniBand QDR and my volumes are quite different (bricks in RAID6 (12x2TB),
2 bricks per server and 4 servers in my pool).

Concerning your request, you can find in the attachments all the expected
results; I hope it helps you solve this serious performance issue (maybe I
need to play with some glusterfs parameters?).

Thank you very much in advance,
Geoffrey
------------------------------------------------------
Geoffrey Letessier
IT manager & systems engineer
UPR 9080 - CNRS - Laboratoire de Biochimie Théorique
Institut de Biologie Physico-Chimique
13, rue Pierre et Marie Curie - 75005 Paris
Tel: 01 58 41 50 93 - eMail: geoffrey.letessier@xxxxxxx




On 2 June 2015 at 10:09, Pranith Kumar Karampuri <pkarampu@xxxxxxxxxx> wrote:

Hi Geoffrey,
Since you say it happens on all types of volumes, let's do the following:
1) Create a dist-repl volume
2) Set the options etc you need.
3) enable gluster volume profile using "gluster volume profile <volname>
start"
4) run the work load
5) give output of "gluster volume profile <volname> info"
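A minimal sketch of steps 1-5, with hypothetical server and brick names:

gluster volume create testvol replica 2 server1:/bricks/b1 server2:/bricks/b1 server1:/bricks/b2 server2:/bricks/b2
gluster volume start testvol
gluster volume profile testvol start
# ... mount the volume and run the workload here ...
gluster volume profile testvol info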

Repeat the steps above on the new and the old version you are comparing.
That should give us insight into what could be causing the slowness.

Pranith
On 06/02/2015 03:22 AM, Geoffrey Letessier wrote:


Dear all,

I have a crash-test cluster where I tested the new version of GlusterFS (v3.7)
before upgrading my HPC cluster in production.
But… all my tests show very, very low performance.

For my benchmarks, as you can read below, I run some actions (untar, du, find,
tar, rm) on the Linux kernel sources, dropping caches, each on distributed,
replicated, distributed-replicated and single (single-brick) volumes, and on
the native FS of one brick.

# time (echo 3 > /proc/sys/vm/drop_caches; tar xJf ~/linux-4.1-rc5.tar.xz; sync; echo 3 > /proc/sys/vm/drop_caches)
# time (echo 3 > /proc/sys/vm/drop_caches; du -sh linux-4.1-rc5/; echo 3 > /proc/sys/vm/drop_caches)
# time (echo 3 > /proc/sys/vm/drop_caches; find linux-4.1-rc5/ | wc -l; echo 3 > /proc/sys/vm/drop_caches)
# time (echo 3 > /proc/sys/vm/drop_caches; tar czf linux-4.1-rc5.tgz linux-4.1-rc5/; echo 3 > /proc/sys/vm/drop_caches)
# time (echo 3 > /proc/sys/vm/drop_caches; rm -rf linux-4.1-rc5.tgz linux-4.1-rc5/; echo 3 > /proc/sys/vm/drop_caches)

And here are the process times:

---------------------------------------------------------------
|             |  UNTAR  |   DU   |  FIND   |   TAR   |   RM   |
---------------------------------------------------------------
| single      |  ~3m45s |   ~43s |    ~47s |  ~3m10s | ~3m15s |
---------------------------------------------------------------
| replicated  |  ~5m10s |   ~59s |   ~1m6s |  ~1m19s | ~1m49s |
---------------------------------------------------------------
| distributed |  ~4m18s |   ~41s |    ~57s |  ~2m24s | ~1m38s |
---------------------------------------------------------------
| dist-repl   |  ~8m18s |  ~1m4s |  ~1m11s |  ~1m24s | ~2m40s |
---------------------------------------------------------------
| native FS   |    ~11s |    ~4s |     ~2s |    ~56s |   ~10s |
---------------------------------------------------------------

I get the same results with default configurations and with custom
configurations.

If I look at the output of the ifstat command, I can note that my I/O write
processes never exceed 3 MB/s...

The native EXT4 FS seems to be faster (roughly 15-20%, but no more) than the
XFS one.

My [test] storage cluster is composed of 2 identical servers (dual-CPU Intel
Xeon X5355, 8 GB of RAM, 2x2TB HDD (no RAID) and Gb Ethernet).

My volume settings:
single: 1 server, 1 brick
replicated: 2 servers, 1 brick each
distributed: 2 servers, 2 bricks each
dist-repl: 2 bricks on the same server and replica 2
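For reference, a sketch of how volumes like these could be created (hostnames and brick paths hypothetical; the actual layout may differ):

gluster volume create vol_single server1:/export/brick1
gluster volume create vol_repl replica 2 server1:/export/brick1 server2:/export/brick1
gluster volume create vol_dist server1:/export/brick1 server1:/export/brick2 server2:/export/brick1 server2:/export/brick2
# dist-repl with both replicas of a pair on the same server needs force
gluster volume create vol_distrepl replica 2 server1:/export/brick1 server1:/export/brick2 server2:/export/brick1 server2:/export/brick2 force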

Everything seems to be OK in the gluster status output.

Do you have an idea why I get such bad results?
Thanks in advance.
Geoffrey
-----------------------------------------------
Geoffrey Letessier

IT manager & systems engineer
CNRS - UPR 9080 - Laboratoire de Biochimie Théorique
Institut de Biologie Physico-Chimique
13, rue Pierre et Marie Curie - 75005 Paris
Tel: 01 58 41 50 93 - eMail: geoffrey.letessier@xxxxxxx








_______________________________________________
Gluster-users mailing list
Gluster-users@xxxxxxxxxxx
http://www.gluster.org/mailman/listinfo/gluster-users
