I am seeing problems on 3.7 as well. Can you check /var/log/messages on both
the clients and the servers for hung tasks like:

Jun 2 15:23:14 gqac006 kernel: "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
Jun 2 15:23:14 gqac006 kernel: iozone D 0000000000000001 0 21999 1 0x00000080
Jun 2 15:23:14 gqac006 kernel: ffff880611321cc8 0000000000000082 ffff880611321c18 ffffffffa027236e
Jun 2 15:23:14 gqac006 kernel: ffff880611321c48 ffffffffa0272c10 ffff88052bd1e040 ffff880611321c78
Jun 2 15:23:14 gqac006 kernel: ffff88052bd1e0f0 ffff88062080c7a0 ffff880625addaf8 ffff880611321fd8
Jun 2 15:23:14 gqac006 kernel: Call Trace:
Jun 2 15:23:14 gqac006 kernel: [<ffffffffa027236e>] ? rpc_make_runnable+0x7e/0x80 [sunrpc]
Jun 2 15:23:14 gqac006 kernel: [<ffffffffa0272c10>] ? rpc_execute+0x50/0xa0 [sunrpc]
Jun 2 15:23:14 gqac006 kernel: [<ffffffff810aaa21>] ? ktime_get_ts+0xb1/0xf0
Jun 2 15:23:14 gqac006 kernel: [<ffffffff811242d0>] ? sync_page+0x0/0x50
Jun 2 15:23:14 gqac006 kernel: [<ffffffff8152a1b3>] io_schedule+0x73/0xc0
Jun 2 15:23:14 gqac006 kernel: [<ffffffff8112430d>] sync_page+0x3d/0x50
Jun 2 15:23:14 gqac006 kernel: [<ffffffff8152ac7f>] __wait_on_bit+0x5f/0x90
Jun 2 15:23:14 gqac006 kernel: [<ffffffff81124543>] wait_on_page_bit+0x73/0x80
Jun 2 15:23:14 gqac006 kernel: [<ffffffff8109eb80>] ? wake_bit_function+0x0/0x50
Jun 2 15:23:14 gqac006 kernel: [<ffffffff8113a525>] ? pagevec_lookup_tag+0x25/0x40
Jun 2 15:23:14 gqac006 kernel: [<ffffffff8112496b>] wait_on_page_writeback_range+0xfb/0x190
Jun 2 15:23:14 gqac006 kernel: [<ffffffff81124b38>] filemap_write_and_wait_range+0x78/0x90
Jun 2 15:23:14 gqac006 kernel: [<ffffffff811c07ce>] vfs_fsync_range+0x7e/0x100
Jun 2 15:23:14 gqac006 kernel: [<ffffffff811c08bd>] vfs_fsync+0x1d/0x20
Jun 2 15:23:14 gqac006 kernel: [<ffffffff811c08fe>] do_fsync+0x3e/0x60
Jun 2 15:23:14 gqac006 kernel: [<ffffffff811c0950>] sys_fsync+0x10/0x20
Jun 2 15:23:14 gqac006 kernel: [<ffffffff8100b072>] system_call_fastpath+0x16/0x1b

Do you see a performance problem with just a simple dd, or do you need a more
complex workload to hit the issue? I think I saw an issue with metadata
performance that I am trying to run down, so let me know whether you can see
the problem with simple dd reads/writes or whether we need some sort of
directory/metadata access as well.
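If it helps, this is roughly what I run to check for those hung tasks and to
get a quick single-stream baseline. It is only a sketch: the mount point
(/mnt/glusterfs) and the 4GB test size are assumptions, so substitute your
own values.

# scan client and server logs for blocked-task reports
grep -B2 -A25 "blocked for more than" /var/log/messages

# single-stream write through the Gluster mount, flushing data before dd exits
dd if=/dev/zero of=/mnt/glusterfs/ddtest bs=1M count=4096 conv=fdatasync

# drop caches, then read the same file back
echo 3 > /proc/sys/vm/drop_caches
dd if=/mnt/glusterfs/ddtest of=/dev/null bs=1M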
-b

----- Original Message -----
> From: "Geoffrey Letessier" <geoffrey.letessier@xxxxxxx>
> To: "Pranith Kumar Karampuri" <pkarampu@xxxxxxxxxx>
> Cc: gluster-users@xxxxxxxxxxx
> Sent: Tuesday, June 2, 2015 8:09:04 AM
> Subject: Re: GlusterFS 3.7 - slow/poor performances
>
> Hi Pranith,
>
> I'm sorry, but I cannot give you a comparison, because it would be
> distorted by how different my production HPC cluster is: there the
> network technology is InfiniBand QDR and the volumes are built quite
> differently (bricks in RAID6 (12x2TB), 2 bricks per server and 4
> servers in the pool).
>
> Concerning your request, you will find in the attachments all the
> expected results; I hope they help you solve this serious performance
> issue (maybe I need to play with some GlusterFS parameters?).
>
> Thank you very much in advance,
> Geoffrey
> ------------------------------------------------------
> Geoffrey Letessier
> IT manager & systems engineer
> UPR 9080 - CNRS - Laboratoire de Biochimie Théorique
> Institut de Biologie Physico-Chimique
> 13, rue Pierre et Marie Curie - 75005 Paris
> Tel: 01 58 41 50 93 - eMail: geoffrey.letessier@xxxxxxx
>
> On 2 June 2015 at 10:09, Pranith Kumar Karampuri <pkarampu@xxxxxxxxxx> wrote:
>
> Hi Geoffrey,
> Since you are saying it happens on all types of volumes, let's do the
> following:
> 1) Create a dist-repl volume.
> 2) Set the options etc. that you need.
> 3) Enable profiling using "gluster volume profile <volname> start".
> 4) Run the workload.
> 5) Give the output of "gluster volume profile <volname> info".
>
> Repeat the steps above on both the new and the old version you are
> comparing. That should give us insight into what could be causing the
> slowness.
>
> Pranith
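(For reference, Pranith's steps boil down to something like the sequence
below; the volume name "testvol" and the server/brick paths are placeholders,
so adjust them to your pool:

gluster volume create testvol replica 2 \
    server1:/export/brick1 server2:/export/brick1 \
    server1:/export/brick2 server2:/export/brick2
gluster volume start testvol
gluster volume profile testvol start
# ... run the workload against a client mount ...
gluster volume profile testvol info

The "info" output lists per-brick call counts and latencies per FOP, which is
what we want to compare between the old and new versions.)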
> On 06/02/2015 03:22 AM, Geoffrey Letessier wrote:
>
> Dear all,
>
> I have a crash-test cluster on which I have tested the new version of
> GlusterFS (v3.7) before upgrading my production HPC cluster. But all my
> tests show very, very poor performance.
>
> For my benchmarks, as you can read below, I run a few operations (untar,
> du, find, tar, rm) on the Linux kernel sources, dropping caches around
> each one, on distributed, replicated, distributed-replicated and single
> (one-brick) volumes, as well as on the native FS of one brick:
>
> # time (echo 3 > /proc/sys/vm/drop_caches; tar xJf ~/linux-4.1-rc5.tar.xz; sync; echo 3 > /proc/sys/vm/drop_caches)
> # time (echo 3 > /proc/sys/vm/drop_caches; du -sh linux-4.1-rc5/; echo 3 > /proc/sys/vm/drop_caches)
> # time (echo 3 > /proc/sys/vm/drop_caches; find linux-4.1-rc5/|wc -l; echo 3 > /proc/sys/vm/drop_caches)
> # time (echo 3 > /proc/sys/vm/drop_caches; tar czf linux-4.1-rc5.tgz linux-4.1-rc5/; echo 3 > /proc/sys/vm/drop_caches)
> # time (echo 3 > /proc/sys/vm/drop_caches; rm -rf linux-4.1-rc5.tgz linux-4.1-rc5/; echo 3 > /proc/sys/vm/drop_caches)
>
> And here are the process times:
>
> ---------------------------------------------------------------
> |             | UNTAR  | DU    | FIND   | TAR    | RM     |
> ---------------------------------------------------------------
> | single      | ~3m45s | ~43s  | ~47s   | ~3m10s | ~3m15s |
> ---------------------------------------------------------------
> | replicated  | ~5m10s | ~59s  | ~1m6s  | ~1m19s | ~1m49s |
> ---------------------------------------------------------------
> | distributed | ~4m18s | ~41s  | ~57s   | ~2m24s | ~1m38s |
> ---------------------------------------------------------------
> | dist-repl   | ~8m18s | ~1m4s | ~1m11s | ~1m24s | ~2m40s |
> ---------------------------------------------------------------
> | native FS   | ~11s   | ~4s   | ~2s    | ~56s   | ~10s   |
> ---------------------------------------------------------------
>
> I get the same results with the default configuration as with custom
> configurations.
>
> Looking at the output of the ifstat command, I note that my I/O write
> processes never exceed 3 MB/s...
>
> The EXT4 native FS seems to be faster than the XFS one (by roughly
> 15-20%, but no more).
>
> My [test] storage cluster is composed of 2 identical servers (dual-CPU
> Intel Xeon X5355, 8GB of RAM, 2x2TB HDD (no RAID) and Gb Ethernet).
>
> My volume settings:
> single: 1 server, 1 brick
> replicated: 2 servers, 1 brick each
> distributed: 2 servers, 2 bricks each
> dist-repl: 2 bricks on the same server, with replica 2
>
> Everything seems to be OK in the gluster status command-line output.
>
> Do you have an idea why I obtain such bad results?
> Thanks in advance.
> Geoffrey
> -----------------------------------------------
> Geoffrey Letessier
> IT manager & systems engineer
> CNRS - UPR 9080 - Laboratoire de Biochimie Théorique
> Institut de Biologie Physico-Chimique
> 13, rue Pierre et Marie Curie - 75005 Paris
> Tel: 01 58 41 50 93 - eMail: geoffrey.letessier@xxxxxxx

_______________________________________________
Gluster-users mailing list
Gluster-users@xxxxxxxxxxx
http://www.gluster.org/mailman/listinfo/gluster-users