Re: Gluster linear scale-out performance

Ravishankar N <ravishankar@xxxxxxxxxx> · Fri, 31 Jul 2020 08:50:32 +0530

    On 25/07/20 4:35 am, Artem Russakovskii
      wrote:

      Speaking of fio, could the gluster team please help
        me understand something?

        We've been having lots of performance issues related to
          gluster using attached block storage on Linode. At some point,
          I figured out that Linode has a cap of 500 IOPS on their block
            storage (with spikes to 1500 IOPS). The block storage we
          use is formatted xfs with 4KB bsize (block size). 

        I then ran a bunch of fio tests on the block storage itself
          (not the gluster fuse mount), which performed horribly when
          the bs parameter was set to 4k: 
        fio --randrepeat=1 --ioengine=libaio --direct=1 --gtod_reduce=1 --name=test --filename=test --bs=4k --iodepth=64 --size=4G --readwrite=randwrite --ramp_time=4

          During these tests, the fio ETA crawled to over an hour, at one
          point dropped to 45 min, and I briefly saw 500-1500 IOPS flash by
          before it went back down to 0. I/O seems badly choked for some
          reason, likely because gluster is using some of it. At a 4k block
          size, transfer speed is about 2 MB/s with spikes to 6 MB/s. This
          causes the load on the server to spike to 100+ and brings down
          all our servers.
          Jobs: 1 (f=1): [w(1)][20.3%][r=0KiB/s,w=5908KiB/s][r=0,w=1477 IOPS][eta 43m:00s]
          Jobs: 1 (f=1): [w(1)][21.5%][r=0KiB/s,w=0KiB/s][r=0,w=0 IOPS][eta 44m:54s]

          xfs_info /mnt/citadel_block1
meta-data=""               isize=512    agcount=103, agsize=26214400 blks
         =                       sectsz=512   attr=2, projid32bit=1
         =                       crc=1        finobt=1, sparse=0, rmapbt=0
         =                       reflink=0
data     =                       ""   blocks=2684354560, imaxpct=25
         =                       sunit=0      swidth=0 blks
naming   =version 2              bsize=4096   ascii-ci=0, ftype=1
log      =internal log           bsize=4096   blocks=51200, version=2
         =                       sectsz=512   sunit=0 blks, lazy-count=1
realtime =none                   extsz=4096   blocks=0, rtextents=0

                        When I increase the --bs param to fio from
                          4k to, say, 64k, transfer speed goes up
                          significantly and is more like 50MB/s, and at
                          256k, it's 200MB/s.

                        So what I'm trying to understand is:

                            How does the xfs block size (4KB) relate
                              to the block size in fio tests? If we're
                              limited by IOPS, and xfs block size is
                              4KB, how can fio produce better results
                              with varying --bs param? 
                            Would increasing the xfs data block size
                              to something like 64-256KB help with our
                              issue of choking IO and skyrocketing load?
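
A rough way to relate the two block sizes, assuming the provider's cap counts requests rather than bytes (an assumption, not something confirmed in this thread): the xfs block size is the filesystem's allocation unit, while fio's --bs with --direct=1 sets the size of each request, so throughput is roughly IOPS times --bs. The helper name throughput_mb_s below is illustrative only.

```python
# Rough model: throughput = IOPS * request size.
# The 500 / 1500 IOPS figures are the caps quoted in the message above;
# treating each fio request as one I/O against the cap is an assumption.

KIB = 1024


def throughput_mb_s(iops: int, bs_kib: int) -> float:
    """Return throughput in MB/s (10^6 bytes/s) for an IOPS cap and request size."""
    return iops * bs_kib * KIB / 1_000_000


for bs_kib in (4, 64, 256):
    sustained = throughput_mb_s(500, bs_kib)
    burst = throughput_mb_s(1500, bs_kib)
    print(f"bs={bs_kib:>3}k  sustained ~{sustained:6.1f} MB/s  burst ~{burst:6.1f} MB/s")
```

Under these assumptions, 4k works out to about 2 MB/s sustained and 6 MB/s at burst, which lines up with the fio output above (1477 IOPS at 5908 KiB/s). The 64k and 256k figures reported above fall between the sustained and burst estimates.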

    I have experienced similar behavior when running fio tests with
    bs=4k on a gluster volume backed by XFS under high load
    (numjobs=32). When I observed the strace of the brick processes
    (strace -f -T -p $PID), I saw fsync system calls taking around 2500
    seconds, which is insane. I'm not sure whether this is specific to
    fio's I/O pattern and the way XFS handles it. When I used 64k
    block sizes, the fio tests completed just fine.

                            The worst hangs and load spikes happen
                              when we reboot one of the gluster servers,
                              but not when it's down - when it comes
                              back online. Even with gluster not showing
                              anything pending heal, my guess is it's
                              still trying to do lots of IO between the
                              4 nodes for some reason, but I don't
                              understand why.

    Do you kill all gluster processes (not just glusterd but also the
      brick processes) before issuing the reboot? This is necessary to
      prevent I/O stalls. There is stop-all-gluster-processes.sh, which
      should be available as part of the gluster installation (maybe
      in /usr/share/glusterfs/scripts/), that you can use. Can you
      check if this helps?

    Regards,
    Ravi

                        I've been banging my head on the wall with
                          this problem for months. Appreciate any
                          feedback here.

                        Thank you.

                        gluster volume info below

Volume Name: SNIP_data1
Type: Replicate
Volume ID: SNIP
Status: Started
Snapshot Count: 0
Number of Bricks: 1 x 4 = 4
Transport-type: tcp
Bricks:
Brick1: nexus2:/mnt/SNIP_block1/SNIP_data1
Brick2: forge:/mnt/SNIP_block1/SNIP_data1
Brick3: hive:/mnt/SNIP_block1/SNIP_data1
Brick4: citadel:/mnt/SNIP_block1/SNIP_data1
Options Reconfigured:
cluster.quorum-count: 1
cluster.quorum-type: fixed
network.ping-timeout: 5
network.remote-dio: enable
performance.rda-cache-limit: 256MB
performance.readdir-ahead: on
performance.parallel-readdir: on
network.inode-lru-limit: 500000
performance.md-cache-timeout: 600
performance.cache-invalidation: on
performance.stat-prefetch: on
features.cache-invalidation-timeout: 600
features.cache-invalidation: on
cluster.readdir-optimize: on
performance.io-thread-count: 32
server.event-threads: 4
client.event-threads: 4
performance.read-ahead: off
cluster.lookup-optimize: on
performance.cache-size: 1GB
cluster.self-heal-daemon: enable
transport.address-family: inet
nfs.disable: on
performance.client-io-threads: on
cluster.granular-entry-heal: enable
cluster.data-self-heal-algorithm: full

                          Sincerely,

                          Artem

                          --

                          Founder, Android
                            Police, APK Mirror, Illogical Robot
                            LLC
                        beerpla.net
                          | @ArtemR

        On Thu, Jul 23, 2020 at 12:08
          AM Qing Wang <qw@xxxxxxxxxxxxx>
          wrote:

                                  Hi, 

                                  I have one more question about Gluster
                                  linear scale-out performance, specifically
                                  for the "write-behind off" case. When
                                  "write-behind" is off, with the same stripe
                                  volumes and other settings as posted earlier
                                  in this thread, the storage I/O does not
                                  seem to depend on the number of storage
                                  nodes. In my experiment, whether I have 2
                                  brick server nodes or 8, the aggregate
                                  gluster I/O performance is ~100 MB/sec, and
                                  the fio benchmark gives the same result. If
                                  "write-behind" is on, the storage
                                  performance scales out linearly as the
                                  number of brick server nodes increases.

                                  Whether the write-behind option is on or
                                  off, I thought the gluster I/O performance
                                  should be pooled and aggregated as a whole.
                                  If that is the case, why do I get a constant
                                  gluster performance (~100 MB/sec) when
                                  "write-behind" is off? Please tell me if I
                                  have misunderstood something.

                                  Thanks,
                                  Qing 

            On Tue, Jul 21, 2020 at
              7:29 PM Qing Wang <qw@xxxxxxxxxxxxx>
              wrote:

                                      fio gives me the correct linear
                                      scale-out results, and you're right:
                                      the storage cache is the root cause
                                      that made the dd measurements
                                      inaccurate.

                                      Thanks,
                                      Qing 

                On Tue, Jul 21, 2020
                  at 2:53 PM Yaniv Kaul <ykaul@xxxxxxxxxx>
                  wrote:

                        On Tue, 21 Jul
                          2020, 21:43 Qing Wang <qw@xxxxxxxxxxxxx>
                          wrote:

                          Hi Yaniv,

                            Thanks for the quick response. I forgot
                              to mention that I am testing write
                              performance, not read. In this case,
                              would the client cache hit rate still be a
                              big issue?

                    It's not hitting the storage
                      directly. Since it's also single threaded, it may
                      also not saturate it. I highly recommend testing
                      properly. 
                    Y. 

                            I'll use fio to run my test once again,
                              thanks for the suggestion. 

                            Thanks,
                            Qing 

                            On Tue,
                              Jul 21, 2020 at 2:38 PM Yaniv Kaul <ykaul@xxxxxxxxxx>
                              wrote:

                                    On
                                      Tue, 21 Jul 2020, 21:30 Qing Wang
                                      <qw@xxxxxxxxxxxxx>
                                      wrote:

                                                          Hi, 

                                                          I am trying to test Gluster linear scale-out
                                                          performance by adding more storage
                                                          servers/bricks and measuring the storage I/O
                                                          performance. To vary the number of storage
                                                          servers, I create several "stripe" volumes
                                                          that contain 2 brick servers, 3 brick servers,
                                                          4 brick servers, and so on. On the gluster
                                                          client side, I used
                                                          "dd if=/dev/zero of=/mnt/glusterfs/dns_test_data_26g bs=1M count=26000"
                                                          to create 26G of data (or more). That data is
                                                          distributed to the corresponding gluster
                                                          servers (each has a gluster brick on it), and
                                                          "dd" returns the final I/O throughput. The
                                                          network is 40G InfiniBand, although I didn't
                                                          do any specific configuration to use advanced
                                                          features.

                                Your dd command is
                                  inaccurate, as it'll hit the client
                                  cache. It is also single threaded. I
                                  suggest switching to fio. 
                                Y. 
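
A minimal sketch of what a multi-job fio write test along these lines might look like, built as an argument list in Python. The job name, job count, queue depth, and sizes are illustrative assumptions, not values from this thread; /mnt/glusterfs is the mount point from the dd command above. Whether --direct=1 is honoured end to end on a FUSE mount depends on the volume options, so treat that flag as a request rather than a guarantee.

```python
import shlex
from typing import List


def fio_write_cmd(directory: str, numjobs: int = 8, bs: str = "1M",
                  size: str = "4G") -> List[str]:
    """Build a fio argument list for a multi-job sequential write test."""
    return [
        "fio",
        "--name=scaleout-write",      # hypothetical job name
        f"--directory={directory}",   # fio creates one file per job here
        "--rw=write",                 # sequential writes, like the dd run
        f"--bs={bs}",
        f"--size={size}",             # per-job file size (illustrative)
        f"--numjobs={numjobs}",       # several writers instead of one dd thread
        "--ioengine=libaio",
        "--iodepth=16",               # illustrative queue depth
        "--direct=1",                 # request O_DIRECT to avoid the client page cache
        "--group_reporting",          # report one aggregate result for all jobs
    ]


# Print the command so it can be reviewed before running it by hand.
print(shlex.join(fio_write_cmd("/mnt/glusterfs")))
```

Running several jobs with direct I/O addresses both points above: the client cache is taken out of the measurement, and the load is no longer limited to a single writer.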

                                                          What confuses me is that the storage I/O does
                                                          not seem to depend on the number of storage
                                                          nodes, but the Gluster documentation says it
                                                          should scale linearly. For example, when
                                                          "write-behind" is on and InfiniBand "jumbo
                                                          frame" (connected mode) is on, I get ~800
                                                          MB/sec reported by "dd" whether I have 2 brick
                                                          servers or 8. In the 2-server case, each
                                                          server gets ~400 MB/sec; in the 4-server case,
                                                          each server gets ~200 MB/sec. That is, the
                                                          per-server I/O does add up to the final
                                                          storage I/O (800 MB/sec), but this is not
                                                          "linear scale-out".
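
The arithmetic behind that observation can be sketched as follows. The 800 MB/sec aggregate is the dd figure quoted above; the function names are illustrative, and the "linear" column simply assumes each added server would keep contributing the 2-server per-server rate.

```python
def per_server_share(aggregate_mb_s: float, servers: int) -> float:
    """Per-server throughput when the aggregate stays fixed."""
    return aggregate_mb_s / servers


def linear_aggregate(per_server_mb_s: float, servers: int) -> float:
    """Aggregate throughput if every server contributed the same rate."""
    return per_server_mb_s * servers


observed_aggregate = 800.0  # MB/sec reported by dd in the message above
baseline = per_server_share(observed_aggregate, 2)  # per-server rate with 2 servers

for n in (2, 4, 8):
    share = per_server_share(observed_aggregate, n)
    ideal = linear_aggregate(baseline, n)
    print(f"{n} servers: ~{share:.0f} MB/sec each observed, "
          f"~{ideal:.0f} MB/sec total if scaling were linear")
```

A fixed aggregate that is merely divided among more servers suggests the limit sits outside the bricks, for example in the single client or the benchmark itself.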

                                                          Can somebody help me understand why this is
                                                          the case? I may well have misunderstood or
                                                          misconfigured something here. Please correct
                                                          me if so, thanks!

                                                          Best,
                                                          Qing

                                      ________

                                      Community Meeting Calendar:

                                      Schedule -

                                      Every 2nd and 4th Tuesday at 14:30
                                      IST / 09:00 UTC

                                      Bridge: https://bluejeans.com/441850968

                                      Gluster-users mailing list

                                      Gluster-users@xxxxxxxxxxx

                                      https://lists.gluster.org/mailman/listinfo/gluster-users
