Re: performance.cache-size for high-RAM clients/servers, other tweaks for performance, and improvements to Gluster docs

Artem Russakovskii <archon810@xxxxxxxxx> · Tue, 17 Apr 2018 23:22:37 -0700

Thanks for the link. Looking at the status of that doc, it isn't quite ready yet, and there's no mention of the option.
Does it mean that whatever is ready now in 4.0.1 is incomplete but can be enabled via granular-entry-heal=on, and when it is complete, it'll become the default and the flag will simply go away?
Is there any risk in enabling the option now in 4.0.1?

Sincerely,
Artem

--
Founder, Android Police, APK Mirror, Illogical Robot LLC
beerpla.net | +ArtemRussakovskii | @ArtemR

On Tue, Apr 17, 2018 at 11:16 PM, Ravishankar N <ravishankar@xxxxxxxxxx> wrote:

    On 04/18/2018 10:35 AM, Artem Russakovskii wrote:

      Hi Ravi,

        Could you please expand on how these would help?

        By forcing full here, we move the logic from the CPU to
          network, thus decreasing CPU utilization, is that right?

    Yes, 'diff' employs the rchecksum FOP, which computes a sha256
    checksum and can consume CPU. So yes, it is sort of shifting the load
    from the CPU to the network. But if your average file size is small,
    it would make sense to copy the entire file instead of computing
    checksums.
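
To make the trade-off concrete, here is a minimal Python sketch of the two strategies as described above. It is not Gluster's implementation: the 4-byte block size and all function names are hypothetical, chosen only to keep the example small. The "diff" path hashes every block on both sides and copies only the mismatches, while the "full" path skips hashing and copies everything.

```python
import hashlib

BLOCK_SIZE = 4  # hypothetical; tiny so the example is easy to follow


def block_digests(data, block_size=BLOCK_SIZE):
    """Return one sha256 hex digest per fixed-size block of data."""
    return [
        hashlib.sha256(data[i:i + block_size]).hexdigest()
        for i in range(0, len(data), block_size)
    ]


def blocks_to_copy_diff(source, sink, block_size=BLOCK_SIZE):
    """'diff'-style: indices of source blocks whose checksum differs on the sink."""
    src = block_digests(source, block_size)
    dst = block_digests(sink, block_size)
    return [i for i, digest in enumerate(src) if i >= len(dst) or dst[i] != digest]


def blocks_to_copy_full(source, block_size=BLOCK_SIZE):
    """'full'-style: every block is copied and no checksums are computed."""
    return list(range((len(source) + block_size - 1) // block_size))


source = b"AAAABBBBCCCCDDDD"
sink = b"AAAAxxxxCCCCDDDD"  # only the second block differs
diff_blocks = blocks_to_copy_diff(source, sink)
full_blocks = blocks_to_copy_full(source)
print(diff_blocks, full_blocks)
```

With small files the hashing work in the "diff" path can cost more than simply sending the whole file, which is the reasoning behind preferring "full" in that case.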

         This is assuming the CPU and disk utilization are caused
          by the differ and not by lstat and other calls or something.

          Option:
            cluster.data-self-heal-algorithm

            Default Value: (null)

            Description: Select between "full", "diff". The "full"
            algorithm copies the entire file from source to sink. The
            "diff" algorithm copies to sink only those blocks whose
            checksums don't match with those of source. If no option is
            configured the option is chosen dynamically as follows: If
            the file does not exist on one of the sinks or empty file
            exists or if the source file size is about the same as page
            size the entire file will be read and written i.e "full"
            algo, otherwise "diff" algo is chosen.

        I really have no idea what this means and how/why it would
          help. Any more info on this option?

https://github.com/gluster/glusterfs-specs/blob/master/done/GlusterFS%203.8/granular-entry-self-healing.md
    should help.

    Regards,

    Ravi

          Option:
            cluster.granular-entry-heal

            Default Value: no

            Description: If this option is enabled, self-heal will
            resort to granular way of recording changelogs and doing
            entry self-heal.

        Thank you.

                              Sincerely,

                              Artem


        On Tue, Apr 17, 2018 at 9:58 PM, Ravishankar N <ravishankar@xxxxxxxxxx> wrote:

                  On 04/18/2018 10:14 AM, Artem Russakovskii wrote:

                    Following up here on a related issue that is
                      very serious for us.

                      I took down one of the 4 replicate gluster
                        servers for maintenance today. There are 2
                        gluster volumes totaling about 600GB. Not that
                        much data. After the server comes back online,
                        it starts auto healing and pretty much all
                        operations on gluster freeze for many minutes.

                      For example, I was trying to run an ls -alrt
                        in a folder with 7300 files, and it took a good
                        15-20 minutes before returning.

                      During this time, I can see iostat show 100%
                        utilization on the brick, heal status takes many
                        minutes to return, glusterfsd uses up tons of
                        CPU (I saw it spike to 600%). gluster already
                        has massive performance issues for me, but
                        healing after a 4-hour downtime is on another
                        level of bad perf.

                      For example, this command took many minutes
                        to run:

                        gluster volume heal androidpolice_data3 info summary
                        Brick nexus2:/mnt/nexus2_block4/androidpolice_data3
                        Status: Connected
                        Total Number of entries: 91
                        Number of entries in heal pending: 90
                        Number of entries in split-brain: 0
                        Number of entries possibly healing: 1

                        Brick forge:/mnt/forge_block4/androidpolice_data3
                        Status: Connected
                        Total Number of entries: 87
                        Number of entries in heal pending: 86
                        Number of entries in split-brain: 0
                        Number of entries possibly healing: 1

                        Brick hive:/mnt/hive_block4/androidpolice_data3
                        Status: Connected
                        Total Number of entries: 87
                        Number of entries in heal pending: 86
                        Number of entries in split-brain: 0
                        Number of entries possibly healing: 1

                        Brick citadel:/mnt/citadel_block4/androidpolice_data3
                        Status: Connected
                        Total Number of entries: 0
                        Number of entries in heal pending: 0
                        Number of entries in split-brain: 0
                        Number of entries possibly healing: 0
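
When this command is slow, it can help to capture the output once and total the counters offline instead of re-running it. The following is a minimal Python sketch that parses text in the format shown above. The function name and the abbreviated two-brick sample are illustrative assumptions, and the parser relies only on the "Brick ..." and "key: value" line shapes seen in this output.

```python
SAMPLE = """\
Brick nexus2:/mnt/nexus2_block4/androidpolice_data3
Status: Connected
Total Number of entries: 91
Number of entries in heal pending: 90
Number of entries in split-brain: 0
Number of entries possibly healing: 1

Brick citadel:/mnt/citadel_block4/androidpolice_data3
Status: Connected
Total Number of entries: 0
Number of entries in heal pending: 0
Number of entries in split-brain: 0
Number of entries possibly healing: 0
"""


def parse_heal_summary(text):
    """Map each brick to a dict of its 'key: value' lines; numeric values become ints."""
    bricks = {}
    current = None
    for raw in text.splitlines():
        line = raw.strip()
        if line.startswith("Brick "):
            current = line[len("Brick "):]
            bricks[current] = {}
        elif current is not None and ": " in line:
            key, value = line.split(": ", 1)
            bricks[current][key] = int(value) if value.isdigit() else value
    return bricks


summary = parse_heal_summary(SAMPLE)
pending = sum(b["Number of entries in heal pending"] for b in summary.values())
print(pending)
```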

                      Statistics showed a diminishing number of
                        failed heals:
                      ...

                        Ending time of crawl: Tue Apr 17 21:13:08 2018

                        Type of crawl: INDEX
                        No. of entries healed: 2
                        No. of entries in split-brain: 0
                        No. of heal failed entries: 102

                        Starting time of crawl: Tue Apr 17 21:13:09 2018

                        Ending time of crawl: Tue Apr 17 21:14:30 2018

                        Type of crawl: INDEX
                        No. of entries healed: 4
                        No. of entries in split-brain: 0
                        No. of heal failed entries: 91

                        Starting time of crawl: Tue Apr 17 21:14:31 2018

                        Ending time of crawl: Tue Apr 17 21:15:34 2018

                        Type of crawl: INDEX
                        No. of entries healed: 0
                        No. of entries in split-brain: 0
                        No. of heal failed entries: 88

                      ...

                      Eventually, everything heals and goes back to
                        at least where the roof isn't on fire anymore.

                      The server stats and volume options were
                        given in one of the previous replies to this
                        thread.

                      Any ideas or things I could run and show the
                        output of to help diagnose? I'm also very open
                        to working with someone on the team on a live
                        debugging session if there's interest.

              It is likely that self-heal is causing the CPU spike due
              to the flood of lookups, locks, and checksum FOPs that the
              self-heal-daemon sends to the bricks.

              There's a script to control shd's cpu usage using cgroups.
              That should help in regulating self-heal traffic: https://review.gluster.org/#/c/18404/
              (see extras/control-cpu-load.sh)

              Other self-heal related volume options that you could
              change are setting 'cluster.data-self-heal-algorithm'
              to 'full' and 'granular-entry-heal' to 'enable'.  `gluster
              volume set help` should give you more information about
              these options.
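
As a rough illustration of the two option changes suggested above, the Python sketch below only builds and prints the `gluster volume set` command lines and executes nothing. The helper name is hypothetical, and the volume name is taken from earlier in this thread. One assumption to verify before running anything: some releases may expect granular entry heal to be toggled through `gluster volume heal <VOLNAME> granular-entry-heal enable` rather than through `volume set`, so check the CLI usage for your version.

```python
import shlex


def heal_tuning_commands(volume):
    """Return `gluster volume set` argv lists for the two suggested options."""
    options = {
        "cluster.data-self-heal-algorithm": "full",
        "cluster.granular-entry-heal": "enable",
    }
    return [
        ["gluster", "volume", "set", volume, name, value]
        for name, value in options.items()
    ]


commands = heal_tuning_commands("androidpolice_data3")
for argv in commands:
    # Dry run: print the command line only; nothing is executed.
    print(shlex.join(argv))
```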

              Thanks,

              Ravi

                      Thank you.

                                            Sincerely,

                                            Artem


                      On Tue, Apr 10, 2018 at 9:56 AM, Artem Russakovskii <archon810@xxxxxxxxx> wrote:

                          Hi Vlad,

                            I actually saw that post already and
                              even asked a question 4 days ago (https://serverfault.com/questions/517775/glusterfs-direct-i-o-mode#comment1172497_540917).
                              The accepted answer also seems to go
                              against your suggestion to enable
                              direct-io-mode as it says it should be
                              disabled for better performance when used
                              just for file accesses.

                            It'd be great if someone from the
                              Gluster team chimed in about this thread.

                                                    Sincerely,

                                                    Artem


                                On Tue, Apr 10, 2018 at 7:01 AM, Vlad Kopylov <vladkopy@xxxxxxxxx> wrote:

                                        I wish I knew, or could get, a
                                        detailed description of those
                                        options myself.

                                        Here is a discussion of direct-io-mode: https://serverfault.com/questions/517775/glusterfs-direct-i-o-mode

                                        Like you, I ran tests on a large
                                        volume of files and found that the
                                        main delays are in attribute calls,
                                        which is how I ended up with those
                                        mount options to improve
                                        performance.

                                        I discovered those options mostly
                                        by googling this user list, where
                                        people share their tests.

                                        I'm not sure I share your optimism.
                                        Rather than going up, I downgraded
                                        to 3.12 and have no dir view issue
                                        now, though I had to recreate the
                                        cluster and re-add bricks with
                                        existing data.

                                          On Tue, Apr 10, 2018 at 1:47 AM, Artem Russakovskii <archon810@xxxxxxxxx> wrote:

                                              Hi Vlad,

                                                I'm using only localhost: mounts.

                                                Can you please explain what effect each option has on
                                                performance issues shown in my posts?
                                                "negative-timeout=10,attribute-timeout=30,fopen-keep-cache,direct-io-mode=enable,fetch-attempts=5"
                                                From what I remember, direct-io-mode=enable didn't make a
                                                difference in my tests, but I suppose I can try again. The
                                                explanations about direct-io-mode are quite confusing on the
                                                web in various guides, saying enabling it could make
                                                performance worse in some situations and better in others
                                                due to OS file cache.

                                                There are also these gluster volume settings, adding to the
                                                confusion:

                                                  Option: performance.strict-o-direct
                                                  Default Value: off
                                                  Description: This option when set to off, ignores the
                                                  O_DIRECT flag.

                                                  Option: performance.nfs.strict-o-direct
                                                  Default Value: off
                                                  Description: This option when set to off, ignores the
                                                  O_DIRECT flag.

                                                Re: 4.0. I moved to 4.0 after finding out that it fixes the
                                                disappearing dirs bug related to cluster.readdir-optimize if
                                                you remember (http://lists.gluster.org/pipermail/gluster-users/2018-April/033830.html).
                                                I was already on 3.13 by then, and 4.0 resolved the issue.
                                                It's been stable for me so far, thankfully.

                                                          Sincerely,

                                                          Artem


                                                      On
                                                        Mon, Apr 9, 2018
                                                        at 10:38 PM,
                                                        Vlad Kopylov <vladkopy@xxxxxxxxx>
                                                        wrote:

                                                          You definitely need to add mount options to /etc/fstab.

                                                          Use the ones from here: http://lists.gluster.org/pipermail/gluster-users/2018-April/033811.html

                                                          I also ended up using local mounts to get better
                                                          performance.

                                                          Also, the 3.12 or 3.10 branches would be preferable for
                                                          production.

                                                          On Fri, Apr 6, 2018 at 4:12 AM, Artem Russakovskii <archon810@xxxxxxxxx> wrote:

                                                          Hi again,

                                                          I'd like to expand on the performance issues and plead for
                                                          help. Here's one case which shows these odd hiccups:
                                                          https://i.imgur.com/CXBPjTK.gifv.

                                                          In this GIF where I switch back and forth between copy
                                                          operations on 2 servers, I'm copying a 10GB dir full of
                                                          .apk and image files.

                                                          On server "hive" I'm copying straight from the main disk to
                                                          an attached volume block (xfs). As you can see, the
                                                          transfers are relatively speedy and don't hiccup.
                                                          On server "citadel" I'm copying the same set of data to a
                                                          4-replicate gluster which uses block storage as a brick. As
                                                          you can see, performance is much worse, and there are
                                                          frequent pauses for many seconds where nothing seems to be
                                                          happening - just freezes.

                                                          All 4 servers have the same specs, and all of them have
                                                          performance issues with gluster and no such issues when raw
                                                          xfs block storage is used.

                                                          hive has long finished copying the data, while citadel is
                                                          barely chugging along and is expected to take probably half
                                                          an hour to an hour. I have over 1TB of data to migrate, at
                                                          which point if we went live, I'm not even sure gluster would
                                                          be able to keep up instead of bringing the machines and
                                                          services down.

                                                          Here's the cluster config, though it didn't seem to make any
                                                          difference performance-wise before I applied the
                                                          customizations vs after.

                                                          Volume
                                                          Name:
                                                          apkmirror_data1
                                                          Type:
                                                          Replicate
                                                          Volume
                                                          ID:
                                                          11ecee7e-d4f8-497a-9994-ceb144d6841e
                                                          Status:
                                                          Started
                                                          Snapshot
                                                          Count: 0
                                                          Number of
                                                          Bricks: 1 x 4
                                                          = 4
                                                          Transport-type:
                                                          tcp
                                                          Bricks:
                                                          Brick1:
                                                          nexus2:/mnt/nexus2_block1/apkmirror_data1
                                                          Brick2:
                                                          forge:/mnt/forge_block1/apkmirror_data1
                                                          Brick3:
                                                          hive:/mnt/hive_block1/apkmirror_data1
                                                          Brick4:
                                                          citadel:/mnt/citadel_block1/apkmirror_data1
                                                          Options
                                                          Reconfigured:
                                                          cluster.quorum-count:
                                                          1
                                                          cluster.quorum-type:
                                                          fixed
                                                          network.ping-timeout:
                                                          5
                                                          network.remote-dio:
                                                          enable
                                                          performance.rda-cache-limit:
                                                          256MB
                                                          performance.readdir-ahead:
                                                          on
                                                          performance.parallel-readdir:
                                                          on
                                                          network.inode-lru-limit:
                                                          500000
                                                          performance.md-cache-timeout: 600
                                                          performance.cache-invalidation: on
                                                          performance.stat-prefetch: on
                                                          features.cache-invalidation-timeout: 600
                                                          features.cache-invalidation: on
                                                          cluster.readdir-optimize: on
                                                          performance.io-thread-count: 32
                                                          server.event-threads: 4
                                                          client.event-threads: 4
                                                          performance.read-ahead: off
                                                          cluster.lookup-optimize: on
                                                          performance.cache-size: 1GB
                                                          cluster.self-heal-daemon: enable
                                                          transport.address-family: inet
                                                          nfs.disable: on
                                                          performance.client-io-threads: on
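
Options like the ones listed above are set one at a time with `gluster volume set <VOLNAME> <KEY> <VALUE>`. A minimal Python sketch that only builds those command lines for a subset of the list, without running anything. The volume name `apkmirror_data1` is taken from the fstab entry in this message; the helper name `build_set_commands` is hypothetical and used here for illustration only.

```python
import shlex

# A subset of the options quoted above, copied as written (key -> value).
OPTIONS = {
    "performance.md-cache-timeout": "600",
    "performance.cache-invalidation": "on",
    "performance.stat-prefetch": "on",
    "features.cache-invalidation-timeout": "600",
    "features.cache-invalidation": "on",
    "performance.cache-size": "1GB",
}


def build_set_commands(volume, options):
    """Return one `gluster volume set` command line per option.

    Hypothetical helper: it only formats strings and never calls gluster.
    """
    return [
        shlex.join(["gluster", "volume", "set", volume, key, value])
        for key, value in options.items()
    ]


if __name__ == "__main__":
    for command in build_set_commands("apkmirror_data1", OPTIONS):
        print(command)
```

Printing the commands first, rather than executing them, makes it easy to review a batch of changes before applying them to a production volume.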

                                                          The mounts are done as follows in /etc/fstab:

                                                          /dev/disk/by-id/scsi-0Linode_Volume_citadel_block1 /mnt/citadel_block1 xfs defaults 0 2
                                                          localhost:/apkmirror_data1 /mnt/apkmirror_data1 glusterfs defaults,_netdev 0 0

                                                          I'm really not sure if direct-io-mode mount tweaks would do
                                                          anything here, what the value should be set to, and what it
                                                          is by default.

                                                          The OS is OpenSUSE 42.3, 64-bit. 80GB of RAM, 20 CPUs,
                                                          hosted by Linode.

                                                          I'd really appreciate any help in the matter.

                                                          Thank you.

                                                          Sincerely,
                                                          Artem

                                                          --
                                                          Founder, Android Police, APK Mirror, Illogical Robot LLC
                                                          beerpla.net | +ArtemRussakovskii | @ArtemR

                                                          On
                                                          Thu, Apr 5,
                                                          2018 at 11:13
                                                          PM, Artem
                                                          Russakovskii <archon810@xxxxxxxxx>
                                                          wrote:

                                                          Hi,

                                                          I'm trying to squeeze performance out of Gluster on four
                                                          80GB-RAM, 20-CPU machines where Gluster runs on attached
                                                          block storage (Linode) with 4 replicate bricks, and so far
                                                          everything I've tried results in sub-optimal performance.

                                                          There are many files (mostly images, several million), and
                                                          many operations take minutes. Copying multiple files, even
                                                          small ones, suddenly freezes up for seconds at a time and
                                                          then continues; iostat frequently shows large r_await and
                                                          w_await values with 100% utilization for the attached block
                                                          device, etc.
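
The iostat symptom described above can be checked mechanically. A rough Python sketch that scans `iostat -x`-style text for devices with high `r_await`/`w_await` or high `%util`. The sample text is made up and trimmed to a few columns (real output has more columns, and their order varies between sysstat versions, which is why the columns are looked up by header name); the thresholds and the function name `slow_devices` are arbitrary assumptions.

```python
# Hypothetical, trimmed sample in the style of `iostat -x` output.
SAMPLE = """\
Device:  r/s    w/s    r_await  w_await  %util
sda      1.0    2.0    0.50     1.20     3.00
sdc      250.0  180.0  95.40    210.75   100.00
"""


def slow_devices(text, await_ms=50.0, util_pct=90.0):
    """Return (device, r_await, w_await, util) for rows over either threshold."""
    rows = [line.split() for line in text.splitlines() if line.strip()]
    # Locate the header row; older sysstat prints "Device:", newer "Device".
    header_index = next(
        i for i, cols in enumerate(rows) if cols[0].rstrip(":") == "Device"
    )
    header = rows[header_index]
    column = {name: i for i, name in enumerate(header)}

    flagged = []
    for cols in rows[header_index + 1:]:
        if len(cols) != len(header):
            continue  # skip lines that are not device rows
        r_await = float(cols[column["r_await"]])
        w_await = float(cols[column["w_await"]])
        util = float(cols[column["%util"]])
        if util >= util_pct or max(r_await, w_await) >= await_ms:
            flagged.append((cols[0], r_await, w_await, util))
    return flagged


if __name__ == "__main__":
    for device, r_await, w_await, util in slow_devices(SAMPLE):
        print(f"{device}: r_await={r_await} w_await={w_await} util={util}%")
```

A device that is flagged on every sample points at the block storage itself as the bottleneck rather than at Gluster's caching options.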

                                                          But anyway, there are many guides out there for small-file
                                                          performance improvements, but more explanation is needed,
                                                          and I think more tweaks should be possible.

                                                          My question today is about performance.cache-size. Is this
                                                          the size of a cache in RAM? If so, how do I view the current
                                                          cache size to see if it gets full and I should increase its
                                                          size? Is it advisable to bump it up if I have many tens of
                                                          gigs of RAM free?
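
One way to put a configured performance.cache-size next to the memory actually available is sketched below in Python. It does not show how full the cache is; it only converts a size string such as the `1GB` set above into bytes and compares it with the `MemAvailable` line of `/proc/meminfo`. Treating `GB` as 1024^3 bytes is an assumption, the sample meminfo text is made up, and the function names are hypothetical. The configured value itself can be read with `gluster volume get <VOLNAME> performance.cache-size`.

```python
import re

# Assumption: size suffixes are interpreted as binary multiples.
UNITS = {"": 1, "B": 1, "KB": 1024, "MB": 1024**2, "GB": 1024**3, "TB": 1024**4}

# Hypothetical excerpt of /proc/meminfo (values are in kB).
SAMPLE_MEMINFO = """\
MemTotal:       82575360 kB
MemFree:        41943040 kB
MemAvailable:   62914560 kB
"""


def parse_size(value):
    """Convert a string such as '1GB' or '512MB' to a number of bytes."""
    match = re.fullmatch(r"\s*(\d+(?:\.\d+)?)\s*([KMGT]?B?)\s*", value, re.IGNORECASE)
    if match is None:
        raise ValueError(f"unrecognised size: {value!r}")
    number, unit = match.groups()
    unit = unit.upper()
    if unit in ("K", "M", "G", "T"):
        unit += "B"  # accept '1G' as well as '1GB'
    return int(float(number) * UNITS[unit])


def mem_available_bytes(meminfo_text):
    """Return the MemAvailable value from /proc/meminfo-style text, in bytes."""
    for line in meminfo_text.splitlines():
        if line.startswith("MemAvailable:"):
            return int(line.split()[1]) * 1024
    raise ValueError("MemAvailable not found")


def cache_share(cache_size, meminfo_text):
    """Fraction of available memory that the configured cache size represents."""
    return parse_size(cache_size) / mem_available_bytes(meminfo_text)


if __name__ == "__main__":
    share = cache_share("1GB", SAMPLE_MEMINFO)
    print(f"configured cache is {share:.1%} of available memory")
```

On a real host the sample text would be replaced by the contents of `/proc/meminfo`; a very small share suggests there is headroom to raise the setting, though whether that helps depends on the workload.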

                                                          More generally, in the last 2 months since I first started
                                                          working with gluster and set a production system live, I've
                                                          been feeling frustrated because Gluster has a lot of
                                                          poorly-documented and confusing options. I really wish
                                                          documentation could be improved with examples and better
                                                          explanations.

                                                          Specifically, it'd be absolutely amazing if the docs offered
                                                          a strategy for setting each value and ways of determining
                                                          more optimal values. For example, for performance.cache-size,
                                                          if it said something like "run command abc to see your
                                                          current cache size, and if it's hurting, up it, but be aware
                                                          that it's limited by RAM," it'd be already a huge improvement
                                                          to the docs. And so on with other options.

                                                          The gluster team is quite helpful on this mailing list, but
                                                          in a reactive rather than proactive way. Perhaps it's tunnel
                                                          vision once you've worked on a project for so long, where
                                                          less technical explanations and even proper documentation of
                                                          options take a back seat, but I encourage you to be more
                                                          proactive about helping us understand and optimize Gluster.

                                                          Thank you.

                                                          Sincerely,
                                                          Artem

                                                          --
                                                          Founder, Android Police, APK Mirror, Illogical Robot LLC
                                                          beerpla.net | +ArtemRussakovskii | @ArtemR

                                                          _______________________________________________
                                                          Gluster-users mailing list
                                                          Gluster-users@xxxxxxxxxxx
                                                          http://lists.gluster.org/mailman/listinfo/gluster-users
