Re: performance.cache-size for high-RAM clients/servers, other tweaks for performance, and improvements to Gluster docs

Ravishankar N <ravishankar@xxxxxxxxxx> · Wed, 18 Apr 2018 10:28:12 +0530



    On 04/18/2018 10:14 AM, Artem Russakovskii wrote:

    
      Following up here on a related issue that is very serious for us.
        

        I took down one of the 4 replicate gluster servers for
          maintenance today. There are 2 gluster volumes totaling about
          600GB, which is not that much data. After the server came back
          online, it started auto-healing, and pretty much all operations
          on gluster froze for many minutes.
        

        For example, I was trying to run an ls -alrt in a folder
          with 7300 files, and it took a good 15-20 minutes before
          returning.
        

        During this time, I can see iostat show 100% utilization on
          the brick, heal status takes many minutes to return,
          glusterfsd uses up tons of CPU (I saw it spike to 600%).
          gluster already has massive performance issues for me, but
          healing after a 4-hour downtime is on another level of bad
          perf.
        

        For example, this command took many minutes to run:
        

          gluster volume heal androidpolice_data3 info summary
          Brick nexus2:/mnt/nexus2_block4/androidpolice_data3
          Status: Connected
          Total Number of entries: 91
          Number of entries in heal pending: 90
          Number of entries in split-brain: 0
          Number of entries possibly healing: 1
          

          Brick forge:/mnt/forge_block4/androidpolice_data3
          Status: Connected
          Total Number of entries: 87
          Number of entries in heal pending: 86
          Number of entries in split-brain: 0
          Number of entries possibly healing: 1
          

          Brick hive:/mnt/hive_block4/androidpolice_data3
          Status: Connected
          Total Number of entries: 87
          Number of entries in heal pending: 86
          Number of entries in split-brain: 0
          Number of entries possibly healing: 1
          

          Brick citadel:/mnt/citadel_block4/androidpolice_data3
          Status: Connected
          Total Number of entries: 0
          Number of entries in heal pending: 0
          Number of entries in split-brain: 0
          Number of entries possibly healing: 0
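
To follow the backlog without reading each brick's block, the per-brick counts in output like the above could be totalled with a small filter. This is only a sketch: the function name sum_heal_pending is made up for illustration, and it assumes the summary keeps the "Number of entries in heal pending: N" line format shown here.

```shell
#!/bin/sh
# Hypothetical helper: sum the "heal pending" counts across all bricks.
# Reads `gluster volume heal <VOLNAME> info summary` output on stdin.
sum_heal_pending() {
    awk -F': *' '
        /Number of entries in heal pending/ { total += $2 }
        END { print total + 0 }   # "+ 0" prints 0 when nothing matched
    '
}

# Example (run on a server that is part of the cluster):
#   gluster volume heal androidpolice_data3 info summary | sum_heal_pending
```

For the four bricks listed above (90, 86, 86 and 0 pending) this would print 262.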
        
        
        Statistics showed a diminishing number of failed heals:
        ...
        
          Ending time of crawl: Tue Apr 17 21:13:08 2018
          

          Type of crawl: INDEX
          No. of entries healed: 2
          No. of entries in split-brain: 0
          No. of heal failed entries: 102
          

          Starting time of crawl: Tue Apr 17 21:13:09 2018
          

          Ending time of crawl: Tue Apr 17 21:14:30 2018
          

          Type of crawl: INDEX
          No. of entries healed: 4
          No. of entries in split-brain: 0
          No. of heal failed entries: 91
          

          Starting time of crawl: Tue Apr 17 21:14:31 2018
          

          Ending time of crawl: Tue Apr 17 21:15:34 2018
          

          Type of crawl: INDEX
          No. of entries healed: 0
          No. of entries in split-brain: 0
          No. of heal failed entries: 88
        
        ...
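
The downward trend in failed heals is easier to see with one line per crawl. A rough sketch, assuming the statistics output keeps the "Ending time of crawl:" and "No. of heal failed entries:" labels shown above; the function name failed_heal_trend is hypothetical.

```shell
#!/bin/sh
# Hypothetical helper: print "crawl end time <TAB> failed count" per crawl.
# Reads `gluster volume heal <VOLNAME> statistics` output on stdin.
failed_heal_trend() {
    awk '
        /Ending time of crawl:/ {
            sub(/^.*Ending time of crawl:[ \t]*/, "")   # drop the label
            sub(/[ \t]+$/, "")                          # drop trailing blanks
            crawl_end = $0
            next
        }
        /No\. of heal failed entries:/ {
            sub(/^.*No\. of heal failed entries:[ \t]*/, "")
            sub(/[ \t]+$/, "")
            print crawl_end "\t" $0
        }
    '
}

# Example:
#   gluster volume heal androidpolice_data3 statistics | failed_heal_trend
```

For the three crawls quoted above, the failed counts printed would be 102, 91 and 88.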
        

        Eventually, everything heals and goes back to at least
          where the roof isn't on fire anymore.
        

        The server stats and volume options were given in one of
          the previous replies to this thread.
        

        Any ideas or things I could run and show the output of to
          help diagnose? I'm also very open to working with someone on
          the team on a live debugging session if there's interest.
      
    
    It is likely that self-heal is causing the CPU spike due to the
    flood of lookups, locks, and checksum fops that the self-heal
    daemon (shd) sends to the bricks.

    There's a script to control shd's CPU usage using cgroups. That
    should help in regulating self-heal traffic:
    https://review.gluster.org/#/c/18404/ (see
    extras/control-cpu-load.sh)

    Other self-heal related volume options that you could change are
    setting 'cluster.data-self-heal-algorithm' to 'full' and
    'granular-entry-heal' to 'enable'.  `gluster volume set help` should
    give you more information about these options.
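
As a rough illustration of the two option changes mentioned above, the sketch below only prints the commands so they can be reviewed before being run. The function name heal_tuning_commands is made up, and the fully qualified name cluster.granular-entry-heal is an assumption; some releases may expect `gluster volume heal <VOLNAME> granular-entry-heal enable` instead, so check `gluster volume set help` on the installed version first.

```shell
#!/bin/sh
# Hypothetical helper: print (do not run) the commands for the two options.
heal_tuning_commands() {
    vol=${1:-}
    if [ -z "$vol" ]; then
        echo "usage: heal_tuning_commands VOLNAME" >&2
        return 1
    fi
    # "full" copies whole files during data heal rather than comparing
    # checksums block by block, trading network traffic for less CPU.
    echo "gluster volume set $vol cluster.data-self-heal-algorithm full"
    # Assumption: the option's full name is cluster.granular-entry-heal.
    echo "gluster volume set $vol cluster.granular-entry-heal enable"
}

# Example: review the output, then pipe it to sh if it looks right.
#   heal_tuning_commands androidpolice_data3
```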

    Thanks,

    Ravi

    
        Thank you.
      
      
        Sincerely,
        Artem

        --
        Founder, Android Police, APK Mirror, Illogical Robot LLC
        beerpla.net | +ArtemRussakovskii | @ArtemR

                            
        On Tue, Apr 10, 2018 at 9:56 AM, Artem Russakovskii <archon810@xxxxxxxxx> wrote:

          
            Hi Vlad,
              

              I actually saw that post already and even asked a
                question 4 days ago (https://serverfault.com/questions/517775/glusterfs-direct-i-o-mode#comment1172497_540917).
                The accepted answer also seems to go against your
                suggestion to enable direct-io-mode as it says it should
                be disabled for better performance when used just for
                file accesses.
              

              It'd be great if someone from the Gluster team chimed
                in on this thread.
            
            
                                      Sincerely,

                                      Artem

                                      

                                    
                  On Tue, Apr 10, 2018 at 7:01 AM, Vlad Kopylov <vladkopy@xxxxxxxxx> wrote:

                    
                          I wish I knew, or could get a detailed
                          description of those options myself.

                          Here is a discussion of direct-io-mode:
                          https://serverfault.com/questions/517775/glusterfs-direct-i-o-mode

                          Like you, I ran tests on a large volume of
                          files and found that the main delays are in
                          attribute calls, so I ended up with those mount
                          options to improve performance.

                          I discovered those options basically by
                          googling this user list, where people share
                          their tests.

                          I'm not sure I share your optimism: rather
                          than upgrading, I downgraded to 3.12 and have no
                          dir view issue now. I did have to recreate the
                          cluster and re-add bricks with existing data,
                          though.

                      
                            On Tue, Apr 10, 2018 at 1:47 AM, Artem Russakovskii <archon810@xxxxxxxxx> wrote:

                              
                                Hi Vlad,
                                  

                                  I'm using only localhost: mounts.
                                  

                                  Can you please explain what
                                    effect each option has on
                                    performance issues shown in my
                                    posts?
                                    "negative-timeout=10,attribute-timeout=30,fopen-keep-cache,direct-io-mode=enable,fetch-attempts=5"
                                    From what I remember,
                                    direct-io-mode=enable didn't make a
                                    difference in my tests, but I
                                    suppose I can try again. The
                                    explanations about direct-io-mode
                                    are quite confusing on the web in
                                    various guides, saying enabling it
                                    could make performance worse in some
                                    situations and better in others due
                                    to OS file cache.
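
For reference, the quoted options would be combined with the fstab entry quoted further down the thread roughly as follows. This only illustrates the syntax; whether any of these options helps is exactly the open question here. The helper name gluster_fstab_line is made up, and a localhost: mount is assumed because that is what this setup uses.

```shell
#!/bin/sh
# Hypothetical helper: build an /etc/fstab line for a glusterfs FUSE mount.
gluster_fstab_line() {
    volume=$1
    mountpoint=$2
    extra_opts=${3:-}              # optional comma-separated mount options
    opts="defaults,_netdev"
    if [ -n "$extra_opts" ]; then
        opts="$opts,$extra_opts"
    fi
    printf 'localhost:/%s %s glusterfs %s 0 0\n' "$volume" "$mountpoint" "$opts"
}

# Prints the line to paste into /etc/fstab; nothing is mounted or modified.
gluster_fstab_line apkmirror_data1 /mnt/apkmirror_data1 \
    "negative-timeout=10,attribute-timeout=30,fopen-keep-cache,direct-io-mode=enable,fetch-attempts=5"
```

Generating the line rather than editing /etc/fstab directly makes it easy to compare variants, for example with direct-io-mode=disable, before committing to one.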
                                  

                                  There are also these gluster
                                    volume settings, adding to the
                                    confusion:
                                  
                                    Option: performance.strict-o-direct
                                    Default Value: off
                                    Description: This option when set to off, ignores the O_DIRECT flag.

                                    Option: performance.nfs.strict-o-direct
                                    Default Value: off
                                    Description: This option when set to off, ignores the O_DIRECT flag.
                                  
                                  
                                  Re: 4.0. I moved to 4.0 after
                                    finding out that it fixes the
                                    disappearing dirs bug related to
                                    cluster.readdir-optimize if you
                                    remember (http://lists.gluster.org/pipermail/gluster-users/2018-April/033830.html).
                                    I was already on 3.13 by then, and
                                    4.0 resolved the issue. It's been
                                    stable for me so far, thankfully.
                                  

                                                          Sincerely,

                                                          Artem

                                                          

                                                          
                                        On Mon, Apr 9, 2018 at 10:38 PM, Vlad Kopylov <vladkopy@xxxxxxxxx> wrote:

                                          
                                                You definitely need to add mount options to /etc/fstab.

                                                Use the ones from here: http://lists.gluster.org/pipermail/gluster-users/2018-April/033811.html

                                                I also went with local mounts to improve performance.

                                                Also, the 3.12 or 3.10 branches would be preferable for production.
                                              

                                                  On Fri, Apr 6, 2018 at 4:12 AM, Artem Russakovskii <archon810@xxxxxxxxx> wrote:

                                                  
                                                          Hi again,

                                                          I'd like to expand on the performance issues and plead for help.
                                                          Here's one case which shows these odd hiccups:
                                                          https://i.imgur.com/CXBPjTK.gifv.

                                                          In this GIF where I switch back and forth between copy operations
                                                          on 2 servers, I'm copying a 10GB dir full of .apk and image files.

                                                          On server "hive" I'm copying straight from the main disk to an
                                                          attached volume block (xfs). As you can see, the transfers are
                                                          relatively speedy and don't hiccup.
                                                          On server "citadel" I'm copying the same set of data to a
                                                          4-replicate gluster which uses block storage as a brick. As you
                                                          can see, performance is much worse, and there are frequent pauses
                                                          for many seconds where nothing seems to be happening - just freezes.

                                                          All 4 servers have the same specs, and all of them have performance
                                                          issues with gluster and no such issues when raw xfs block storage
                                                          is used.

                                                          hive has long finished copying the data, while citadel is barely
                                                          chugging along and is expected to take probably half an hour to an
                                                          hour. I have over 1TB of data to migrate, at which point if we went
                                                          live, I'm not even sure gluster would be able to keep up instead of
                                                          bringing the machines and services down.

                                                          Here's the cluster config, though it didn't seem to make any
                                                          difference performance-wise before I applied the customizations
                                                          vs after.
                                                          

                                                          Volume Name: apkmirror_data1
                                                          Type: Replicate
                                                          Volume ID: 11ecee7e-d4f8-497a-9994-ceb144d6841e
                                                          Status: Started
                                                          Snapshot Count: 0
                                                          Number of Bricks: 1 x 4 = 4
                                                          Transport-type: tcp
                                                          Bricks:
                                                          Brick1: nexus2:/mnt/nexus2_block1/apkmirror_data1
                                                          Brick2: forge:/mnt/forge_block1/apkmirror_data1
                                                          Brick3: hive:/mnt/hive_block1/apkmirror_data1
                                                          Brick4: citadel:/mnt/citadel_block1/apkmirror_data1
                                                          Options Reconfigured:
                                                          cluster.quorum-count: 1
                                                          cluster.quorum-type: fixed
                                                          network.ping-timeout: 5
                                                          network.remote-dio: enable
                                                          performance.rda-cache-limit: 256MB
                                                          performance.readdir-ahead: on
                                                          performance.parallel-readdir: on
                                                          network.inode-lru-limit: 500000
                                                          performance.md-cache-timeout: 600
                                                          performance.cache-invalidation: on
                                                          performance.stat-prefetch: on
                                                          features.cache-invalidation-timeout: 600
                                                          features.cache-invalidation: on
                                                          cluster.readdir-optimize: on
                                                          performance.io-thread-count: 32
                                                          server.event-threads: 4
                                                          client.event-threads: 4
                                                          performance.read-ahead: off
                                                          cluster.lookup-optimize: on
                                                          performance.cache-size: 1GB
                                                          cluster.self-heal-daemon: enable
                                                          transport.address-family: inet
                                                          nfs.disable: on
                                                          performance.client-io-threads: on
                                                          
                                                          
                                                          The mounts are done as follows in /etc/fstab:
                                                          /dev/disk/by-id/scsi-0Linode_Volume_citadel_block1 /mnt/citadel_block1 xfs defaults 0 2
                                                          localhost:/apkmirror_data1 /mnt/apkmirror_data1 glusterfs defaults,_netdev 0 0

                                                          
                                                          I'm really not sure if direct-io-mode mount tweaks would do
                                                          anything here, what the value should be set to, and what it is
                                                          by default.

                                                          The OS is OpenSUSE 42.3, 64-bit. 80GB of RAM, 20 CPUs, hosted
                                                          by Linode.

                                                          I'd really appreciate any help in the matter.

                                                          Thank you.
                                                        
                                                      
                                                          Sincerely,

                                                          Artem

                                                          

                                                          
                                                          On
                                                          Thu, Apr 5,
                                                          2018 at 11:13
                                                          PM, Artem
                                                          Russakovskii <archon810@xxxxxxxxx>
                                                          wrote:

                                                          
                                                          Hi,
                                                          

                                                          I'm trying to squeeze performance out of gluster on 4 80GB RAM
                                                          20-CPU machines where Gluster runs on attached block storage
                                                          (Linode) in (4 replicate bricks), and so far everything I tried
                                                          results in sub-optimal performance.

                                                          There are many files - mostly images, several million - and many
                                                          operations take minutes, copying multiple files (even if they're
                                                          small) suddenly freezes up for seconds at a time, then continues,
                                                          iostat frequently shows large r_await and w_awaits with 100%
                                                          utilization for the attached block device, etc.
                                                          

                                                          But anyway, there are many guides out there for small-file
                                                          performance improvements, but more explanation is needed, and I
                                                          think more tweaks should be possible.

                                                          
                                                          My question today is about performance.cache-size. Is this a size
                                                          of cache in RAM? If so, how do I view the current cache size to
                                                          see if it gets full and I should increase its size? Is it
                                                          advisable to bump it up if I have many tens of gigs of RAM free?
                                                          

                                                          More generally, in the last 2 months since I first started working
                                                          with gluster and set a production system live, I've been feeling
                                                          frustrated because Gluster has a lot of poorly-documented and
                                                          confusing options. I really wish documentation could be improved
                                                          with examples and better explanations.
                                                          

                                                          Specifically, it'd be absolutely amazing if the docs offered a
                                                          strategy for setting each value and ways of determining more
                                                          optimal values. For example, for performance.cache-size, if it
                                                          said something like "run command abc to see your current cache
                                                          size, and if it's hurting, up it, but be aware that it's limited
                                                          by RAM," it'd be already a huge improvement to the docs. And so on
                                                          with other options.
                                                          
                                                          
                                                          The gluster team is quite helpful on this mailing list, but in a
                                                          reactive rather than proactive way. Perhaps it's tunnel vision
                                                          once you've worked on a project for so long where less technical
                                                          explanations and even proper documentation of options takes a back
                                                          seat, but I encourage you to be more proactive about helping us
                                                          understand and optimize Gluster.
                                                          

                                                          Thank you.
                                                          

                                                          Sincerely,

                                                          Artem

                                                          
                                                          --

                                                          Founder, Android Police, APK Mirror, Illogical Robot LLC
                                                          beerpla.net | +ArtemRussakovskii | @ArtemR

                                                          
                                                  _______________________________________________

                                                    Gluster-users
                                                    mailing list

                                                    Gluster-users@xxxxxxxxxxx

                                                    http://lists.gluster.org/mailman/listinfo/gluster-users

                                                  