Yes, this makes a lot of sense. It's the behavior I was experiencing that made no sense: when one node was shut down, the whole VM cluster locked up. I eventually tracked the culprit down to the quorum settings. I have now set the quorum to 2 bricks, and I am not experiencing the problem anymore. All my VM boot disks and data disks are now sharded. We are on 10 Gbit networking, and when the node comes back we see essentially no latency.
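For anyone hitting the same thing, a minimal sketch of what "2 bricks for quorum" plus sharding looks like on a replica-3 volume (the volume name "myvol" is just a placeholder):

    # require 2 of the 3 bricks to be up before allowing writes
    gluster volume set myvol cluster.quorum-type fixed
    gluster volume set myvol cluster.quorum-count 2

    # shard large files so heals copy small shards
    # instead of whole disk images
    # (only affects files created after it is enabled)
    gluster volume set myvol features.shard on
    gluster volume set myvol features.shard-block-size 64MB

Note that cluster.quorum-type auto gets you much the same result on replica 3, since it requires a majority of bricks; fixed/2 is just the explicit way to state it.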
Carl
On 2019-08-29 3:58 p.m., Darrell Budic wrote:
You may be misunderstanding the way the gluster system works in detail here, but you've got the right idea overall. Since gluster is maintaining 3 copies of your data, you can lose a drive or a whole system and things will keep going without interruption (well, mostly: if a host node was using the system that just died, it may pause briefly before re-connecting to one that is still running, via a backup-server setting or your DNS configs).

While the system is still going with one node down, that node is falling behind on new disk writes, and the remaining ones are keeping track of what's changing. Once you repair/recover/reboot the down node, it will rejoin the cluster. The recovered system then has to catch up, and it does this by having the other two nodes send it the changes. In the meantime, gluster serves any reads for that data from one of the up-to-date nodes, even if you ask the one you just restarted.

In order to do this healing, it has to lock the files to ensure no changes are made while it copies a chunk of them over to the recovered node. When it locks them, your hypervisor notices they have gone read-only and, especially if it has a pending write for that file, may pause the VM, because this looks like a storage issue to it. Once the file gets unlocked, it can be written again, and your hypervisor notices and will generally reactivate your VM. You may see delays too, especially if you only have 1G networking between your host nodes while everything is getting copied around. And your files may be locked, updated, unlocked, then locked again a few seconds or minutes later, and so on.
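On the "backup-server setting" mentioned above: for a fuse mount this is the backup-volfile-servers mount option, which lets the client fetch the volfile from another node if its primary server is down. A sketch, with host and volume names as placeholders:

    # client can still mount and re-fetch the volfile if node1 is down
    mount -t glusterfs \
        -o backup-volfile-servers=node2:node3 \
        node1:/myvol /mnt/myvol

(In oVirt the same string goes in the storage domain's mount options field, e.g. backup-volfile-servers=node2:node3.)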
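You can also watch the catch-up phase directly; if VMs pause, it usually lines up with pending heals on their images. For example (volume name again a placeholder):

    # list files/shards still needing heal, per brick
    gluster volume heal myvol info

    # just the counts, handy for watching the backlog drain
    gluster volume heal myvol statistics heal-count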
_______________________________________________
Gluster-users mailing list
Gluster-users@xxxxxxxxxxx
https://lists.gluster.org/mailman/listinfo/gluster-users