I had opened another thread on this mailing list (Subject: "After upgrade from 3.4.2 to 3.8.5 - High CPU usage resulting in disconnects and split-brain"). The title may be a bit misleading now, as I am no longer observing high CPU usage after upgrading to 3.8.6, but the disconnects are still happening and the number of files in split-brain is growing.
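In case it is relevant: the split-brain count I mention is per volume, as listed by the standard heal info command (gv0 is the affected scratch volume, full volume info is further down):

    gluster volume heal gv0 info split-brain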
Setup: 6 compute nodes, each serving as a glusterfs server and client, Ubuntu 14.04, two bricks per node, distribute-replicate.
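For reference, a volume with this layout (replica 2, two bricks per node, 6 x 2 = 12 bricks) would have been created roughly like the sketch below; this is only illustrative, the actual brick list is in the gluster v info output at the end of this mail:

    gluster volume create gv0 replica 2 \
        giant1:/gluster/sdc/gv0 giant2:/gluster/sdc/gv0 \
        giant3:/gluster/sdc/gv0 giant4:/gluster/sdc/gv0 \
        giant5:/gluster/sdc/gv0 giant6:/gluster/sdc/gv0 \
        giant1:/gluster/sdd/gv0 giant2:/gluster/sdd/gv0 \
        giant3:/gluster/sdd/gv0 giant4:/gluster/sdd/gv0 \
        giant5:/gluster/sdd/gv0 giant6:/gluster/sdd/gv0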
I have two gluster volumes set up (one for scratch data, one for the slurm scheduler). Only the scratch data volume shows critical errors like "[...] has not responded in the last 42 seconds, disconnecting.". So I can rule out network problems: the gigabit link between the nodes is not saturated at all, and the disks are almost idle (<10%).
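The 42 seconds in that message match the default network.ping-timeout, which I have not reconfigured on either volume (see the options below); as far as I know it can be checked per volume with:

    gluster volume get gv0 network.ping-timeout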
I have glusterfs 3.4.2 on Ubuntu 12.04 on another compute cluster, running fine since it was deployed.
I had glusterfs 3.4.2 on Ubuntu 14.04 on this cluster, running fine for almost a year.
After upgrading to 3.8.5, the problems (as described) started. I would like to use some of the new features of the newer versions (like bitrot), but the users can't run their compute jobs right now because the result files are garbled.
There also seems to be a bug report with a similar problem (but no progress):
For me, ALL servers are affected (not isolated to one or two servers).
For completeness (gv0 is the scratch volume, gv2 the slurm volume):
[root@giant2: ~]# gluster v info

Volume Name: gv0
Type: Distributed-Replicate
Volume ID: 993ec7c9-e4bc-44d0-b7c4-2d977e622e86
Status: Started
Snapshot Count: 0
Number of Bricks: 6 x 2 = 12
Transport-type: tcp
Bricks:
Brick1: giant1:/gluster/sdc/gv0
Brick2: giant2:/gluster/sdc/gv0
Brick3: giant3:/gluster/sdc/gv0
Brick4: giant4:/gluster/sdc/gv0
Brick5: giant5:/gluster/sdc/gv0
Brick6: giant6:/gluster/sdc/gv0
Brick7: giant1:/gluster/sdd/gv0
Brick8: giant2:/gluster/sdd/gv0
Brick9: giant3:/gluster/sdd/gv0
Brick10: giant4:/gluster/sdd/gv0
Brick11: giant5:/gluster/sdd/gv0
Brick12: giant6:/gluster/sdd/gv0
Options Reconfigured:
auth.allow: X.X.X.*,127.0.0.1
nfs.disable: on

Volume Name: gv2
Type: Replicate
Volume ID: 30c78928-5f2c-4671-becc-8deaee1a7a8d
Status: Started
Snapshot Count: 0
Number of Bricks: 1 x 2 = 2
Transport-type: tcp
Bricks:
Brick1: giant1:/gluster/sdd/gv2
Brick2: giant2:/gluster/sdd/gv2
Options Reconfigured:
auth.allow: X.X.X.*,127.0.0.1
cluster.granular-entry-heal: on
cluster.locking-scheme: granular
nfs.disable: on