I had opened another thread on this mailing list (Subject: "After upgrade from 3.4.2 to 3.8.5 - High CPU usage resulting in disconnects and split-brain"). The title may be a bit misleading now, as I am no longer observing high CPU usage after upgrading to 3.8.6, but the disconnects are still happening and the number of files in split-brain is growing.
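(For reference, the per-volume list of files in split-brain can be shown with the standard heal command, e.g. for the scratch volume gv0 described further down:

[root@giant2: ~]# gluster volume heal gv0 info split-brain
)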
Setup: 6 compute nodes, each serving as a glusterfs server and client, Ubuntu 14.04, two bricks per node, distribute-replicate.
I have two gluster volumes set up (one for scratch data, one for the slurm scheduler). Only the scratch data volume shows critical errors like "[...] has not responded in the last 42 seconds, disconnecting.". So I can rule out network problems: the gigabit link between the nodes is not saturated at all, and the disks are almost idle (<10%).
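(As far as I can tell, those 42 seconds correspond to the network.ping-timeout option, whose default is 42 seconds; it is not among the reconfigured options listed below. The effective value can be checked with something like:

[root@giant2: ~]# gluster volume get gv0 network.ping-timeout
)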
I have glusterfs 3.4.2 on Ubuntu 12.04 on another compute cluster, running fine since it was deployed.
I had glusterfs 3.4.2 on Ubuntu 14.04 on this cluster, running fine for almost a year.
After upgrading to 3.8.5, the problems described above started. I would like to use some of the new features of the newer versions (like bitrot), but the users can't run their compute jobs right now because the result files are garbled.
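(By bitrot I mean the per-volume bitrot detection feature; as far as I understand it is enabled with a command along the lines of the one below, but I have not turned it on yet while the cluster is in this state.

[root@giant2: ~]# gluster volume bitrot gv0 enable
)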
There also seems to be a bug report with a similar problem (but no progress):
For me, ALL servers are affected (not isolated to one or two servers).
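(For what it's worth, one quick way to see which nodes log the disconnect message is to grep the glusterfs client logs on each server, roughly like below; the exact log file names depend on the mount points.

[root@giant2: ~]# grep -l "has not responded in the last 42 seconds" /var/log/glusterfs/*.log
)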
For completeness (gv0 is the scratch volume, gv2 the slurm volume):
[root@giant2: ~]# gluster v info

Volume Name: gv0
Type: Distributed-Replicate
Volume ID: 993ec7c9-e4bc-44d0-b7c4-2d977e622e86
Status: Started
Snapshot Count: 0
Number of Bricks: 6 x 2 = 12
Transport-type: tcp
Bricks:
Brick1: giant1:/gluster/sdc/gv0
Brick2: giant2:/gluster/sdc/gv0
Brick3: giant3:/gluster/sdc/gv0
Brick4: giant4:/gluster/sdc/gv0
Brick5: giant5:/gluster/sdc/gv0
Brick6: giant6:/gluster/sdc/gv0
Brick7: giant1:/gluster/sdd/gv0
Brick8: giant2:/gluster/sdd/gv0
Brick9: giant3:/gluster/sdd/gv0
Brick10: giant4:/gluster/sdd/gv0
Brick11: giant5:/gluster/sdd/gv0
Brick12: giant6:/gluster/sdd/gv0
Options Reconfigured:
auth.allow: X.X.X.*,127.0.0.1
nfs.disable: on

Volume Name: gv2
Type: Replicate
Volume ID: 30c78928-5f2c-4671-becc-8deaee1a7a8d
Status: Started
Snapshot Count: 0
Number of Bricks: 1 x 2 = 2
Transport-type: tcp
Bricks:
Brick1: giant1:/gluster/sdd/gv2
Brick2: giant2:/gluster/sdd/gv2
Options Reconfigured:
auth.allow: X.X.X.*,127.0.0.1
cluster.granular-entry-heal: on
cluster.locking-scheme: granular
nfs.disable: on