I had opened another thread on this mailing
list (Subject: "After upgrade from 3.4.2 to 3.8.5 -
High CPU usage resulting in disconnects and
split-brain").
The title may be a bit misleading now, as I
am no longer observing high CPU usage after upgrading
to 3.8.6, but the disconnects are still happening and
the number of files in split-brain is growing.
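For reference, the split-brain files I am referring to are the ones reported by the standard heal command on gv0:

[root@giant2: ~]# gluster volume heal gv0 info split-brain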
Setup: 6 compute nodes, each serving as a
glusterfs server and client, Ubuntu 14.04, two bricks
per node, distribute-replicate
I have two gluster volumes set up (one for scratch data, one for the Slurm scheduler). Only the scratch data volume shows critical errors like "[...] has not responded in the last 42 seconds, disconnecting." Since only one of the two volumes is affected, I think I can rule out network problems; the gigabit link between the nodes is not saturated at all, and the disks are almost idle (<10% utilization).
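The 42 seconds in that error correspond to the default network.ping-timeout; it does not show up under "Options Reconfigured" below, so it is still at the default here. In case anyone wants to compare, the effective value can be checked with:

[root@giant2: ~]# gluster volume get gv0 network.ping-timeout

(It could be raised with "gluster volume set gv0 network.ping-timeout <seconds>", but I have not changed it.)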
I have glusterfs 3.4.2 on Ubuntu 12.04 on another compute cluster, and it has been running fine since it was deployed.
I had glusterfs 3.4.2 on Ubuntu 14.04 on
this cluster, running fine for almost a year.
After upgrading to 3.8.5, the problems described above started. I would like to use some of the features of the newer versions (like bitrot detection), but right now the users can't run their compute jobs because the result files are garbled.
There also seems to be a bug report with a similar problem (but no progress there):
For me, ALL servers are affected (the problem is not isolated to one or two servers).
For completeness, here is the volume info (gv0 is the scratch volume, gv2 the Slurm volume):
[root@giant2: ~]# gluster v info
Volume Name: gv0
Type: Distributed-Replicate
Volume ID: 993ec7c9-e4bc-44d0-b7c4-2d977e622e86
Status: Started
Snapshot Count: 0
Number of Bricks: 6 x 2 = 12
Transport-type: tcp
Bricks:
Brick1: giant1:/gluster/sdc/gv0
Brick2: giant2:/gluster/sdc/gv0
Brick3: giant3:/gluster/sdc/gv0
Brick4: giant4:/gluster/sdc/gv0
Brick5: giant5:/gluster/sdc/gv0
Brick6: giant6:/gluster/sdc/gv0
Brick7: giant1:/gluster/sdd/gv0
Brick8: giant2:/gluster/sdd/gv0
Brick9: giant3:/gluster/sdd/gv0
Brick10: giant4:/gluster/sdd/gv0
Brick11: giant5:/gluster/sdd/gv0
Brick12: giant6:/gluster/sdd/gv0
Options Reconfigured:
auth.allow: X.X.X.*,127.0.0.1
nfs.disable: on
Volume Name: gv2
Type: Replicate
Volume ID: 30c78928-5f2c-4671-becc-8deaee1a7a8d
Status: Started
Snapshot Count: 0
Number of Bricks: 1 x 2 = 2
Transport-type: tcp
Bricks:
Brick1: giant1:/gluster/sdd/gv2
Brick2: giant2:/gluster/sdd/gv2
Options Reconfigured:
auth.allow: X.X.X.*,127.0.0.1
cluster.granular-entry-heal: on
cluster.locking-scheme: granular
nfs.disable: on