You have to figure out the difference in the volinfo across the peers and rectify it. Or, more simply, you can reduce the version in the vol info by one on node3; restarting glusterd will then solve the problem.
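Something along these lines (a rough sketch only — the volume name "storage" and the node names are my assumptions from your earlier mails, adjust to your setup):

# On every node, dump the on-disk volume definition so the nodes can be diffed:
# sort /var/lib/glusterd/vols/storage/info > /tmp/storage-info.$(hostname)

# If gluster03 only differs in the version counter, then on gluster03 only:
# systemctl stop glusterd        <-- or "service glusterd stop" on non-systemd nodes
# (edit /var/lib/glusterd/vols/storage/info and lower the "version=" value by one, to match the other nodes)
# systemctl start glusterd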
But I would be more interested in figuring out why glusterd crashed.
1) Can you paste the back trace from the core that was generated? (Example gdb command below.)
2) Can you paste the op-version of all the nodes?
3) Can you mention the steps you performed that led to the crash? It seems like you added a brick.
4) If possible, can you recollect the order in which you added the peers and their versions, as well as the upgrade sequence?
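For (1) and (2), something like the following should work (a sketch; the core file path is an assumption — it depends on your core_pattern/abrt settings — and you may need the glusterfs debuginfo package for a readable trace):

# gdb -batch -ex 'thread apply all bt full' /usr/sbin/glusterd /path/to/core > glusterd-bt.txt
# grep operating-version /var/lib/glusterd/glusterd.info     <-- run on every node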
Maybe you can raise a bug in Bugzilla with this information.
Regards
Rafi KC
On 1 Mar 2016 12:58 am, Steve Dainard <sdainard@xxxxxxxx> wrote:
I changed quota-version=1 on the two new nodes, and was able to join the cluster. I also rebooted the two new nodes and everything came up correctly.

Then I triggered a rebalance fix-layout, and glusterd crashed on one of the original cluster members (node gluster03). I restarted glusterd and was connected, but after a few minutes I'm left with:

# gluster peer status
Number of Peers: 5

Hostname: 10.0.231.51
Uuid: b01de59a-4428-486b-af49-cb486ab44a07
State: Peer in Cluster (Connected)

Hostname: 10.0.231.52
Uuid: 75143760-52a3-4583-82bb-a9920b283dac
State: Peer Rejected (Connected)

Hostname: 10.0.231.53
Uuid: 2c0b8bb6-825a-4ddd-9958-d8b46e9a2411
State: Peer in Cluster (Connected)

Hostname: 10.0.231.54
Uuid: 408d88d6-0448-41e8-94a3-bf9f98255d9c
State: Peer in Cluster (Connected)

Hostname: 10.0.231.55
Uuid: 9c155c8e-2cd1-4cfc-83af-47129b582fd3
State: Peer in Cluster (Connected)

I see in the logs (attached) there is now a cksum error:

[2016-02-29 19:16:42.082256] E [MSGID: 106010] [glusterd-utils.c:2717:glusterd_compare_friend_volume] 0-management: Version of Cksums storage differ. local cksum = 50348222, remote cksum = 50348735 on peer 10.0.231.55
[2016-02-29 19:16:42.082298] I [MSGID: 106493] [glusterd-handler.c:3780:glusterd_xfer_friend_add_resp] 0-glusterd: Responded to 10.0.231.55 (0), ret: 0
[2016-02-29 19:16:42.092535] I [MSGID: 106493] [glusterd-rpc-ops.c:480:__glusterd_friend_add_cbk] 0-glusterd: Received RJT from uuid: 2c0b8bb6-825a-4ddd-9958-d8b46e9a2411, host: 10.0.231.53, port: 0
[2016-02-29 19:16:42.096036] I [MSGID: 106143] [glusterd-pmap.c:229:pmap_registry_bind] 0-pmap: adding brick /mnt/lv-export-domain-storage/export-domain-storage on port 49153
[2016-02-29 19:16:42.097296] I [MSGID: 106143] [glusterd-pmap.c:229:pmap_registry_bind] 0-pmap: adding brick /mnt/lv-vm-storage/vm-storage on port 49155
[2016-02-29 19:16:42.100727] I [MSGID: 106163] [glusterd-handshake.c:1193:__glusterd_mgmt_hndsk_versions_ack] 0-management: using the op-version 30700
[2016-02-29 19:16:42.108495] I [MSGID: 106490] [glusterd-handler.c:2539:__glusterd_handle_incoming_friend_req] 0-glusterd: Received probe from uuid: 2c0b8bb6-825a-4ddd-9958-d8b46e9a2411
[2016-02-29 19:16:42.109295] E [MSGID: 106010] [glusterd-utils.c:2717:glusterd_compare_friend_volume] 0-management: Version of Cksums storage differ. local cksum = 50348222, remote cksum = 50348735 on peer 10.0.231.53
[2016-02-29 19:16:42.109338] I [MSGID: 106493] [glusterd-handler.c:3780:glusterd_xfer_friend_add_resp] 0-glusterd: Responded to 10.0.231.53 (0), ret: 0
[2016-02-29 19:16:42.119521] I [MSGID: 106143] [glusterd-pmap.c:229:pmap_registry_bind] 0-pmap: adding brick /mnt/lv-env-modules/env-modules on port 49157
[2016-02-29 19:16:42.122856] I [MSGID: 106143] [glusterd-pmap.c:229:pmap_registry_bind] 0-pmap: adding brick /mnt/raid6-storage/storage on port 49156
[2016-02-29 19:16:42.508104] I [MSGID: 106493] [glusterd-rpc-ops.c:480:__glusterd_friend_add_cbk] 0-glusterd: Received RJT from uuid: b01de59a-4428-486b-af49-cb486ab44a07, host: 10.0.231.51, port: 0
[2016-02-29 19:16:42.519403] I [MSGID: 106163] [glusterd-handshake.c:1193:__glusterd_mgmt_hndsk_versions_ack] 0-management: using the op-version 30700
[2016-02-29 19:16:42.524353] I [MSGID: 106490] [glusterd-handler.c:2539:__glusterd_handle_incoming_friend_req] 0-glusterd: Received probe from uuid: b01de59a-4428-486b-af49-cb486ab44a07
[2016-02-29 19:16:42.524999] E [MSGID: 106010] [glusterd-utils.c:2717:glusterd_compare_friend_volume] 0-management: Version of Cksums storage differ. local cksum = 50348222, remote cksum = 50348735 on peer 10.0.231.51
[2016-02-29 19:16:42.525038] I [MSGID: 106493] [glusterd-handler.c:3780:glusterd_xfer_friend_add_resp] 0-glusterd: Responded to 10.0.231.51 (0), ret: 0
[2016-02-29 19:16:42.592523] I [MSGID: 106493] [glusterd-rpc-ops.c:480:__glusterd_friend_add_cbk] 0-glusterd: Received RJT from uuid: 408d88d6-0448-41e8-94a3-bf9f98255d9c, host: 10.0.231.54, port: 0
[2016-02-29 19:16:42.599518] I [MSGID: 106163] [glusterd-handshake.c:1193:__glusterd_mgmt_hndsk_versions_ack] 0-management: using the op-version 30700
[2016-02-29 19:16:42.604821] I [MSGID: 106490] [glusterd-handler.c:2539:__glusterd_handle_incoming_friend_req] 0-glusterd: Received probe from uuid: 408d88d6-0448-41e8-94a3-bf9f98255d9c
[2016-02-29 19:16:42.605458] E [MSGID: 106010] [glusterd-utils.c:2717:glusterd_compare_friend_volume] 0-management: Version of Cksums storage differ. local cksum = 50348222, remote cksum = 50348735 on peer 10.0.231.54
[2016-02-29 19:16:42.605492] I [MSGID: 106493] [glusterd-handler.c:3780:glusterd_xfer_friend_add_resp] 0-glusterd: Responded to 10.0.231.54 (0), ret: 0
[2016-02-29 19:16:42.621943] I [MSGID: 106163] [glusterd-handshake.c:1193:__glusterd_mgmt_hndsk_versions_ack] 0-management: using the op-version 30700
[2016-02-29 19:16:42.628443] I [MSGID: 106490] [glusterd-handler.c:2539:__glusterd_handle_incoming_friend_req] 0-glusterd: Received probe from uuid: a965e782-39e2-41cc-a0d1-b32ecccdcd2f
[2016-02-29 19:16:42.629079] E [MSGID: 106010] [glusterd-utils.c:2717:glusterd_compare_friend_volume] 0-management: Version of Cksums storage differ. local cksum = 50348222, remote cksum = 50348735 on peer 10.0.231.50

On gluster01/02/04/05, /var/lib/glusterd/vols/storage/cksum contains:
info=998305000

On gluster03, /var/lib/glusterd/vols/storage/cksum contains:
info=998305001

How do I recover from this? Can I just stop glusterd on gluster03 and change the cksum value?

On Thu, Feb 25, 2016 at 12:49 PM, Mohammed Rafi K C <rkavunga@xxxxxxxxxx> wrote:

If all the op-versions are the same (3.7.6), then to work around the issue you can manually set quota-version=1; restarting glusterd will solve the problem. But I would strongly recommend that you figure out the RCA. Maybe you can file a bug for this.
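In practice that workaround comes down to something like this on each of the two new nodes (a sketch only; "storage" is the volume shown in the configs below):

# systemctl stop glusterd
# sed -i 's/^quota-version=0/quota-version=1/' /var/lib/glusterd/vols/storage/info
# systemctl start glusterd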
On 02/26/2016 01:53 AM, Mohammed Rafi K C wrote:
On 02/26/2016 01:32 AM, Steve Dainard wrote:
I haven't done anything more than peer thus far, so I'm a bit confused as to how the volume info fits in. Can you expand on this a bit?
Failed commits? Is this split brain on the replica volumes? I don't get any output from 'gluster volume heal <volname> info' on any of the replica volumes, but if I try 'gluster volume heal <volname> full' I get: 'Launching heal operation to perform full self heal on volume <volname> has been unsuccessful'.
Forget about this; it is not for metadata self-heal.
I have 5 volumes total.
'Replica 3' volumes running on gluster01/02/03:
vm-storage
iso-storage
export-domain-storage
env-modules
And one distributed only volume 'storage' info shown below:
From existing host gluster01/02:
type=0
count=4
status=1
sub_count=0
stripe_count=1
replica_count=1
disperse_count=0
redundancy_count=0
version=25
transport-type=0
volume-id=26d355cb-c486-481f-ac16-e25390e73775
username=eb9e2063-6ba8-4d16-a54f-2c7cf7740c4c
password=
op-version=3
client-op-version=3
quota-version=1
parent_volname=N/A
restored_from_snap=00000000-0000-0000-0000-000000000000
snap-max-hard-limit=256
features.quota-deem-statfs=on
features.inode-quota=on
diagnostics.brick-log-level=WARNING
features.quota=on
performance.readdir-ahead=on
performance.cache-size=1GB
performance.stat-prefetch=on
brick-0=10.0.231.50:-mnt-raid6-storage-storage
brick-1=10.0.231.51:-mnt-raid6-storage-storage
brick-2=10.0.231.52:-mnt-raid6-storage-storage
brick-3=10.0.231.53:-mnt-raid6-storage-storage
From existing host gluster03/04:
type=0
count=4
status=1
sub_count=0
stripe_count=1
replica_count=1
disperse_count=0
redundancy_count=0
version=25
transport-type=0
volume-id=26d355cb-c486-481f-ac16-e25390e73775
username=eb9e2063-6ba8-4d16-a54f-2c7cf7740c4c
password=
op-version=3
client-op-version=3
quota-version=1
parent_volname=N/A
restored_from_snap=00000000-0000-0000-0000-000000000000
snap-max-hard-limit=256
features.quota-deem-statfs=on
features.inode-quota=on
performance.stat-prefetch=on
performance.cache-size=1GB
performance.readdir-ahead=on
features.quota=on
diagnostics.brick-log-level=WARNING
brick-0=10.0.231.50:-mnt-raid6-storage-storage
brick-1=10.0.231.51:-mnt-raid6-storage-storage
brick-2=10.0.231.52:-mnt-raid6-storage-storage
brick-3=10.0.231.53:-mnt-raid6-storage-storage
So far between gluster01/02 and gluster03/04 the configs are the same, although the ordering is different for some of the features.
On gluster05/06 the ordering is different again, and the quota-version=0 instead of 1.
This is why the peer shows as rejected. Can you check the op-version of all the glusterd instances, including the one that is in the rejected state? You can find the op-version in /var/lib/glusterd/glusterd.info.
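For example (every node should report the same value):

# grep operating-version /var/lib/glusterd/glusterd.info     <-- run on all the nodes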
Rafi
Rafi KC
From new hosts gluster05/gluster06:
type=0
count=4
status=1
sub_count=0
stripe_count=1
replica_count=1
disperse_count=0
redundancy_count=0
version=25
transport-type=0
volume-id=26d355cb-c486-481f-ac16-e25390e73775
username=eb9e2063-6ba8-4d16-a54f-2c7cf7740c4c
password=
op-version=3
client-op-version=3
quota-version=0
parent_volname=N/A
restored_from_snap=00000000-0000-0000-0000-000000000000
snap-max-hard-limit=256
performance.stat-prefetch=on
performance.cache-size=1GB
performance.readdir-ahead=on
features.quota=on
diagnostics.brick-log-level=WARNING
features.inode-quota=on
features.quota-deem-statfs=on
brick-0=10.0.231.50:-mnt-raid6-storage-storage
brick-1=10.0.231.51:-mnt-raid6-storage-storage
brick-2=10.0.231.52:-mnt-raid6-storage-storage
brick-3=10.0.231.53:-mnt-raid6-storage-storage
Also, I forgot to mention that when I initially peer'd the two new hosts, glusterd crashed on gluster03 and had to be restarted (log attached), but it has been fine since.
Thanks,
Steve
On Thu, Feb 25, 2016 at 11:27 AM, Mohammed Rafi K C <rkavunga@xxxxxxxxxx> wrote:
On 02/25/2016 11:45 PM, Steve Dainard wrote:
Hello,
I upgraded from 3.6.6 to 3.7.6 a couple weeks ago. I just peered 2 new nodes to a 4 node cluster and gluster peer status is:
# gluster peer status <-- from node gluster01
Number of Peers: 5
Hostname: 10.0.231.51
Uuid: b01de59a-4428-486b-af49-cb486ab44a07
State: Peer in Cluster (Connected)
Hostname: 10.0.231.52
Uuid: 75143760-52a3-4583-82bb-a9920b283dac
State: Peer in Cluster (Connected)
Hostname: 10.0.231.53
Uuid: 2c0b8bb6-825a-4ddd-9958-d8b46e9a2411
State: Peer in Cluster (Connected)
Hostname: 10.0.231.54 <-- new node gluster05
Uuid: 408d88d6-0448-41e8-94a3-bf9f98255d9c
State: Peer Rejected (Connected)
Hostname: 10.0.231.55 <-- new node gluster06
Uuid: 9c155c8e-2cd1-4cfc-83af-47129b582fd3
State: Peer Rejected (Connected)
It looks like your configuration files are mismatched, i.e. the checksum calculation differs on these two nodes compared to the others.
Did you have any failed commits?
Compare the /var/lib/glusterd/vols/<volname>/info file on the failed node against a good one; most likely you will see some difference.
Can you paste the /var/lib/glusterd/vols/<volname>/info contents?
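Something like this would show the difference (a sketch, assuming passwordless ssh between the nodes and using the distributed "storage" volume as the example; sorting avoids false diffs from key ordering):

# for h in gluster01 gluster05; do ssh $h cat /var/lib/glusterd/vols/storage/info | sort > /tmp/info.$h; done
# diff /tmp/info.gluster01 /tmp/info.gluster05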
Regards
Rafi KC
I followed the write-up here: http://www.gluster.org/community/documentation/index.php/Resolving_Peer_Rejected and the two new nodes peer'd properly, but after a reboot of the two new nodes I'm seeing the same Peer Rejected (Connected) state.
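For reference, the procedure from that write-up is roughly the following on each rejected node (paraphrasing from memory — the linked page has the exact steps):

# systemctl stop glusterd
# cd /var/lib/glusterd && find . -mindepth 1 -maxdepth 1 ! -name glusterd.info -exec rm -rf {} +     <-- remove everything except glusterd.info
# systemctl start glusterd
# gluster peer probe <one of the good nodes>
# systemctl restart glusterd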
I've attached logs from an existing node, and the two new nodes.
Thanks for any suggestions,
Steve
_______________________________________________
Gluster-users mailing list
Gluster-users@xxxxxxxxxxx
http://www.gluster.org/mailman/listinfo/gluster-users