I set quota-version=1 on the two new nodes and was able to join them to the cluster. I also rebooted the two new nodes and everything came up correctly.
Then I triggered a rebalance fix-layout, and glusterd crashed on one of the original cluster members (gluster03). I restarted glusterd and it connected, but after a few minutes I'm left with:
# gluster peer status
Number of Peers: 5
Hostname: 10.0.231.51
Uuid: b01de59a-4428-486b-af49-cb486ab44a07
State: Peer in Cluster (Connected)
Hostname: 10.0.231.52
Uuid: 75143760-52a3-4583-82bb-a9920b283dac
State: Peer Rejected (Connected)
Hostname: 10.0.231.53
Uuid: 2c0b8bb6-825a-4ddd-9958-d8b46e9a2411
State: Peer in Cluster (Connected)
Hostname: 10.0.231.54
Uuid: 408d88d6-0448-41e8-94a3-bf9f98255d9c
State: Peer in Cluster (Connected)
Hostname: 10.0.231.55
Uuid: 9c155c8e-2cd1-4cfc-83af-47129b582fd3
State: Peer in Cluster (Connected)
I see in the logs (attached) there is now a cksum error:
[2016-02-29 19:16:42.082256] E [MSGID: 106010] [glusterd-utils.c:2717:glusterd_compare_friend_volume] 0-management: Version of Cksums storage differ. local cksum = 50348222, remote cksum = 50348735 on peer 10.0.231.55
[2016-02-29 19:16:42.082298] I [MSGID: 106493] [glusterd-handler.c:3780:glusterd_xfer_friend_add_resp] 0-glusterd: Responded to 10.0.231.55 (0), ret: 0
[2016-02-29 19:16:42.092535] I [MSGID: 106493] [glusterd-rpc-ops.c:480:__glusterd_friend_add_cbk] 0-glusterd: Received RJT from uuid: 2c0b8bb6-825a-4ddd-9958-d8b46e9a2411, host: 10.0.231.53, port: 0
[2016-02-29 19:16:42.096036] I [MSGID: 106143] [glusterd-pmap.c:229:pmap_registry_bind] 0-pmap: adding brick /mnt/lv-export-domain-storage/export-domain-storage on port 49153
[2016-02-29 19:16:42.097296] I [MSGID: 106143] [glusterd-pmap.c:229:pmap_registry_bind] 0-pmap: adding brick /mnt/lv-vm-storage/vm-storage on port 49155
[2016-02-29 19:16:42.100727] I [MSGID: 106163] [glusterd-handshake.c:1193:__glusterd_mgmt_hndsk_versions_ack] 0-management: using the op-version 30700
[2016-02-29 19:16:42.108495] I [MSGID: 106490] [glusterd-handler.c:2539:__glusterd_handle_incoming_friend_req] 0-glusterd: Received probe from uuid: 2c0b8bb6-825a-4ddd-9958-d8b46e9a2411
[2016-02-29 19:16:42.109295] E [MSGID: 106010] [glusterd-utils.c:2717:glusterd_compare_friend_volume] 0-management: Version of Cksums storage differ. local cksum = 50348222, remote cksum = 50348735 on peer 10.0.231.53
[2016-02-29 19:16:42.109338] I [MSGID: 106493] [glusterd-handler.c:3780:glusterd_xfer_friend_add_resp] 0-glusterd: Responded to 10.0.231.53 (0), ret: 0
[2016-02-29 19:16:42.119521] I [MSGID: 106143] [glusterd-pmap.c:229:pmap_registry_bind] 0-pmap: adding brick /mnt/lv-env-modules/env-modules on port 49157
[2016-02-29 19:16:42.122856] I [MSGID: 106143] [glusterd-pmap.c:229:pmap_registry_bind] 0-pmap: adding brick /mnt/raid6-storage/storage on port 49156
[2016-02-29 19:16:42.508104] I [MSGID: 106493] [glusterd-rpc-ops.c:480:__glusterd_friend_add_cbk] 0-glusterd: Received RJT from uuid: b01de59a-4428-486b-af49-cb486ab44a07, host: 10.0.231.51, port: 0
[2016-02-29 19:16:42.519403] I [MSGID: 106163] [glusterd-handshake.c:1193:__glusterd_mgmt_hndsk_versions_ack] 0-management: using the op-version 30700
[2016-02-29 19:16:42.524353] I [MSGID: 106490] [glusterd-handler.c:2539:__glusterd_handle_incoming_friend_req] 0-glusterd: Received probe from uuid: b01de59a-4428-486b-af49-cb486ab44a07
[2016-02-29 19:16:42.524999] E [MSGID: 106010] [glusterd-utils.c:2717:glusterd_compare_friend_volume] 0-management: Version of Cksums storage differ. local cksum = 50348222, remote cksum = 50348735 on peer 10.0.231.51
[2016-02-29 19:16:42.525038] I [MSGID: 106493] [glusterd-handler.c:3780:glusterd_xfer_friend_add_resp] 0-glusterd: Responded to 10.0.231.51 (0), ret: 0
[2016-02-29 19:16:42.592523] I [MSGID: 106493] [glusterd-rpc-ops.c:480:__glusterd_friend_add_cbk] 0-glusterd: Received RJT from uuid: 408d88d6-0448-41e8-94a3-bf9f98255d9c, host: 10.0.231.54, port: 0
[2016-02-29 19:16:42.599518] I [MSGID: 106163] [glusterd-handshake.c:1193:__glusterd_mgmt_hndsk_versions_ack] 0-management: using the op-version 30700
[2016-02-29 19:16:42.604821] I [MSGID: 106490] [glusterd-handler.c:2539:__glusterd_handle_incoming_friend_req] 0-glusterd: Received probe from uuid: 408d88d6-0448-41e8-94a3-bf9f98255d9c
[2016-02-29 19:16:42.605458] E [MSGID: 106010] [glusterd-utils.c:2717:glusterd_compare_friend_volume] 0-management: Version of Cksums storage differ. local cksum = 50348222, remote cksum = 50348735 on peer 10.0.231.54
[2016-02-29 19:16:42.605492] I [MSGID: 106493] [glusterd-handler.c:3780:glusterd_xfer_friend_add_resp] 0-glusterd: Responded to 10.0.231.54 (0), ret: 0
[2016-02-29 19:16:42.621943] I [MSGID: 106163] [glusterd-handshake.c:1193:__glusterd_mgmt_hndsk_versions_ack] 0-management: using the op-version 30700
[2016-02-29 19:16:42.628443] I [MSGID: 106490] [glusterd-handler.c:2539:__glusterd_handle_incoming_friend_req] 0-glusterd: Received probe from uuid: a965e782-39e2-41cc-a0d1-b32ecccdcd2f
[2016-02-29 19:16:42.629079] E [MSGID: 106010] [glusterd-utils.c:2717:glusterd_compare_friend_volume] 0-management: Version of Cksums storage differ. local cksum = 50348222, remote cksum = 50348735 on peer 10.0.231.50
On gluster01/02/04/05, /var/lib/glusterd/vols/storage/cksum contains:
info=998305000
On gluster03, it contains:
info=998305001
How do I recover from this? Can I just stop glusterd on gluster03 and change the cksum value?
On Thu, Feb 25, 2016 at 12:49 PM, Mohammed Rafi K C <rkavunga@xxxxxxxxxx> wrote:
If all the op-versions are the same and 3.7.6, then to work around the issue you can manually set quota-version=1; restarting glusterd will solve the problem. But I would strongly recommend that you figure out the RCA. Maybe you can file a bug for this.
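For reference, a minimal sketch of that workaround as I understand it (assuming glusterd is managed by systemd, and that the 'storage' volume's info file is the one out of sync), run on each node whose file still shows quota-version=0:
# systemctl stop glusterd
# sed -i 's/^quota-version=0$/quota-version=1/' /var/lib/glusterd/vols/storage/info <-- path assumes the 'storage' volume
# systemctl start glusterd
The idea is to edit the file while glusterd is stopped; after the restart the volume checksum is computed from the updated file and should then match the other peers.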
On 02/26/2016 01:53 AM, Mohammed Rafi K C wrote:
On 02/26/2016 01:32 AM, Steve Dainard wrote:
I haven't done anything more than peer the new nodes thus far, so I'm a bit confused as to how the volume info fits in. Can you expand on this a bit?
Failed commits? Is this split-brain on the replica volumes? I don't get any output from 'gluster volume heal <volname> info' on any of the replica volumes, but if I try 'gluster volume heal <volname> full' I get: 'Launching heal operation to perform full self heal on volume <volname> has been unsuccessful'.
Forget about this; it is not for metadata self-heal.
I have 5 volumes total.
'Replica 3' volumes running on gluster01/02/03:
vm-storage
iso-storage
export-domain-storage
env-modules
And one distributed-only volume, 'storage', whose info file contents are shown below:
From existing host gluster01/02:
type=0
count=4
status=1
sub_count=0
stripe_count=1
replica_count=1
disperse_count=0
redundancy_count=0
version=25
transport-type=0
volume-id=26d355cb-c486-481f-ac16-e25390e73775
username=eb9e2063-6ba8-4d16-a54f-2c7cf7740c4c
password=
op-version=3
client-op-version=3
quota-version=1
parent_volname=N/A
restored_from_snap=00000000-0000-0000-0000-000000000000
snap-max-hard-limit=256
features.quota-deem-statfs=on
features.inode-quota=on
diagnostics.brick-log-level=WARNING
features.quota=on
performance.readdir-ahead=on
performance.cache-size=1GB
performance.stat-prefetch=on
brick-0=10.0.231.50:-mnt-raid6-storage-storage
brick-1=10.0.231.51:-mnt-raid6-storage-storage
brick-2=10.0.231.52:-mnt-raid6-storage-storage
brick-3=10.0.231.53:-mnt-raid6-storage-storage
From existing host gluster03/04:
type=0
count=4
status=1
sub_count=0
stripe_count=1
replica_count=1
disperse_count=0
redundancy_count=0
version=25
transport-type=0
volume-id=26d355cb-c486-481f-ac16-e25390e73775
username=eb9e2063-6ba8-4d16-a54f-2c7cf7740c4c
password=
op-version=3
client-op-version=3
quota-version=1
parent_volname=N/A
restored_from_snap=00000000-0000-0000-0000-000000000000
snap-max-hard-limit=256
features.quota-deem-statfs=on
features.inode-quota=on
performance.stat-prefetch=on
performance.cache-size=1GB
performance.readdir-ahead=on
features.quota=on
diagnostics.brick-log-level=WARNING
brick-0=10.0.231.50:-mnt-raid6-storage-storage
brick-1=10.0.231.51:-mnt-raid6-storage-storage
brick-2=10.0.231.52:-mnt-raid6-storage-storage
brick-3=10.0.231.53:-mnt-raid6-storage-storage
So far between gluster01/02 and gluster03/04 the configs are the same, although the ordering is different for some of the features.
On gluster05/06 the ordering is different again, and quota-version is 0 instead of 1.
This is why the peer shows as rejected. Can you check the op-version of all the glusterd instances, including the one in the rejected state? You can find the op-version in /var/lib/glusterd/glusterd.info.
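For example, one way to check them all at once (hostnames are placeholders, and passwordless ssh is assumed; the key in glusterd.info is operating-version):
# for h in gluster0{1..6}; do echo -n "$h: "; ssh $h grep operating-version /var/lib/glusterd/glusterd.info; done <-- placeholder hostnames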
Rafi
Rafi KC
From new hosts gluster05/gluster06:
type=0
count=4
status=1
sub_count=0
stripe_count=1
replica_count=1
disperse_count=0
redundancy_count=0
version=25
transport-type=0
volume-id=26d355cb-c486-481f-ac16-e25390e73775
username=eb9e2063-6ba8-4d16-a54f-2c7cf7740c4c
password=
op-version=3
client-op-version=3
quota-version=0
parent_volname=N/A
restored_from_snap=00000000-0000-0000-0000-000000000000
snap-max-hard-limit=256
performance.stat-prefetch=on
performance.cache-size=1GB
performance.readdir-ahead=on
features.quota=on
diagnostics.brick-log-level=WARNING
features.inode-quota=on
features.quota-deem-statfs=on
brick-0=10.0.231.50:-mnt-raid6-storage-storage
brick-1=10.0.231.51:-mnt-raid6-storage-storage
brick-2=10.0.231.52:-mnt-raid6-storage-storage
brick-3=10.0.231.53:-mnt-raid6-storage-storage
Also, I forgot to mention that when I initially peered the two new hosts, glusterd crashed on gluster03 and had to be restarted (log attached), but it has been fine since.
Thanks,
Steve
On Thu, Feb 25, 2016 at 11:27 AM, Mohammed Rafi K C <rkavunga@xxxxxxxxxx> wrote:
On 02/25/2016 11:45 PM, Steve Dainard wrote:
Hello,
I upgraded from 3.6.6 to 3.7.6 a couple of weeks ago. I just peered 2 new nodes into a 4-node cluster, and gluster peer status is:
# gluster peer status <-- from node gluster01
Number of Peers: 5
Hostname: 10.0.231.51
Uuid: b01de59a-4428-486b-af49-cb486ab44a07
State: Peer in Cluster (Connected)
Hostname: 10.0.231.52
Uuid: 75143760-52a3-4583-82bb-a9920b283dac
State: Peer in Cluster (Connected)
Hostname: 10.0.231.53
Uuid: 2c0b8bb6-825a-4ddd-9958-d8b46e9a2411
State: Peer in Cluster (Connected)
Hostname: 10.0.231.54 <-- new node gluster05
Uuid: 408d88d6-0448-41e8-94a3-bf9f98255d9c
State: Peer Rejected (Connected)
Hostname: 10.0.231.55 <-- new node gluster06
Uuid: 9c155c8e-2cd1-4cfc-83af-47129b582fd3
State: Peer Rejected (Connected)
Looks like your configuration files are mismatched, i.e., the checksum calculation on these two nodes differs from the others.
Did you have any failed commits?
Compare /var/lib/glusterd/vols/<volname>/info on the failed node against a good one; most likely you will see some difference.
Can you paste the /var/lib/glusterd/vols/<volname>/info files?
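For example, a quick comparison from a good node against one of the rejected ones (gluster05 is a placeholder for a rejected node; sorting first hides differences that are only key ordering):
# diff <(sort /var/lib/glusterd/vols/storage/info) <(ssh gluster05 sort /var/lib/glusterd/vols/storage/info)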
Regards
Rafi KC
I followed the write-up here: http://www.gluster.org/community/documentation/index.php/Resolving_Peer_Rejected and the two new nodes peered properly, but after a reboot of the two new nodes I'm seeing the same Peer Rejected (Connected) state.
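The steps in that write-up are roughly as follows (paraphrased from memory, so the page itself is authoritative); they are run on the rejected node:
# systemctl stop glusterd
# cd /var/lib/glusterd && ls | grep -v glusterd.info | xargs rm -rf <-- remove everything except glusterd.info
# systemctl start glusterd
# gluster peer probe 10.0.231.50 <-- any peer still in 'Peer in Cluster' state
# systemctl restart glusterd
# gluster peer status
This wipes the rejected node's local volume configuration so that it is re-synced from the probed peer.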
I've attached logs from an existing node, and the two new nodes.
Thanks for any suggestions,
Steve
Attachment: etc-glusterfs-glusterd.vol.log.gluster03