You have to figure out the difference in the volinfo across the peers and rectify it. Or, more simply, you can reduce the version in the vol info by one on node3; restarting glusterd will then solve the problem.
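Something along these lines (a rough sketch only — the volume name "storage" and the node names are my assumptions from your earlier mails, adjust to your setup):

# On every node, dump the on-disk volume definition so the nodes can be diffed:
# sort /var/lib/glusterd/vols/storage/info > /tmp/storage-info.$(hostname)

# If gluster03 only differs in the version counter, then on gluster03 only:
# systemctl stop glusterd        <-- or "service glusterd stop" on non-systemd nodes
# (edit /var/lib/glusterd/vols/storage/info and lower the "version=" value by one, to match the other nodes)
# systemctl start glusterd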
But I would be more interested in figuring out why glusterd crashed.
1) Can you paste the back trace from the core that was generated? (Example gdb command below.)
2) Can you paste the op-version of all the nodes?
3) Can you mention the steps you performed that led to the crash? It seems like you added a brick.
4) If possible, can you recollect the order in which you added the peers and their versions, as well as the upgrade sequence?
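For (1) and (2), something like the following should work (a sketch; the core file path is an assumption — it depends on your core_pattern/abrt settings — and you may need the glusterfs debuginfo package for a readable trace):

# gdb -batch -ex 'thread apply all bt full' /usr/sbin/glusterd /path/to/core > glusterd-bt.txt
# grep operating-version /var/lib/glusterd/glusterd.info     <-- run on every node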
Maybe you can raise a bug in Bugzilla with this information.
Regards
Rafi KC
On 1 Mar 2016 12:58 am, Steve Dainard <sdainard@xxxxxxxx> wrote:
I changed quota-version=1 on the two new nodes, and was able to join the cluster. I also rebooted the two new nodes and everything came up correctly.

Then I triggered a rebalance fix-layout, and glusterd crashed on one of the original cluster members (node gluster03). I restarted glusterd and was connected, but after a few minutes I'm left with:

# gluster peer status
Number of Peers: 5

Hostname: 10.0.231.51
Uuid: b01de59a-4428-486b-af49-cb486ab44a07
State: Peer in Cluster (Connected)

Hostname: 10.0.231.52
Uuid: 75143760-52a3-4583-82bb-a9920b283dac
State: Peer Rejected (Connected)

Hostname: 10.0.231.53
Uuid: 2c0b8bb6-825a-4ddd-9958-d8b46e9a2411
State: Peer in Cluster (Connected)

Hostname: 10.0.231.54
Uuid: 408d88d6-0448-41e8-94a3-bf9f98255d9c
State: Peer in Cluster (Connected)

Hostname: 10.0.231.55
Uuid: 9c155c8e-2cd1-4cfc-83af-47129b582fd3
State: Peer in Cluster (Connected)

I see in the logs (attached) there is now a cksum error:

[2016-02-29 19:16:42.082256] E [MSGID: 106010] [glusterd-utils.c:2717:glusterd_compare_friend_volume] 0-management: Version of Cksums storage differ. local cksum = 50348222, remote cksum = 50348735 on peer 10.0.231.55
[2016-02-29 19:16:42.082298] I [MSGID: 106493] [glusterd-handler.c:3780:glusterd_xfer_friend_add_resp] 0-glusterd: Responded to 10.0.231.55 (0), ret: 0
[2016-02-29 19:16:42.092535] I [MSGID: 106493] [glusterd-rpc-ops.c:480:__glusterd_friend_add_cbk] 0-glusterd: Received RJT from uuid: 2c0b8bb6-825a-4ddd-9958-d8b46e9a2411, host: 10.0.231.53, port: 0
[2016-02-29 19:16:42.096036] I [MSGID: 106143] [glusterd-pmap.c:229:pmap_registry_bind] 0-pmap: adding brick /mnt/lv-export-domain-storage/export-domain-storage on port 49153
[2016-02-29 19:16:42.097296] I [MSGID: 106143] [glusterd-pmap.c:229:pmap_registry_bind] 0-pmap: adding brick /mnt/lv-vm-storage/vm-storage on port 49155
[2016-02-29 19:16:42.100727] I [MSGID: 106163] [glusterd-handshake.c:1193:__glusterd_mgmt_hndsk_versions_ack] 0-management: using the op-version 30700
[2016-02-29 19:16:42.108495] I [MSGID: 106490] [glusterd-handler.c:2539:__glusterd_handle_incoming_friend_req] 0-glusterd: Received probe from uuid: 2c0b8bb6-825a-4ddd-9958-d8b46e9a2411
[2016-02-29 19:16:42.109295] E [MSGID: 106010] [glusterd-utils.c:2717:glusterd_compare_friend_volume] 0-management: Version of Cksums storage differ. local cksum = 50348222, remote cksum = 50348735 on peer 10.0.231.53
[2016-02-29 19:16:42.109338] I [MSGID: 106493] [glusterd-handler.c:3780:glusterd_xfer_friend_add_resp] 0-glusterd: Responded to 10.0.231.53 (0), ret: 0
[2016-02-29 19:16:42.119521] I [MSGID: 106143] [glusterd-pmap.c:229:pmap_registry_bind] 0-pmap: adding brick /mnt/lv-env-modules/env-modules on port 49157
[2016-02-29 19:16:42.122856] I [MSGID: 106143] [glusterd-pmap.c:229:pmap_registry_bind] 0-pmap: adding brick /mnt/raid6-storage/storage on port 49156
[2016-02-29 19:16:42.508104] I [MSGID: 106493] [glusterd-rpc-ops.c:480:__glusterd_friend_add_cbk] 0-glusterd: Received RJT from uuid: b01de59a-4428-486b-af49-cb486ab44a07, host: 10.0.231.51, port: 0
[2016-02-29 19:16:42.519403] I [MSGID: 106163] [glusterd-handshake.c:1193:__glusterd_mgmt_hndsk_versions_ack] 0-management: using the op-version 30700
[2016-02-29 19:16:42.524353] I [MSGID: 106490] [glusterd-handler.c:2539:__glusterd_handle_incoming_friend_req] 0-glusterd: Received probe from uuid: b01de59a-4428-486b-af49-cb486ab44a07
[2016-02-29 19:16:42.524999] E [MSGID: 106010] [glusterd-utils.c:2717:glusterd_compare_friend_volume] 0-management: Version of Cksums storage differ. local cksum = 50348222, remote cksum = 50348735 on peer 10.0.231.51
[2016-02-29 19:16:42.525038] I [MSGID: 106493] [glusterd-handler.c:3780:glusterd_xfer_friend_add_resp] 0-glusterd: Responded to 10.0.231.51 (0), ret: 0
[2016-02-29 19:16:42.592523] I [MSGID: 106493] [glusterd-rpc-ops.c:480:__glusterd_friend_add_cbk] 0-glusterd: Received RJT from uuid: 408d88d6-0448-41e8-94a3-bf9f98255d9c, host: 10.0.231.54, port: 0
[2016-02-29 19:16:42.599518] I [MSGID: 106163] [glusterd-handshake.c:1193:__glusterd_mgmt_hndsk_versions_ack] 0-management: using the op-version 30700
[2016-02-29 19:16:42.604821] I [MSGID: 106490] [glusterd-handler.c:2539:__glusterd_handle_incoming_friend_req] 0-glusterd: Received probe from uuid: 408d88d6-0448-41e8-94a3-bf9f98255d9c
[2016-02-29 19:16:42.605458] E [MSGID: 106010] [glusterd-utils.c:2717:glusterd_compare_friend_volume] 0-management: Version of Cksums storage differ. local cksum = 50348222, remote cksum = 50348735 on peer 10.0.231.54
[2016-02-29 19:16:42.605492] I [MSGID: 106493] [glusterd-handler.c:3780:glusterd_xfer_friend_add_resp] 0-glusterd: Responded to 10.0.231.54 (0), ret: 0
[2016-02-29 19:16:42.621943] I [MSGID: 106163] [glusterd-handshake.c:1193:__glusterd_mgmt_hndsk_versions_ack] 0-management: using the op-version 30700
[2016-02-29 19:16:42.628443] I [MSGID: 106490] [glusterd-handler.c:2539:__glusterd_handle_incoming_friend_req] 0-glusterd: Received probe from uuid: a965e782-39e2-41cc-a0d1-b32ecccdcd2f
[2016-02-29 19:16:42.629079] E [MSGID: 106010] [glusterd-utils.c:2717:glusterd_compare_friend_volume] 0-management: Version of Cksums storage differ. local cksum = 50348222, remote cksum = 50348735 on peer 10.0.231.50

On gluster01/02/04/05, /var/lib/glusterd/vols/storage/cksum contains:
info=998305000

On gluster03, /var/lib/glusterd/vols/storage/cksum contains:
info=998305001

How do I recover from this? Can I just stop glusterd on gluster03 and change the cksum value?

On Thu, Feb 25, 2016 at 12:49 PM, Mohammed Rafi K C <rkavunga@xxxxxxxxxx> wrote:

If all the op-versions are the same (3.7.6), then to work around the issue you can manually set quota-version=1; restarting glusterd will solve the problem. But I would strongly recommend that you figure out the RCA. Maybe you can file a bug for this.
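In practice that workaround comes down to something like this on each of the two new nodes (a sketch only; "storage" is the volume shown in the configs below):

# systemctl stop glusterd
# sed -i 's/^quota-version=0/quota-version=1/' /var/lib/glusterd/vols/storage/info
# systemctl start glusterd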
On 02/26/2016 01:53 AM, Mohammed Rafi K C wrote:
On 02/26/2016 01:32 AM, Steve Dainard wrote:
I haven't done anything more than peer thus far, so I'm a bit confused as to how the volume info fits in. Can you expand on this a bit?
Failed commits? Is this split brain on the replica volumes? I don't get any output from 'gluster volume heal <volname> info' on any of the replica volumes, but if I try 'gluster volume heal <volname> full' I get: 'Launching heal operation to perform full self heal on volume <volname> has been unsuccessful'.
Forget about this; it is not for metadata self-heal.
I have 5 volumes total.
'Replica 3' volumes running on gluster01/02/03:
vm-storage
iso-storage
export-domain-storage
env-modules
And one distributed only volume 'storage' info shown below:
From existing host gluster01/02:
type=0
count=4
status=1
sub_count=0
stripe_count=1
replica_count=1
disperse_count=0
redundancy_count=0
version=25
transport-type=0
volume-id=26d355cb-c486-481f-ac16-e25390e73775
username=eb9e2063-6ba8-4d16-a54f-2c7cf7740c4c
password=
op-version=3
client-op-version=3
quota-version=1
parent_volname=N/A
restored_from_snap=00000000-0000-0000-0000-000000000000
snap-max-hard-limit=256
features.quota-deem-statfs=on
features.inode-quota=on
diagnostics.brick-log-level=WARNING
features.quota=on
performance.readdir-ahead=on
performance.cache-size=1GB
performance.stat-prefetch=on
brick-0=10.0.231.50:-mnt-raid6-storage-storage
brick-1=10.0.231.51:-mnt-raid6-storage-storage
brick-2=10.0.231.52:-mnt-raid6-storage-storage
brick-3=10.0.231.53:-mnt-raid6-storage-storage
From existing host gluster03/04:
type=0
count=4
status=1
sub_count=0
stripe_count=1
replica_count=1
disperse_count=0
redundancy_count=0
version=25
transport-type=0
volume-id=26d355cb-c486-481f-ac16-e25390e73775
username=eb9e2063-6ba8-4d16-a54f-2c7cf7740c4c
password=
op-version=3
client-op-version=3
quota-version=1
parent_volname=N/A
restored_from_snap=00000000-0000-0000-0000-000000000000
snap-max-hard-limit=256
features.quota-deem-statfs=on
features.inode-quota=on
performance.stat-prefetch=on
performance.cache-size=1GB
performance.readdir-ahead=on
features.quota=on
diagnostics.brick-log-level=WARNING
brick-0=10.0.231.50:-mnt-raid6-storage-storage
brick-1=10.0.231.51:-mnt-raid6-storage-storage
brick-2=10.0.231.52:-mnt-raid6-storage-storage
brick-3=10.0.231.53:-mnt-raid6-storage-storage
So far between gluster01/02 and gluster03/04 the configs are the same, although the ordering is different for some of the features.
On gluster05/06 the ordering is different again, and the quota-version=0 instead of 1.
This is why the peer shows as rejected. Can you check the op-version of all the glusterd instances, including the one that is in the rejected state? You can find the op-version in /var/lib/glusterd/glusterd.info.
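For example (every node should report the same value):

# grep operating-version /var/lib/glusterd/glusterd.info     <-- run on all the nodes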
Rafi
Rafi KC
From new hosts gluster05/gluster06:
type=0
count=4
status=1
sub_count=0
stripe_count=1
replica_count=1
disperse_count=0
redundancy_count=0
version=25
transport-type=0
volume-id=26d355cb-c486-481f-ac16-e25390e73775
username=eb9e2063-6ba8-4d16-a54f-2c7cf7740c4c
password=
op-version=3
client-op-version=3
quota-version=0
parent_volname=N/A
restored_from_snap=00000000-0000-0000-0000-000000000000
snap-max-hard-limit=256
performance.stat-prefetch=on
performance.cache-size=1GB
performance.readdir-ahead=on
features.quota=on
diagnostics.brick-log-level=WARNING
features.inode-quota=on
features.quota-deem-statfs=on
brick-0=10.0.231.50:-mnt-raid6-storage-storage
brick-1=10.0.231.51:-mnt-raid6-storage-storage
brick-2=10.0.231.52:-mnt-raid6-storage-storage
brick-3=10.0.231.53:-mnt-raid6-storage-storage
Also, I forgot to mention that when I initially peer'd the two new hosts, glusterd crashed on gluster03 and had to be restarted (log attached), but it has been fine since.
Thanks,
Steve
On Thu, Feb 25, 2016 at 11:27 AM, Mohammed Rafi K C <rkavunga@xxxxxxxxxx> wrote:
On 02/25/2016 11:45 PM, Steve Dainard wrote:
Hello,
I upgraded from 3.6.6 to 3.7.6 a couple weeks ago. I just peered 2 new nodes to a 4 node cluster and gluster peer status is:
# gluster peer status <-- from node gluster01
Number of Peers: 5
Hostname: 10.0.231.51
Uuid: b01de59a-4428-486b-af49-cb486ab44a07
State: Peer in Cluster (Connected)
Hostname: 10.0.231.52
Uuid: 75143760-52a3-4583-82bb-a9920b283dac
State: Peer in Cluster (Connected)
Hostname: 10.0.231.53
Uuid: 2c0b8bb6-825a-4ddd-9958-d8b46e9a2411
State: Peer in Cluster (Connected)
Hostname: 10.0.231.54 <-- new node gluster05
Uuid: 408d88d6-0448-41e8-94a3-bf9f98255d9c
State: Peer Rejected (Connected)
Hostname: 10.0.231.55 <-- new node gluster06
Uuid: 9c155c8e-2cd1-4cfc-83af-47129b582fd3
State: Peer Rejected (Connected)
It looks like your configuration files are mismatched, i.e. the checksum calculation differs on these two nodes compared to the others.
Did you have any failed commits?
Compare the /var/lib/glusterd/vols/<volname>/info file on the failed node against a good one; most likely you will see some difference.
Can you paste the /var/lib/glusterd/vols/<volname>/info contents?
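Something like this would show the difference (a sketch, assuming passwordless ssh between the nodes and using the distributed "storage" volume as the example; sorting avoids false diffs from key ordering):

# for h in gluster01 gluster05; do ssh $h cat /var/lib/glusterd/vols/storage/info | sort > /tmp/info.$h; done
# diff /tmp/info.gluster01 /tmp/info.gluster05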
Regards
Rafi KC
I followed the write-up here: http://www.gluster.org/community/documentation/index.php/Resolving_Peer_Rejected and the two new nodes peer'd properly, but after a reboot of the two new nodes I'm seeing the same Peer Rejected (Connected) state.
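For reference, the procedure from that write-up is roughly the following on each rejected node (paraphrasing from memory — the linked page has the exact steps):

# systemctl stop glusterd
# cd /var/lib/glusterd && find . -mindepth 1 -maxdepth 1 ! -name glusterd.info -exec rm -rf {} +     <-- remove everything except glusterd.info
# systemctl start glusterd
# gluster peer probe <one of the good nodes>
# systemctl restart glusterd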
I've attached logs from an existing node, and the two new nodes.
Thanks for any suggestions,
Steve
_______________________________________________
Gluster-users mailing list
Gluster-users@xxxxxxxxxxx
http://www.gluster.org/mailman/listinfo/gluster-users