On 12/20/2017 01:01 AM, Hari Gowtham wrote:
> Yes Atin. I'll take a look.

Once we have a root cause and a way around it, please document this in the
upgrade procedure in our docs as well. That way future problems have a
documented solution (outside of the lists as well). Thanks!

>
> On Wed, Dec 20, 2017 at 11:28 AM, Atin Mukherjee <amukherj@xxxxxxxxxx> wrote:
>> Looks like a bug, as I see tier-enabled = 0 is an additional entry in the
>> info file on shchhv01. As per the code, this field should be written into
>> the glusterd store if the op-version is >= 30706. What I am guessing is
>> that since we didn't have commit 33f8703a1 ("glusterd: regenerate volfiles
>> on op-version bump up") in 3.8.4, the info and volfiles were not
>> regenerated while bumping up the op-version, which caused the tier-enabled
>> entry to be missing in the info file.
>>
>> For now, you can copy the info file for the volumes where the mismatch
>> happened from shchhv01 to shchhv02 and restart the glusterd service on
>> shchhv02. That should fix this up temporarily. Unfortunately this step
>> might need to be repeated for the other nodes as well.
>>
>> @Hari - Could you help in debugging this further?
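For anyone following along, the copy-and-restart workaround Atin describes
boils down to something like the rough sketch below. This is only an outline,
assuming the affected volume is shchst01, the info file lives at
/var/lib/glusterd/vols/shchst01/info (the path Atin mentions further down in
this thread), root SSH works between the peers, and glusterd is managed by
systemd; adjust names and paths to your environment.

  # on shchhv02, keep a backup of the current info file first
  cp /var/lib/glusterd/vols/shchst01/info /var/lib/glusterd/vols/shchst01/info.bak

  # pull the info file from the upgraded node shchhv01
  scp root@shchhv01:/var/lib/glusterd/vols/shchst01/info \
      /var/lib/glusterd/vols/shchst01/info

  # restart glusterd on shchhv02 and re-check the peer state
  systemctl restart glusterd
  gluster peer status

As Atin notes, this may have to be repeated on each peer that still shows the
checksum mismatch.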
>>
>> On Wed, Dec 20, 2017 at 10:44 AM, Gustave Dahl <gustave@xxxxxxxxxxxxxx> wrote:
>>>
>>> I was attempting the same on a local sandbox and also have the same
>>> problem.
>>>
>>> Current: 3.8.4
>>>
>>> Volume Name: shchst01
>>> Type: Distributed-Replicate
>>> Volume ID: bcd53e52-cde6-4e58-85f9-71d230b7b0d3
>>> Status: Started
>>> Snapshot Count: 0
>>> Number of Bricks: 4 x 3 = 12
>>> Transport-type: tcp
>>> Bricks:
>>> Brick1: shchhv01-sto:/data/brick3/shchst01
>>> Brick2: shchhv02-sto:/data/brick3/shchst01
>>> Brick3: shchhv03-sto:/data/brick3/shchst01
>>> Brick4: shchhv01-sto:/data/brick1/shchst01
>>> Brick5: shchhv02-sto:/data/brick1/shchst01
>>> Brick6: shchhv03-sto:/data/brick1/shchst01
>>> Brick7: shchhv02-sto:/data/brick2/shchst01
>>> Brick8: shchhv03-sto:/data/brick2/shchst01
>>> Brick9: shchhv04-sto:/data/brick2/shchst01
>>> Brick10: shchhv02-sto:/data/brick4/shchst01
>>> Brick11: shchhv03-sto:/data/brick4/shchst01
>>> Brick12: shchhv04-sto:/data/brick4/shchst01
>>> Options Reconfigured:
>>> cluster.data-self-heal-algorithm: full
>>> features.shard-block-size: 512MB
>>> features.shard: enable
>>> performance.readdir-ahead: on
>>> storage.owner-uid: 9869
>>> storage.owner-gid: 9869
>>> server.allow-insecure: on
>>> performance.quick-read: off
>>> performance.read-ahead: off
>>> performance.io-cache: off
>>> performance.stat-prefetch: off
>>> cluster.eager-lock: enable
>>> network.remote-dio: enable
>>> cluster.quorum-type: auto
>>> cluster.server-quorum-type: server
>>> cluster.self-heal-daemon: on
>>> nfs.disable: on
>>> performance.io-thread-count: 64
>>> performance.cache-size: 1GB
>>>
>>> Upgraded shchhv01-sto to 3.12.3; the others remain at 3.8.4.
>>>
>>> RESULT
>>> =====================
>>> Hostname: shchhv01-sto
>>> Uuid: f6205edb-a0ea-4247-9594-c4cdc0d05816
>>> State: Peer Rejected (Connected)
>>>
>>> Upgraded Server: shchhv01-sto
>>> ==============================
>>> [2017-12-20 05:02:44.747313] I [MSGID: 101190]
>>> [event-epoll.c:613:event_dispatch_epoll_worker] 0-epoll: Started thread
>>> with index 1
>>> [2017-12-20 05:02:44.747387] I [MSGID: 101190]
>>> [event-epoll.c:613:event_dispatch_epoll_worker] 0-epoll: Started thread
>>> with index 2
>>> [2017-12-20 05:02:44.749087] W [rpc-clnt-ping.c:246:rpc_clnt_ping_cbk]
>>> 0-management: RPC_CLNT_PING notify failed
>>> [2017-12-20 05:02:44.749165] W [rpc-clnt-ping.c:246:rpc_clnt_ping_cbk]
>>> 0-management: RPC_CLNT_PING notify failed
>>> [2017-12-20 05:02:44.749563] W [rpc-clnt-ping.c:246:rpc_clnt_ping_cbk]
>>> 0-management: RPC_CLNT_PING notify failed
>>> [2017-12-20 05:02:54.676324] I [MSGID: 106493]
>>> [glusterd-rpc-ops.c:486:__glusterd_friend_add_cbk] 0-glusterd: Received
>>> RJT from uuid: 546503ae-ba0e-40d4-843f-c5dbac22d272, host: shchhv02-sto,
>>> port: 0
>>> [2017-12-20 05:02:54.690237] I [MSGID: 106163]
>>> [glusterd-handshake.c:1316:__glusterd_mgmt_hndsk_versions_ack]
>>> 0-management: using the op-version 30800
>>> [2017-12-20 05:02:54.695823] I [MSGID: 106490]
>>> [glusterd-handler.c:2540:__glusterd_handle_incoming_friend_req]
>>> 0-glusterd: Received probe from uuid: 546503ae-ba0e-40d4-843f-c5dbac22d272
>>> [2017-12-20 05:02:54.696956] E [MSGID: 106010]
>>> [glusterd-utils.c:3370:glusterd_compare_friend_volume] 0-management:
>>> Version of Cksums shchst01-sto differ. local cksum = 4218452135, remote
>>> cksum = 2747317484 on peer shchhv02-sto
>>> [2017-12-20 05:02:54.697796] I [MSGID: 106493]
>>> [glusterd-handler.c:3800:glusterd_xfer_friend_add_resp] 0-glusterd:
>>> Responded to shchhv02-sto (0), ret: 0, op_ret: -1
>>> [2017-12-20 05:02:55.033822] I [MSGID: 106493]
>>> [glusterd-rpc-ops.c:486:__glusterd_friend_add_cbk] 0-glusterd: Received
>>> RJT from uuid: 3de22cb5-c1c1-4041-a1e1-eb969afa9b4b, host: shchhv03-sto,
>>> port: 0
>>> [2017-12-20 05:02:55.038460] I [MSGID: 106163]
>>> [glusterd-handshake.c:1316:__glusterd_mgmt_hndsk_versions_ack]
>>> 0-management: using the op-version 30800
>>> [2017-12-20 05:02:55.040032] I [MSGID: 106490]
>>> [glusterd-handler.c:2540:__glusterd_handle_incoming_friend_req]
>>> 0-glusterd: Received probe from uuid: 3de22cb5-c1c1-4041-a1e1-eb969afa9b4b
>>> [2017-12-20 05:02:55.040266] E [MSGID: 106010]
>>> [glusterd-utils.c:3370:glusterd_compare_friend_volume] 0-management:
>>> Version of Cksums shchst01-sto differ. local cksum = 4218452135, remote
>>> cksum = 2747317484 on peer shchhv03-sto
>>> [2017-12-20 05:02:55.040405] I [MSGID: 106493]
>>> [glusterd-handler.c:3800:glusterd_xfer_friend_add_resp] 0-glusterd:
>>> Responded to shchhv03-sto (0), ret: 0, op_ret: -1
>>> [2017-12-20 05:02:55.584854] I [MSGID: 106493]
>>> [glusterd-rpc-ops.c:486:__glusterd_friend_add_cbk] 0-glusterd: Received
>>> RJT from uuid: 36306e37-d7f0-4fec-9140-0d0f1bd2d2d5, host: shchhv04-sto,
>>> port: 0
>>> [2017-12-20 05:02:55.595125] I [MSGID: 106163]
>>> [glusterd-handshake.c:1316:__glusterd_mgmt_hndsk_versions_ack]
>>> 0-management: using the op-version 30800
>>> [2017-12-20 05:02:55.600804] I [MSGID: 106490]
>>> [glusterd-handler.c:2540:__glusterd_handle_incoming_friend_req]
>>> 0-glusterd: Received probe from uuid: 36306e37-d7f0-4fec-9140-0d0f1bd2d2d5
>>> [2017-12-20 05:02:55.601288] E [MSGID: 106010]
>>> [glusterd-utils.c:3370:glusterd_compare_friend_volume] 0-management:
>>> Version of Cksums shchst01-sto differ. local cksum = 4218452135, remote
>>> cksum = 2747317484 on peer shchhv04-sto
>>> [2017-12-20 05:02:55.601497] I [MSGID: 106493]
>>> [glusterd-handler.c:3800:glusterd_xfer_friend_add_resp] 0-glusterd:
>>> Responded to shchhv04-sto (0), ret: 0, op_ret: -1
>>>
>>> Another Server: shchhv02-sto
>>> ==============================
>>> [2017-12-20 05:02:44.667833] W
>>> [glusterd-locks.c:675:glusterd_mgmt_v3_unlock]
>>> (-->/usr/lib64/glusterfs/3.8.4/xlator/mgmt/glusterd.so(+0x1de5c)
>>> [0x7f75fdc12e5c]
>>> -->/usr/lib64/glusterfs/3.8.4/xlator/mgmt/glusterd.so(+0x27a08)
>>> [0x7f75fdc1ca08]
>>> -->/usr/lib64/glusterfs/3.8.4/xlator/mgmt/glusterd.so(+0xd07fa)
>>> [0x7f75fdcc57fa] ) 0-management: Lock for vol shchst01-sto not held
>>> [2017-12-20 05:02:44.667795] I [MSGID: 106004]
>>> [glusterd-handler.c:5219:__glusterd_peer_rpc_notify] 0-management: Peer
>>> <shchhv01-sto> (<f6205edb-a0ea-4247-9594-c4cdc0d05816>), in state <Peer
>>> Rejected>, has disconnected from glusterd.
>>> [2017-12-20 05:02:44.667948] W [MSGID: 106118]
>>> [glusterd-handler.c:5241:__glusterd_peer_rpc_notify] 0-management: Lock
>>> not released for shchst01-sto
>>> [2017-12-20 05:02:44.760103] I [MSGID: 106163]
>>> [glusterd-handshake.c:1271:__glusterd_mgmt_hndsk_versions_ack]
>>> 0-management: using the op-version 30800
>>> [2017-12-20 05:02:44.765389] I [MSGID: 106490]
>>> [glusterd-handler.c:2608:__glusterd_handle_incoming_friend_req]
>>> 0-glusterd: Received probe from uuid: f6205edb-a0ea-4247-9594-c4cdc0d05816
>>> [2017-12-20 05:02:54.686185] E [MSGID: 106010]
>>> [glusterd-utils.c:2930:glusterd_compare_friend_volume] 0-management:
>>> Version of Cksums shchst01 differ. local cksum = 2747317484, remote
>>> cksum = 4218452135 on peer shchhv01-sto
>>> [2017-12-20 05:02:54.686882] I [MSGID: 106493]
>>> [glusterd-handler.c:3852:glusterd_xfer_friend_add_resp] 0-glusterd:
>>> Responded to shchhv01-sto (0), ret: 0, op_ret: -1
>>> [2017-12-20 05:02:54.717854] I [MSGID: 106493]
>>> [glusterd-rpc-ops.c:476:__glusterd_friend_add_cbk] 0-glusterd: Received
>>> RJT from uuid: f6205edb-a0ea-4247-9594-c4cdc0d05816, host: shchhv01-sto,
>>> port: 0
>>>
>>> Another Server: shchhv04-sto
>>> ==============================
>>> [2017-12-20 05:02:44.667620] I [MSGID: 106004]
>>> [glusterd-handler.c:5219:__glusterd_peer_rpc_notify] 0-management: Peer
>>> <shchhv01-sto> (<f6205edb-a0ea-4247-9594-c4cdc0d05816>), in state <Peer
>>> Rejected>, has disconnected from glusterd.
>>> [2017-12-20 05:02:44.667808] W
>>> [glusterd-locks.c:675:glusterd_mgmt_v3_unlock]
>>> (-->/usr/lib64/glusterfs/3.8.4/xlator/mgmt/glusterd.so(+0x1de5c)
>>> [0x7f10a33d9e5c]
>>> -->/usr/lib64/glusterfs/3.8.4/xlator/mgmt/glusterd.so(+0x27a08)
>>> [0x7f10a33e3a08]
>>> -->/usr/lib64/glusterfs/3.8.4/xlator/mgmt/glusterd.so(+0xd07fa)
>>> [0x7f10a348c7fa] ) 0-management: Lock for vol shchst01-sto not held
>>> [2017-12-20 05:02:44.667827] W [MSGID: 106118]
>>> [glusterd-handler.c:5241:__glusterd_peer_rpc_notify] 0-management: Lock
>>> not released for shchst01-sto
>>> [2017-12-20 05:02:44.760077] I [MSGID: 106163]
>>> [glusterd-handshake.c:1271:__glusterd_mgmt_hndsk_versions_ack]
>>> 0-management: using the op-version 30800
>>> [2017-12-20 05:02:44.768796] I [MSGID: 106490]
>>> [glusterd-handler.c:2608:__glusterd_handle_incoming_friend_req]
>>> 0-glusterd: Received probe from uuid: f6205edb-a0ea-4247-9594-c4cdc0d05816
>>> [2017-12-20 05:02:55.595095] E [MSGID: 106010]
>>> [glusterd-utils.c:2930:glusterd_compare_friend_volume] 0-management:
>>> Version of Cksums shchst01-sto differ. local cksum = 2747317484, remote
>>> cksum = 4218452135 on peer shchhv01-sto
>>> [2017-12-20 05:02:55.595273] I [MSGID: 106493]
>>> [glusterd-handler.c:3852:glusterd_xfer_friend_add_resp] 0-glusterd:
>>> Responded to shchhv01-sto (0), ret: 0, op_ret: -1
>>> [2017-12-20 05:02:55.612957] I [MSGID: 106493]
>>> [glusterd-rpc-ops.c:476:__glusterd_friend_add_cbk] 0-glusterd: Received
>>> RJT from uuid: f6205edb-a0ea-4247-9594-c4cdc0d05816, host: shchhv01-sto,
>>> port: 0
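(As an aside: the glusterd entries quoted above normally live in
/var/log/glusterfs/glusterd.log on a default install, so on each peer the
relevant mismatch lines can be pulled out with something along the lines of

  grep -E 'Cksums .* differ' /var/log/glusterfs/glusterd.log

which is a quick way to collect the local/remote checksum pairs Atin asks for
later in the thread.)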
>>>
>>> <vol>/info
>>>
>>> Upgraded Server: shchst01-sto
>>> =========================
>>> type=2
>>> count=12
>>> status=1
>>> sub_count=3
>>> stripe_count=1
>>> replica_count=3
>>> disperse_count=0
>>> redundancy_count=0
>>> version=52
>>> transport-type=0
>>> volume-id=bcd53e52-cde6-4e58-85f9-71d230b7b0d3
>>> username=5a4ae8d8-dbcb-408e-ab73-629255c14ffc
>>> password=58652573-0955-4d00-893a-9f42d0f16717
>>> op-version=30700
>>> client-op-version=30700
>>> quota-version=0
>>> tier-enabled=0
>>> parent_volname=N/A
>>> restored_from_snap=00000000-0000-0000-0000-000000000000
>>> snap-max-hard-limit=256
>>> cluster.data-self-heal-algorithm=full
>>> features.shard-block-size=512MB
>>> features.shard=enable
>>> nfs.disable=on
>>> cluster.self-heal-daemon=on
>>> cluster.server-quorum-type=server
>>> cluster.quorum-type=auto
>>> network.remote-dio=enable
>>> cluster.eager-lock=enable
>>> performance.stat-prefetch=off
>>> performance.io-cache=off
>>> performance.read-ahead=off
>>> performance.quick-read=off
>>> server.allow-insecure=on
>>> storage.owner-gid=9869
>>> storage.owner-uid=9869
>>> performance.readdir-ahead=on
>>> performance.io-thread-count=64
>>> performance.cache-size=1GB
>>> brick-0=shchhv01-sto:-data-brick3-shchst01
>>> brick-1=shchhv02-sto:-data-brick3-shchst01
>>> brick-2=shchhv03-sto:-data-brick3-shchst01
>>> brick-3=shchhv01-sto:-data-brick1-shchst01
>>> brick-4=shchhv02-sto:-data-brick1-shchst01
>>> brick-5=shchhv03-sto:-data-brick1-shchst01
>>> brick-6=shchhv02-sto:-data-brick2-shchst01
>>> brick-7=shchhv03-sto:-data-brick2-shchst01
>>> brick-8=shchhv04-sto:-data-brick2-shchst01
>>> brick-9=shchhv02-sto:-data-brick4-shchst01
>>> brick-10=shchhv03-sto:-data-brick4-shchst01
>>> brick-11=shchhv04-sto:-data-brick4-shchst01
>>>
>>> Another Server: shchhv02-sto
>>> ==============================
>>> type=2
>>> count=12
>>> status=1
>>> sub_count=3
>>> stripe_count=1
>>> replica_count=3
>>> disperse_count=0
>>> redundancy_count=0
>>> version=52
>>> transport-type=0
>>> volume-id=bcd53e52-cde6-4e58-85f9-71d230b7b0d3
>>> username=5a4ae8d8-dbcb-408e-ab73-629255c14ffc
>>> password=58652573-0955-4d00-893a-9f42d0f16717
>>> op-version=30700
>>> client-op-version=30700
>>> quota-version=0
>>> parent_volname=N/A
>>> restored_from_snap=00000000-0000-0000-0000-000000000000
>>> snap-max-hard-limit=256
>>> cluster.data-self-heal-algorithm=full
>>> features.shard-block-size=512MB
>>> features.shard=enable
>>> performance.readdir-ahead=on
>>> storage.owner-uid=9869
>>> storage.owner-gid=9869
>>> server.allow-insecure=on
>>> performance.quick-read=off
>>> performance.read-ahead=off
>>> performance.io-cache=off
>>> performance.stat-prefetch=off
>>> cluster.eager-lock=enable
>>> network.remote-dio=enable
>>> cluster.quorum-type=auto
>>> cluster.server-quorum-type=server
>>> cluster.self-heal-daemon=on
>>> nfs.disable=on
>>> performance.io-thread-count=64
>>> performance.cache-size=1GB
>>> brick-0=shchhv01-sto:-data-brick3-shchst01
>>> brick-1=shchhv02-sto:-data-brick3-shchst01
>>> brick-2=shchhv03-sto:-data-brick3-shchst01
>>> brick-3=shchhv01-sto:-data-brick1-shchst01
>>> brick-4=shchhv02-sto:-data-brick1-shchst01
>>> brick-5=shchhv03-sto:-data-brick1-shchst01
>>> brick-6=shchhv02-sto:-data-brick2-shchst01
>>> brick-7=shchhv03-sto:-data-brick2-shchst01
>>> brick-8=shchhv04-sto:-data-brick2-shchst01
>>> brick-9=shchhv02-sto:-data-brick4-shchst01
>>> brick-10=shchhv03-sto:-data-brick4-shchst01
>>> brick-11=shchhv04-sto:-data-brick4-shchst01
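The difference that stands out between the two dumps above is the extra
tier-enabled=0 line on the upgraded node (the option lines are also listed in
a different order, which makes a plain diff noisy). A quick way to compare the
stores directly between two peers is something like the following, run from
shchhv01-sto; this is just a sketch assuming bash (for the process
substitution), root SSH between the peers, and the shchst01 volume name used
in this sandbox:

  # sort both copies so only real content differences show up
  diff <(sort /var/lib/glusterd/vols/shchst01/info) \
       <(ssh root@shchhv02-sto sort /var/lib/glusterd/vols/shchst01/info)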
>>>
>>> NOTE
>>>
>>> [root@shchhv01 shchst01]# gluster volume get shchst01 cluster.op-version
>>> Warning: Support to get global option value using `volume get <volname>`
>>> will be deprecated from next release. Consider using `volume get all`
>>> instead for global options
>>> Option                                   Value
>>> ------                                   -----
>>> cluster.op-version                       30800
>>>
>>> [root@shchhv02 shchst01]# gluster volume get shchst01 cluster.op-version
>>> Option                                   Value
>>> ------                                   -----
>>> cluster.op-version                       30800
>>>
>>> -----Original Message-----
>>> From: gluster-users-bounces@xxxxxxxxxxx
>>> [mailto:gluster-users-bounces@xxxxxxxxxxx] On Behalf Of Ziemowit Pierzycki
>>> Sent: Tuesday, December 19, 2017 3:56 PM
>>> To: gluster-users <gluster-users@xxxxxxxxxxx>
>>> Subject: Re: Upgrading from Gluster 3.8 to 3.12
>>>
>>> I have not done the upgrade yet. Since this is a production cluster I need
>>> to make sure it stays up, or schedule some downtime if it doesn't.
>>> Thanks.
>>>
>>> On Tue, Dec 19, 2017 at 10:11 AM, Atin Mukherjee <amukherj@xxxxxxxxxx> wrote:
>>>>
>>>> On Tue, Dec 19, 2017 at 1:10 AM, Ziemowit Pierzycki
>>>> <ziemowit@xxxxxxxxxxxxx> wrote:
>>>>>
>>>>> Hi,
>>>>>
>>>>> I have a cluster of 10 servers, all running Fedora 24 along with
>>>>> Gluster 3.8. I'm planning on doing rolling upgrades to Fedora 27
>>>>> with Gluster 3.12. I saw the documentation and did some testing, but
>>>>> I would like to run my plan through some (more?) educated minds.
>>>>>
>>>>> The current setup is:
>>>>>
>>>>> Volume Name: vol0
>>>>> Distributed-Replicate
>>>>> Number of Bricks: 2 x (2 + 1) = 6
>>>>> Bricks:
>>>>> Brick1: glt01:/vol/vol0
>>>>> Brick2: glt02:/vol/vol0
>>>>> Brick3: glt05:/vol/vol0 (arbiter)
>>>>> Brick4: glt03:/vol/vol0
>>>>> Brick5: glt04:/vol/vol0
>>>>> Brick6: glt06:/vol/vol0 (arbiter)
>>>>>
>>>>> Volume Name: vol1
>>>>> Distributed-Replicate
>>>>> Number of Bricks: 2 x (2 + 1) = 6
>>>>> Bricks:
>>>>> Brick1: glt07:/vol/vol1
>>>>> Brick2: glt08:/vol/vol1
>>>>> Brick3: glt05:/vol/vol1 (arbiter)
>>>>> Brick4: glt09:/vol/vol1
>>>>> Brick5: glt10:/vol/vol1
>>>>> Brick6: glt06:/vol/vol1 (arbiter)
>>>>>
>>>>> After performing the upgrade, because of differences in checksums, the
>>>>> upgraded nodes will become:
>>>>>
>>>>> State: Peer Rejected (Connected)
>>>>
>>>> Have you upgraded all the nodes? If yes, have you bumped up the
>>>> cluster.op-version after upgrading all the nodes? Please follow:
>>>> http://docs.gluster.org/en/latest/Upgrade-Guide/op_version/ for more
>>>> details on how to bump up the cluster.op-version. In case you have
>>>> done all of these and you're seeing a checksum issue then I'm afraid
>>>> you have hit a bug. I'd need further details like the checksum
>>>> mismatch error from the glusterd.log file along with the exact
>>>> volume's info file from /var/lib/glusterd/vols/<volname>/info from
>>>> both of the peers to debug this further.
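To put Atin's op-version advice in concrete terms, the check-and-bump step
looks roughly like the commands below, run from any one node once every peer
is on the new binaries. This is only a sketch: cluster.max-op-version is
available on newer releases (3.10 onwards), and the exact value to set should
be taken from that query or from the op-version page linked above rather than
from this example.

  # current cluster-wide op-version, and the highest one the installed
  # binaries support
  gluster volume get all cluster.op-version
  gluster volume get all cluster.max-op-version

  # raise the cluster op-version to that maximum (a 312xx value for a
  # 3.12.x cluster; substitute the number reported above)
  gluster volume set all cluster.op-version <max-op-version>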
>>>>> If I start doing the upgrades one at a time, with nodes glt10 to
>>>>> glt01 except for the arbiters glt05 and glt06, and then upgrading the
>>>>> arbiters last, everything should remain online at all times through
>>>>> the process. Correct?
>>>>>
>>>>> Thanks.

_______________________________________________
Gluster-users mailing list
Gluster-users@xxxxxxxxxxx
http://lists.gluster.org/mailman/listinfo/gluster-users