Yes Atin. I'll take a look.

On Wed, Dec 20, 2017 at 11:28 AM, Atin Mukherjee <amukherj@xxxxxxxxxx> wrote:
> Looks like a bug, as I see that tier-enabled = 0 is an additional entry in
> the info file on shchhv01. As per the code, this field should be written
> into the glusterd store if the op-version is >= 30706. My guess is that,
> since 3.8.4 did not have commit 33f8703a1 "glusterd: regenerate volfiles on
> op-version bump up", the info and volfiles were not regenerated when the
> op-version was bumped up, which left the tier-enabled entry missing from
> the info file.
>
> For now, you can copy the info file for the volumes where the mismatch
> happened from shchhv01 to shchhv02 and restart the glusterd service on
> shchhv02. That should fix this up temporarily. Unfortunately this step
> might need to be repeated for the other nodes as well.
>
> @Hari - Could you help in debugging this further?
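A rough sketch of that workaround for anyone following along (this assumes the
default /var/lib/glusterd layout, that shchst01 is the affected volume, that
the nodes run systemd, and that root ssh/scp between them works; adjust
hostnames, volume names and service commands to your setup):

    # On the rejected peer (shchhv02-sto in this example):
    systemctl stop glusterd

    # Keep a backup of the current metadata before overwriting anything.
    cp /var/lib/glusterd/vols/shchst01/info /var/lib/glusterd/vols/shchst01/info.bak

    # Pull the info file from the node that already has the tier-enabled entry.
    scp shchhv01-sto:/var/lib/glusterd/vols/shchst01/info \
        /var/lib/glusterd/vols/shchst01/info

    # Start glusterd again so it re-reads the updated store.
    systemctl start glusterd

Afterwards, "gluster peer status" on the other nodes should show whether the
peer has moved out of the Rejected state.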
>
> On Wed, Dec 20, 2017 at 10:44 AM, Gustave Dahl <gustave@xxxxxxxxxxxxxx> wrote:
>>
>> I was attempting the same on a local sandbox and also have the same
>> problem.
>>
>> Current: 3.8.4
>>
>> Volume Name: shchst01
>> Type: Distributed-Replicate
>> Volume ID: bcd53e52-cde6-4e58-85f9-71d230b7b0d3
>> Status: Started
>> Snapshot Count: 0
>> Number of Bricks: 4 x 3 = 12
>> Transport-type: tcp
>> Bricks:
>> Brick1: shchhv01-sto:/data/brick3/shchst01
>> Brick2: shchhv02-sto:/data/brick3/shchst01
>> Brick3: shchhv03-sto:/data/brick3/shchst01
>> Brick4: shchhv01-sto:/data/brick1/shchst01
>> Brick5: shchhv02-sto:/data/brick1/shchst01
>> Brick6: shchhv03-sto:/data/brick1/shchst01
>> Brick7: shchhv02-sto:/data/brick2/shchst01
>> Brick8: shchhv03-sto:/data/brick2/shchst01
>> Brick9: shchhv04-sto:/data/brick2/shchst01
>> Brick10: shchhv02-sto:/data/brick4/shchst01
>> Brick11: shchhv03-sto:/data/brick4/shchst01
>> Brick12: shchhv04-sto:/data/brick4/shchst01
>> Options Reconfigured:
>> cluster.data-self-heal-algorithm: full
>> features.shard-block-size: 512MB
>> features.shard: enable
>> performance.readdir-ahead: on
>> storage.owner-uid: 9869
>> storage.owner-gid: 9869
>> server.allow-insecure: on
>> performance.quick-read: off
>> performance.read-ahead: off
>> performance.io-cache: off
>> performance.stat-prefetch: off
>> cluster.eager-lock: enable
>> network.remote-dio: enable
>> cluster.quorum-type: auto
>> cluster.server-quorum-type: server
>> cluster.self-heal-daemon: on
>> nfs.disable: on
>> performance.io-thread-count: 64
>> performance.cache-size: 1GB
>>
>> Upgraded shchhv01-sto to 3.12.3, others remain at 3.8.4
>>
>> RESULT
>> =====================
>> Hostname: shchhv01-sto
>> Uuid: f6205edb-a0ea-4247-9594-c4cdc0d05816
>> State: Peer Rejected (Connected)
>>
>> Upgraded Server: shchhv01-sto
>> ==============================
>> [2017-12-20 05:02:44.747313] I [MSGID: 101190] [event-epoll.c:613:event_dispatch_epoll_worker] 0-epoll: Started thread with index 1
>> [2017-12-20 05:02:44.747387] I [MSGID: 101190] [event-epoll.c:613:event_dispatch_epoll_worker] 0-epoll: Started thread with index 2
>> [2017-12-20 05:02:44.749087] W [rpc-clnt-ping.c:246:rpc_clnt_ping_cbk] 0-management: RPC_CLNT_PING notify failed
>> [2017-12-20 05:02:44.749165] W [rpc-clnt-ping.c:246:rpc_clnt_ping_cbk] 0-management: RPC_CLNT_PING notify failed
>> [2017-12-20 05:02:44.749563] W [rpc-clnt-ping.c:246:rpc_clnt_ping_cbk] 0-management: RPC_CLNT_PING notify failed
>> [2017-12-20 05:02:54.676324] I [MSGID: 106493] [glusterd-rpc-ops.c:486:__glusterd_friend_add_cbk] 0-glusterd: Received RJT from uuid: 546503ae-ba0e-40d4-843f-c5dbac22d272, host: shchhv02-sto, port: 0
>> [2017-12-20 05:02:54.690237] I [MSGID: 106163] [glusterd-handshake.c:1316:__glusterd_mgmt_hndsk_versions_ack] 0-management: using the op-version 30800
>> [2017-12-20 05:02:54.695823] I [MSGID: 106490] [glusterd-handler.c:2540:__glusterd_handle_incoming_friend_req] 0-glusterd: Received probe from uuid: 546503ae-ba0e-40d4-843f-c5dbac22d272
>> [2017-12-20 05:02:54.696956] E [MSGID: 106010] [glusterd-utils.c:3370:glusterd_compare_friend_volume] 0-management: Version of Cksums shchst01-sto differ. local cksum = 4218452135, remote cksum = 2747317484 on peer shchhv02-sto
>> [2017-12-20 05:02:54.697796] I [MSGID: 106493] [glusterd-handler.c:3800:glusterd_xfer_friend_add_resp] 0-glusterd: Responded to shchhv02-sto (0), ret: 0, op_ret: -1
>> [2017-12-20 05:02:55.033822] I [MSGID: 106493] [glusterd-rpc-ops.c:486:__glusterd_friend_add_cbk] 0-glusterd: Received RJT from uuid: 3de22cb5-c1c1-4041-a1e1-eb969afa9b4b, host: shchhv03-sto, port: 0
>> [2017-12-20 05:02:55.038460] I [MSGID: 106163] [glusterd-handshake.c:1316:__glusterd_mgmt_hndsk_versions_ack] 0-management: using the op-version 30800
>> [2017-12-20 05:02:55.040032] I [MSGID: 106490] [glusterd-handler.c:2540:__glusterd_handle_incoming_friend_req] 0-glusterd: Received probe from uuid: 3de22cb5-c1c1-4041-a1e1-eb969afa9b4b
>> [2017-12-20 05:02:55.040266] E [MSGID: 106010] [glusterd-utils.c:3370:glusterd_compare_friend_volume] 0-management: Version of Cksums shchst01-sto differ. local cksum = 4218452135, remote cksum = 2747317484 on peer shchhv03-sto
>> [2017-12-20 05:02:55.040405] I [MSGID: 106493] [glusterd-handler.c:3800:glusterd_xfer_friend_add_resp] 0-glusterd: Responded to shchhv03-sto (0), ret: 0, op_ret: -1
>> [2017-12-20 05:02:55.584854] I [MSGID: 106493] [glusterd-rpc-ops.c:486:__glusterd_friend_add_cbk] 0-glusterd: Received RJT from uuid: 36306e37-d7f0-4fec-9140-0d0f1bd2d2d5, host: shchhv04-sto, port: 0
>> [2017-12-20 05:02:55.595125] I [MSGID: 106163] [glusterd-handshake.c:1316:__glusterd_mgmt_hndsk_versions_ack] 0-management: using the op-version 30800
>> [2017-12-20 05:02:55.600804] I [MSGID: 106490] [glusterd-handler.c:2540:__glusterd_handle_incoming_friend_req] 0-glusterd: Received probe from uuid: 36306e37-d7f0-4fec-9140-0d0f1bd2d2d5
>> [2017-12-20 05:02:55.601288] E [MSGID: 106010] [glusterd-utils.c:3370:glusterd_compare_friend_volume] 0-management: Version of Cksums shchst01-sto differ. local cksum = 4218452135, remote cksum = 2747317484 on peer shchhv04-sto
>> [2017-12-20 05:02:55.601497] I [MSGID: 106493] [glusterd-handler.c:3800:glusterd_xfer_friend_add_resp] 0-glusterd: Responded to shchhv04-sto (0), ret: 0, op_ret: -1
>>
>> Another Server: shchhv02-sto
>> ==============================
>> [2017-12-20 05:02:44.667833] W [glusterd-locks.c:675:glusterd_mgmt_v3_unlock] (-->/usr/lib64/glusterfs/3.8.4/xlator/mgmt/glusterd.so(+0x1de5c) [0x7f75fdc12e5c] -->/usr/lib64/glusterfs/3.8.4/xlator/mgmt/glusterd.so(+0x27a08) [0x7f75fdc1ca08] -->/usr/lib64/glusterfs/3.8.4/xlator/mgmt/glusterd.so(+0xd07fa) [0x7f75fdcc57fa] ) 0-management: Lock for vol shchst01-sto not held
>> [2017-12-20 05:02:44.667795] I [MSGID: 106004] [glusterd-handler.c:5219:__glusterd_peer_rpc_notify] 0-management: Peer <shchhv01-sto> (<f6205edb-a0ea-4247-9594-c4cdc0d05816>), in state <Peer Rejected>, has disconnected from glusterd.
>> [2017-12-20 05:02:44.667948] W [MSGID: 106118] [glusterd-handler.c:5241:__glusterd_peer_rpc_notify] 0-management: Lock not released for shchst01-sto
>> [2017-12-20 05:02:44.760103] I [MSGID: 106163] [glusterd-handshake.c:1271:__glusterd_mgmt_hndsk_versions_ack] 0-management: using the op-version 30800
>> [2017-12-20 05:02:44.765389] I [MSGID: 106490] [glusterd-handler.c:2608:__glusterd_handle_incoming_friend_req] 0-glusterd: Received probe from uuid: f6205edb-a0ea-4247-9594-c4cdc0d05816
>> [2017-12-20 05:02:54.686185] E [MSGID: 106010] [glusterd-utils.c:2930:glusterd_compare_friend_volume] 0-management: Version of Cksums shchst01 differ. local cksum = 2747317484, remote cksum = 4218452135 on peer shchhv01-sto
>> [2017-12-20 05:02:54.686882] I [MSGID: 106493] [glusterd-handler.c:3852:glusterd_xfer_friend_add_resp] 0-glusterd: Responded to shchhv01-sto (0), ret: 0, op_ret: -1
>> [2017-12-20 05:02:54.717854] I [MSGID: 106493] [glusterd-rpc-ops.c:476:__glusterd_friend_add_cbk] 0-glusterd: Received RJT from uuid: f6205edb-a0ea-4247-9594-c4cdc0d05816, host: shchhv01-sto, port: 0
>>
>> Another Server: shchhv04-sto
>> ==============================
>> [2017-12-20 05:02:44.667620] I [MSGID: 106004] [glusterd-handler.c:5219:__glusterd_peer_rpc_notify] 0-management: Peer <shchhv01-sto> (<f6205edb-a0ea-4247-9594-c4cdc0d05816>), in state <Peer Rejected>, has disconnected from glusterd.
>> [2017-12-20 05:02:44.667808] W [glusterd-locks.c:675:glusterd_mgmt_v3_unlock] (-->/usr/lib64/glusterfs/3.8.4/xlator/mgmt/glusterd.so(+0x1de5c) [0x7f10a33d9e5c] -->/usr/lib64/glusterfs/3.8.4/xlator/mgmt/glusterd.so(+0x27a08) [0x7f10a33e3a08] -->/usr/lib64/glusterfs/3.8.4/xlator/mgmt/glusterd.so(+0xd07fa) [0x7f10a348c7fa] ) 0-management: Lock for vol shchst01-sto not held
>> [2017-12-20 05:02:44.667827] W [MSGID: 106118] [glusterd-handler.c:5241:__glusterd_peer_rpc_notify] 0-management: Lock not released for shchst01-sto
>> [2017-12-20 05:02:44.760077] I [MSGID: 106163] [glusterd-handshake.c:1271:__glusterd_mgmt_hndsk_versions_ack] 0-management: using the op-version 30800
>> [2017-12-20 05:02:44.768796] I [MSGID: 106490] [glusterd-handler.c:2608:__glusterd_handle_incoming_friend_req] 0-glusterd: Received probe from uuid: f6205edb-a0ea-4247-9594-c4cdc0d05816
>> [2017-12-20 05:02:55.595095] E [MSGID: 106010] [glusterd-utils.c:2930:glusterd_compare_friend_volume] 0-management: Version of Cksums shchst01-sto differ. local cksum = 2747317484, remote cksum = 4218452135 on peer shchhv01-sto
>> [2017-12-20 05:02:55.595273] I [MSGID: 106493] [glusterd-handler.c:3852:glusterd_xfer_friend_add_resp] 0-glusterd: Responded to shchhv01-sto (0), ret: 0, op_ret: -1
>> [2017-12-20 05:02:55.612957] I [MSGID: 106493] [glusterd-rpc-ops.c:476:__glusterd_friend_add_cbk] 0-glusterd: Received RJT from uuid: f6205edb-a0ea-4247-9594-c4cdc0d05816, host: shchhv01-sto, port: 0
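Quick aside on the "Version of Cksums ... differ" errors above: the checksum
each glusterd compares is kept on disk next to the volume metadata, so you can
confirm which peers disagree before digging into the info files that follow.
A sketch, assuming root ssh between the nodes and the default /var/lib/glusterd
layout:

    for h in shchhv01-sto shchhv02-sto shchhv03-sto shchhv04-sto; do
        echo "== $h"
        ssh "$h" cat /var/lib/glusterd/vols/shchst01/cksum
    done

In this reproduction the upgraded node would be expected to report a different
info checksum than the three peers still on 3.8.4.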
>>
>> <vol>/info
>>
>> Upgraded Server: shchhv01-sto
>> =========================
>> type=2
>> count=12
>> status=1
>> sub_count=3
>> stripe_count=1
>> replica_count=3
>> disperse_count=0
>> redundancy_count=0
>> version=52
>> transport-type=0
>> volume-id=bcd53e52-cde6-4e58-85f9-71d230b7b0d3
>> username=5a4ae8d8-dbcb-408e-ab73-629255c14ffc
>> password=58652573-0955-4d00-893a-9f42d0f16717
>> op-version=30700
>> client-op-version=30700
>> quota-version=0
>> tier-enabled=0
>> parent_volname=N/A
>> restored_from_snap=00000000-0000-0000-0000-000000000000
>> snap-max-hard-limit=256
>> cluster.data-self-heal-algorithm=full
>> features.shard-block-size=512MB
>> features.shard=enable
>> nfs.disable=on
>> cluster.self-heal-daemon=on
>> cluster.server-quorum-type=server
>> cluster.quorum-type=auto
>> network.remote-dio=enable
>> cluster.eager-lock=enable
>> performance.stat-prefetch=off
>> performance.io-cache=off
>> performance.read-ahead=off
>> performance.quick-read=off
>> server.allow-insecure=on
>> storage.owner-gid=9869
>> storage.owner-uid=9869
>> performance.readdir-ahead=on
>> performance.io-thread-count=64
>> performance.cache-size=1GB
>> brick-0=shchhv01-sto:-data-brick3-shchst01
>> brick-1=shchhv02-sto:-data-brick3-shchst01
>> brick-2=shchhv03-sto:-data-brick3-shchst01
>> brick-3=shchhv01-sto:-data-brick1-shchst01
>> brick-4=shchhv02-sto:-data-brick1-shchst01
>> brick-5=shchhv03-sto:-data-brick1-shchst01
>> brick-6=shchhv02-sto:-data-brick2-shchst01
>> brick-7=shchhv03-sto:-data-brick2-shchst01
>> brick-8=shchhv04-sto:-data-brick2-shchst01
>> brick-9=shchhv02-sto:-data-brick4-shchst01
>> brick-10=shchhv03-sto:-data-brick4-shchst01
>> brick-11=shchhv04-sto:-data-brick4-shchst01
>>
>> Another Server: shchhv02-sto
>> ==============================
>> type=2
>> count=12
>> status=1
>> sub_count=3
>> stripe_count=1
>> replica_count=3
>> disperse_count=0
>> redundancy_count=0
>> version=52
>> transport-type=0
>> volume-id=bcd53e52-cde6-4e58-85f9-71d230b7b0d3
>> username=5a4ae8d8-dbcb-408e-ab73-629255c14ffc
>> password=58652573-0955-4d00-893a-9f42d0f16717
>> op-version=30700
>> client-op-version=30700
>> quota-version=0
>> parent_volname=N/A
>> restored_from_snap=00000000-0000-0000-0000-000000000000
>> snap-max-hard-limit=256
>> cluster.data-self-heal-algorithm=full
>> features.shard-block-size=512MB
>> features.shard=enable
>> performance.readdir-ahead=on
>> storage.owner-uid=9869
>> storage.owner-gid=9869
>> server.allow-insecure=on
>> performance.quick-read=off
>> performance.read-ahead=off
>> performance.io-cache=off
>> performance.stat-prefetch=off
>> cluster.eager-lock=enable
>> network.remote-dio=enable
>> cluster.quorum-type=auto
>> cluster.server-quorum-type=server
>> cluster.self-heal-daemon=on
>> nfs.disable=on
>> performance.io-thread-count=64
>> performance.cache-size=1GB
>> brick-0=shchhv01-sto:-data-brick3-shchst01
>> brick-1=shchhv02-sto:-data-brick3-shchst01
>> brick-2=shchhv03-sto:-data-brick3-shchst01
>> brick-3=shchhv01-sto:-data-brick1-shchst01
>> brick-4=shchhv02-sto:-data-brick1-shchst01
>> brick-5=shchhv03-sto:-data-brick1-shchst01
>> brick-6=shchhv02-sto:-data-brick2-shchst01
>> brick-7=shchhv03-sto:-data-brick2-shchst01
>> brick-8=shchhv04-sto:-data-brick2-shchst01
>> brick-9=shchhv02-sto:-data-brick4-shchst01
>> brick-10=shchhv03-sto:-data-brick4-shchst01
>> brick-11=shchhv04-sto:-data-brick4-shchst01
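Side note: with both info files in hand, a straight diff makes the mismatch
easy to see. A sketch, run from shchhv01-sto and assuming ssh access to
shchhv02-sto plus the default paths:

    ssh shchhv02-sto cat /var/lib/glusterd/vols/shchst01/info | \
        diff /var/lib/glusterd/vols/shchst01/info -

Going by Atin's explanation at the top of the thread, the only difference
reported should be the tier-enabled=0 line that exists on shchhv01 but not on
shchhv02.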
>>
>> NOTE
>>
>> [root@shchhv01 shchst01]# gluster volume get shchst01 cluster.op-version
>> Warning: Support to get global option value using `volume get <volname>`
>> will be deprecated from next release. Consider using `volume get all`
>> instead for global options
>> Option                                   Value
>> ------                                   -----
>> cluster.op-version                       30800
>>
>> [root@shchhv02 shchst01]# gluster volume get shchst01 cluster.op-version
>> Option                                   Value
>> ------                                   -----
>> cluster.op-version                       30800
>>
>> -----Original Message-----
>> From: gluster-users-bounces@xxxxxxxxxxx
>> [mailto:gluster-users-bounces@xxxxxxxxxxx] On Behalf Of Ziemowit Pierzycki
>> Sent: Tuesday, December 19, 2017 3:56 PM
>> To: gluster-users <gluster-users@xxxxxxxxxxx>
>> Subject: Re: Upgrading from Gluster 3.8 to 3.12
>>
>> I have not done the upgrade yet. Since this is a production cluster, I need
>> to make sure it stays up, or schedule some downtime if it doesn't.
>> Thanks.
>>
>> On Tue, Dec 19, 2017 at 10:11 AM, Atin Mukherjee <amukherj@xxxxxxxxxx> wrote:
>> >
>> > On Tue, Dec 19, 2017 at 1:10 AM, Ziemowit Pierzycki
>> > <ziemowit@xxxxxxxxxxxxx> wrote:
>> >>
>> >> Hi,
>> >>
>> >> I have a cluster of 10 servers, all running Fedora 24 along with
>> >> Gluster 3.8. I'm planning on doing rolling upgrades to Fedora 27
>> >> with Gluster 3.12. I saw the documentation and did some testing, but
>> >> I would like to run my plan through some (more?) educated minds.
>> >>
>> >> The current setup is:
>> >>
>> >> Volume Name: vol0
>> >> Distributed-Replicate
>> >> Number of Bricks: 2 x (2 + 1) = 6
>> >> Bricks:
>> >> Brick1: glt01:/vol/vol0
>> >> Brick2: glt02:/vol/vol0
>> >> Brick3: glt05:/vol/vol0 (arbiter)
>> >> Brick4: glt03:/vol/vol0
>> >> Brick5: glt04:/vol/vol0
>> >> Brick6: glt06:/vol/vol0 (arbiter)
>> >>
>> >> Volume Name: vol1
>> >> Distributed-Replicate
>> >> Number of Bricks: 2 x (2 + 1) = 6
>> >> Bricks:
>> >> Brick1: glt07:/vol/vol1
>> >> Brick2: glt08:/vol/vol1
>> >> Brick3: glt05:/vol/vol1 (arbiter)
>> >> Brick4: glt09:/vol/vol1
>> >> Brick5: glt10:/vol/vol1
>> >> Brick6: glt06:/vol/vol1 (arbiter)
>> >>
>> >> After performing the upgrade, because of the differences in checksums,
>> >> the upgraded nodes become:
>> >>
>> >> State: Peer Rejected (Connected)
>> >
>> > Have you upgraded all the nodes? If yes, have you bumped up the
>> > cluster.op-version after upgrading all the nodes? Please follow
>> > http://docs.gluster.org/en/latest/Upgrade-Guide/op_version/ for more
>> > details on how to bump up the cluster.op-version. In case you have
>> > done all of these and you're still seeing a checksum issue, then I'm
>> > afraid you have hit a bug. I'd need further details, like the checksum
>> > mismatch error from the glusterd.log file along with the exact
>> > volume info file (/var/lib/glusterd/vols/<volname>/info) from both
>> > peers, to debug this further.
>> >
>> >> If I start doing the upgrades one at a time, with nodes glt10 to
>> >> glt01 except for the arbiters glt05 and glt06, and then upgrading the
>> >> arbiters last, everything should remain online at all times through
>> >> the process. Correct?
>> >>
>> >> Thanks.
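On the cluster.op-version point in Atin's reply above: once every node has
been upgraded, the sequence in the linked upgrade guide is to check the
current and maximum supported op-version and then bump the cluster to that
maximum. A sketch, assuming all peers are already on the new version
(cluster.max-op-version needs glusterfs 3.10 or newer):

    gluster volume get all cluster.op-version
    gluster volume get all cluster.max-op-version

    # Bump to whatever max-op-version reports:
    maxop=$(gluster volume get all cluster.max-op-version | awk '/cluster.max-op-version/ {print $2}')
    gluster volume set all cluster.op-version "$maxop"

As Atin notes, this should only be done after all of the nodes have been
upgraded.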
--
Regards,
Hari Gowtham.

_______________________________________________
Gluster-users mailing list
Gluster-users@xxxxxxxxxxx
http://lists.gluster.org/mailman/listinfo/gluster-users