On 12/20/2017 01:01 AM, Hari Gowtham wrote:
> Yes Atin. I'll take a look.

Once we have a root cause and a way around it, please document this in the
upgrade procedure in our docs as well. That way future problems have a
documented solution (outside of the lists as well). Thanks!

>
> On Wed, Dec 20, 2017 at 11:28 AM, Atin Mukherjee <amukherj@xxxxxxxxxx> wrote:
>> Looks like a bug, as I see tier-enabled = 0 is an additional entry in the
>> info file on shchhv01. As per the code, this field should be written into
>> the glusterd store if the op-version is >= 30706. What I am guessing is
>> that since we didn't have commit 33f8703a1 ("glusterd: regenerate volfiles
>> on op-version bump up") in 3.8.4, the info and volfiles were not
>> regenerated while bumping up the op-version, which caused the tier-enabled
>> entry to be missing in the info file.
>>
>> For now, you can copy the info file for the volumes where the mismatch
>> happened from shchhv01 to shchhv02 and restart the glusterd service on
>> shchhv02. That should fix this up temporarily. Unfortunately this step
>> might need to be repeated for the other nodes as well.
>>
>> @Hari - Could you help in debugging this further?
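For anyone following along, the copy-and-restart workaround Atin describes
boils down to something like the rough sketch below. This is only an outline,
assuming the affected volume is shchst01, the info file lives at
/var/lib/glusterd/vols/shchst01/info (the path Atin mentions further down in
this thread), root SSH works between the peers, and glusterd is managed by
systemd; adjust names and paths to your environment.

  # on shchhv02, keep a backup of the current info file first
  cp /var/lib/glusterd/vols/shchst01/info /var/lib/glusterd/vols/shchst01/info.bak

  # pull the info file from the upgraded node shchhv01
  scp root@shchhv01:/var/lib/glusterd/vols/shchst01/info \
      /var/lib/glusterd/vols/shchst01/info

  # restart glusterd on shchhv02 and re-check the peer state
  systemctl restart glusterd
  gluster peer status

As Atin notes, this may have to be repeated on each peer that still shows the
checksum mismatch.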
>>
>> On Wed, Dec 20, 2017 at 10:44 AM, Gustave Dahl <gustave@xxxxxxxxxxxxxx> wrote:
>>>
>>> I was attempting the same on a local sandbox and also have the same
>>> problem.
>>>
>>> Current: 3.8.4
>>>
>>> Volume Name: shchst01
>>> Type: Distributed-Replicate
>>> Volume ID: bcd53e52-cde6-4e58-85f9-71d230b7b0d3
>>> Status: Started
>>> Snapshot Count: 0
>>> Number of Bricks: 4 x 3 = 12
>>> Transport-type: tcp
>>> Bricks:
>>> Brick1: shchhv01-sto:/data/brick3/shchst01
>>> Brick2: shchhv02-sto:/data/brick3/shchst01
>>> Brick3: shchhv03-sto:/data/brick3/shchst01
>>> Brick4: shchhv01-sto:/data/brick1/shchst01
>>> Brick5: shchhv02-sto:/data/brick1/shchst01
>>> Brick6: shchhv03-sto:/data/brick1/shchst01
>>> Brick7: shchhv02-sto:/data/brick2/shchst01
>>> Brick8: shchhv03-sto:/data/brick2/shchst01
>>> Brick9: shchhv04-sto:/data/brick2/shchst01
>>> Brick10: shchhv02-sto:/data/brick4/shchst01
>>> Brick11: shchhv03-sto:/data/brick4/shchst01
>>> Brick12: shchhv04-sto:/data/brick4/shchst01
>>> Options Reconfigured:
>>> cluster.data-self-heal-algorithm: full
>>> features.shard-block-size: 512MB
>>> features.shard: enable
>>> performance.readdir-ahead: on
>>> storage.owner-uid: 9869
>>> storage.owner-gid: 9869
>>> server.allow-insecure: on
>>> performance.quick-read: off
>>> performance.read-ahead: off
>>> performance.io-cache: off
>>> performance.stat-prefetch: off
>>> cluster.eager-lock: enable
>>> network.remote-dio: enable
>>> cluster.quorum-type: auto
>>> cluster.server-quorum-type: server
>>> cluster.self-heal-daemon: on
>>> nfs.disable: on
>>> performance.io-thread-count: 64
>>> performance.cache-size: 1GB
>>>
>>> Upgraded shchhv01-sto to 3.12.3; the others remain at 3.8.4.
>>>
>>> RESULT
>>> =====================
>>> Hostname: shchhv01-sto
>>> Uuid: f6205edb-a0ea-4247-9594-c4cdc0d05816
>>> State: Peer Rejected (Connected)
>>>
>>> Upgraded Server: shchhv01-sto
>>> ==============================
>>> [2017-12-20 05:02:44.747313] I [MSGID: 101190]
>>> [event-epoll.c:613:event_dispatch_epoll_worker] 0-epoll: Started thread
>>> with index 1
>>> [2017-12-20 05:02:44.747387] I [MSGID: 101190]
>>> [event-epoll.c:613:event_dispatch_epoll_worker] 0-epoll: Started thread
>>> with index 2
>>> [2017-12-20 05:02:44.749087] W [rpc-clnt-ping.c:246:rpc_clnt_ping_cbk]
>>> 0-management: RPC_CLNT_PING notify failed
>>> [2017-12-20 05:02:44.749165] W [rpc-clnt-ping.c:246:rpc_clnt_ping_cbk]
>>> 0-management: RPC_CLNT_PING notify failed
>>> [2017-12-20 05:02:44.749563] W [rpc-clnt-ping.c:246:rpc_clnt_ping_cbk]
>>> 0-management: RPC_CLNT_PING notify failed
>>> [2017-12-20 05:02:54.676324] I [MSGID: 106493]
>>> [glusterd-rpc-ops.c:486:__glusterd_friend_add_cbk] 0-glusterd: Received
>>> RJT from uuid: 546503ae-ba0e-40d4-843f-c5dbac22d272, host: shchhv02-sto,
>>> port: 0
>>> [2017-12-20 05:02:54.690237] I [MSGID: 106163]
>>> [glusterd-handshake.c:1316:__glusterd_mgmt_hndsk_versions_ack]
>>> 0-management: using the op-version 30800
>>> [2017-12-20 05:02:54.695823] I [MSGID: 106490]
>>> [glusterd-handler.c:2540:__glusterd_handle_incoming_friend_req]
>>> 0-glusterd: Received probe from uuid: 546503ae-ba0e-40d4-843f-c5dbac22d272
>>> [2017-12-20 05:02:54.696956] E [MSGID: 106010]
>>> [glusterd-utils.c:3370:glusterd_compare_friend_volume] 0-management:
>>> Version of Cksums shchst01-sto differ. local cksum = 4218452135, remote
>>> cksum = 2747317484 on peer shchhv02-sto
>>> [2017-12-20 05:02:54.697796] I [MSGID: 106493]
>>> [glusterd-handler.c:3800:glusterd_xfer_friend_add_resp] 0-glusterd:
>>> Responded to shchhv02-sto (0), ret: 0, op_ret: -1
>>> [2017-12-20 05:02:55.033822] I [MSGID: 106493]
>>> [glusterd-rpc-ops.c:486:__glusterd_friend_add_cbk] 0-glusterd: Received
>>> RJT from uuid: 3de22cb5-c1c1-4041-a1e1-eb969afa9b4b, host: shchhv03-sto,
>>> port: 0
>>> [2017-12-20 05:02:55.038460] I [MSGID: 106163]
>>> [glusterd-handshake.c:1316:__glusterd_mgmt_hndsk_versions_ack]
>>> 0-management: using the op-version 30800
>>> [2017-12-20 05:02:55.040032] I [MSGID: 106490]
>>> [glusterd-handler.c:2540:__glusterd_handle_incoming_friend_req]
>>> 0-glusterd: Received probe from uuid: 3de22cb5-c1c1-4041-a1e1-eb969afa9b4b
>>> [2017-12-20 05:02:55.040266] E [MSGID: 106010]
>>> [glusterd-utils.c:3370:glusterd_compare_friend_volume] 0-management:
>>> Version of Cksums shchst01-sto differ. local cksum = 4218452135, remote
>>> cksum = 2747317484 on peer shchhv03-sto
>>> [2017-12-20 05:02:55.040405] I [MSGID: 106493]
>>> [glusterd-handler.c:3800:glusterd_xfer_friend_add_resp] 0-glusterd:
>>> Responded to shchhv03-sto (0), ret: 0, op_ret: -1
>>> [2017-12-20 05:02:55.584854] I [MSGID: 106493]
>>> [glusterd-rpc-ops.c:486:__glusterd_friend_add_cbk] 0-glusterd: Received
>>> RJT from uuid: 36306e37-d7f0-4fec-9140-0d0f1bd2d2d5, host: shchhv04-sto,
>>> port: 0
>>> [2017-12-20 05:02:55.595125] I [MSGID: 106163]
>>> [glusterd-handshake.c:1316:__glusterd_mgmt_hndsk_versions_ack]
>>> 0-management: using the op-version 30800
>>> [2017-12-20 05:02:55.600804] I [MSGID: 106490]
>>> [glusterd-handler.c:2540:__glusterd_handle_incoming_friend_req]
>>> 0-glusterd: Received probe from uuid: 36306e37-d7f0-4fec-9140-0d0f1bd2d2d5
>>> [2017-12-20 05:02:55.601288] E [MSGID: 106010]
>>> [glusterd-utils.c:3370:glusterd_compare_friend_volume] 0-management:
>>> Version of Cksums shchst01-sto differ. local cksum = 4218452135, remote
>>> cksum = 2747317484 on peer shchhv04-sto
>>> [2017-12-20 05:02:55.601497] I [MSGID: 106493]
>>> [glusterd-handler.c:3800:glusterd_xfer_friend_add_resp] 0-glusterd:
>>> Responded to shchhv04-sto (0), ret: 0, op_ret: -1
>>>
>>> Another Server: shchhv02-sto
>>> ==============================
>>> [2017-12-20 05:02:44.667833] W
>>> [glusterd-locks.c:675:glusterd_mgmt_v3_unlock]
>>> (-->/usr/lib64/glusterfs/3.8.4/xlator/mgmt/glusterd.so(+0x1de5c)
>>> [0x7f75fdc12e5c]
>>> -->/usr/lib64/glusterfs/3.8.4/xlator/mgmt/glusterd.so(+0x27a08)
>>> [0x7f75fdc1ca08]
>>> -->/usr/lib64/glusterfs/3.8.4/xlator/mgmt/glusterd.so(+0xd07fa)
>>> [0x7f75fdcc57fa] ) 0-management: Lock for vol shchst01-sto not held
>>> [2017-12-20 05:02:44.667795] I [MSGID: 106004]
>>> [glusterd-handler.c:5219:__glusterd_peer_rpc_notify] 0-management: Peer
>>> <shchhv01-sto> (<f6205edb-a0ea-4247-9594-c4cdc0d05816>), in state <Peer
>>> Rejected>, has disconnected from glusterd.
>>> [2017-12-20 05:02:44.667948] W [MSGID: 106118]
>>> [glusterd-handler.c:5241:__glusterd_peer_rpc_notify] 0-management: Lock
>>> not released for shchst01-sto
>>> [2017-12-20 05:02:44.760103] I [MSGID: 106163]
>>> [glusterd-handshake.c:1271:__glusterd_mgmt_hndsk_versions_ack]
>>> 0-management: using the op-version 30800
>>> [2017-12-20 05:02:44.765389] I [MSGID: 106490]
>>> [glusterd-handler.c:2608:__glusterd_handle_incoming_friend_req]
>>> 0-glusterd: Received probe from uuid: f6205edb-a0ea-4247-9594-c4cdc0d05816
>>> [2017-12-20 05:02:54.686185] E [MSGID: 106010]
>>> [glusterd-utils.c:2930:glusterd_compare_friend_volume] 0-management:
>>> Version of Cksums shchst01 differ. local cksum = 2747317484, remote
>>> cksum = 4218452135 on peer shchhv01-sto
>>> [2017-12-20 05:02:54.686882] I [MSGID: 106493]
>>> [glusterd-handler.c:3852:glusterd_xfer_friend_add_resp] 0-glusterd:
>>> Responded to shchhv01-sto (0), ret: 0, op_ret: -1
>>> [2017-12-20 05:02:54.717854] I [MSGID: 106493]
>>> [glusterd-rpc-ops.c:476:__glusterd_friend_add_cbk] 0-glusterd: Received
>>> RJT from uuid: f6205edb-a0ea-4247-9594-c4cdc0d05816, host: shchhv01-sto,
>>> port: 0
>>>
>>> Another Server: shchhv04-sto
>>> ==============================
>>> [2017-12-20 05:02:44.667620] I [MSGID: 106004]
>>> [glusterd-handler.c:5219:__glusterd_peer_rpc_notify] 0-management: Peer
>>> <shchhv01-sto> (<f6205edb-a0ea-4247-9594-c4cdc0d05816>), in state <Peer
>>> Rejected>, has disconnected from glusterd.
>>> [2017-12-20 05:02:44.667808] W
>>> [glusterd-locks.c:675:glusterd_mgmt_v3_unlock]
>>> (-->/usr/lib64/glusterfs/3.8.4/xlator/mgmt/glusterd.so(+0x1de5c)
>>> [0x7f10a33d9e5c]
>>> -->/usr/lib64/glusterfs/3.8.4/xlator/mgmt/glusterd.so(+0x27a08)
>>> [0x7f10a33e3a08]
>>> -->/usr/lib64/glusterfs/3.8.4/xlator/mgmt/glusterd.so(+0xd07fa)
>>> [0x7f10a348c7fa] ) 0-management: Lock for vol shchst01-sto not held
>>> [2017-12-20 05:02:44.667827] W [MSGID: 106118]
>>> [glusterd-handler.c:5241:__glusterd_peer_rpc_notify] 0-management: Lock
>>> not released for shchst01-sto
>>> [2017-12-20 05:02:44.760077] I [MSGID: 106163]
>>> [glusterd-handshake.c:1271:__glusterd_mgmt_hndsk_versions_ack]
>>> 0-management: using the op-version 30800
>>> [2017-12-20 05:02:44.768796] I [MSGID: 106490]
>>> [glusterd-handler.c:2608:__glusterd_handle_incoming_friend_req]
>>> 0-glusterd: Received probe from uuid: f6205edb-a0ea-4247-9594-c4cdc0d05816
>>> [2017-12-20 05:02:55.595095] E [MSGID: 106010]
>>> [glusterd-utils.c:2930:glusterd_compare_friend_volume] 0-management:
>>> Version of Cksums shchst01-sto differ. local cksum = 2747317484, remote
>>> cksum = 4218452135 on peer shchhv01-sto
>>> [2017-12-20 05:02:55.595273] I [MSGID: 106493]
>>> [glusterd-handler.c:3852:glusterd_xfer_friend_add_resp] 0-glusterd:
>>> Responded to shchhv01-sto (0), ret: 0, op_ret: -1
>>> [2017-12-20 05:02:55.612957] I [MSGID: 106493]
>>> [glusterd-rpc-ops.c:476:__glusterd_friend_add_cbk] 0-glusterd: Received
>>> RJT from uuid: f6205edb-a0ea-4247-9594-c4cdc0d05816, host: shchhv01-sto,
>>> port: 0
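(As an aside: the glusterd entries quoted above normally live in
/var/log/glusterfs/glusterd.log on a default install, so on each peer the
relevant mismatch lines can be pulled out with something along the lines of

  grep -E 'Cksums .* differ' /var/log/glusterfs/glusterd.log

which is a quick way to collect the local/remote checksum pairs Atin asks for
later in the thread.)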
>>>
>>> <vol>/info
>>>
>>> Upgraded Server: shchst01-sto
>>> =========================
>>> type=2
>>> count=12
>>> status=1
>>> sub_count=3
>>> stripe_count=1
>>> replica_count=3
>>> disperse_count=0
>>> redundancy_count=0
>>> version=52
>>> transport-type=0
>>> volume-id=bcd53e52-cde6-4e58-85f9-71d230b7b0d3
>>> username=5a4ae8d8-dbcb-408e-ab73-629255c14ffc
>>> password=58652573-0955-4d00-893a-9f42d0f16717
>>> op-version=30700
>>> client-op-version=30700
>>> quota-version=0
>>> tier-enabled=0
>>> parent_volname=N/A
>>> restored_from_snap=00000000-0000-0000-0000-000000000000
>>> snap-max-hard-limit=256
>>> cluster.data-self-heal-algorithm=full
>>> features.shard-block-size=512MB
>>> features.shard=enable
>>> nfs.disable=on
>>> cluster.self-heal-daemon=on
>>> cluster.server-quorum-type=server
>>> cluster.quorum-type=auto
>>> network.remote-dio=enable
>>> cluster.eager-lock=enable
>>> performance.stat-prefetch=off
>>> performance.io-cache=off
>>> performance.read-ahead=off
>>> performance.quick-read=off
>>> server.allow-insecure=on
>>> storage.owner-gid=9869
>>> storage.owner-uid=9869
>>> performance.readdir-ahead=on
>>> performance.io-thread-count=64
>>> performance.cache-size=1GB
>>> brick-0=shchhv01-sto:-data-brick3-shchst01
>>> brick-1=shchhv02-sto:-data-brick3-shchst01
>>> brick-2=shchhv03-sto:-data-brick3-shchst01
>>> brick-3=shchhv01-sto:-data-brick1-shchst01
>>> brick-4=shchhv02-sto:-data-brick1-shchst01
>>> brick-5=shchhv03-sto:-data-brick1-shchst01
>>> brick-6=shchhv02-sto:-data-brick2-shchst01
>>> brick-7=shchhv03-sto:-data-brick2-shchst01
>>> brick-8=shchhv04-sto:-data-brick2-shchst01
>>> brick-9=shchhv02-sto:-data-brick4-shchst01
>>> brick-10=shchhv03-sto:-data-brick4-shchst01
>>> brick-11=shchhv04-sto:-data-brick4-shchst01
>>>
>>> Another Server: shchhv02-sto
>>> ==============================
>>> type=2
>>> count=12
>>> status=1
>>> sub_count=3
>>> stripe_count=1
>>> replica_count=3
>>> disperse_count=0
>>> redundancy_count=0
>>> version=52
>>> transport-type=0
>>> volume-id=bcd53e52-cde6-4e58-85f9-71d230b7b0d3
>>> username=5a4ae8d8-dbcb-408e-ab73-629255c14ffc
>>> password=58652573-0955-4d00-893a-9f42d0f16717
>>> op-version=30700
>>> client-op-version=30700
>>> quota-version=0
>>> parent_volname=N/A
>>> restored_from_snap=00000000-0000-0000-0000-000000000000
>>> snap-max-hard-limit=256
>>> cluster.data-self-heal-algorithm=full
>>> features.shard-block-size=512MB
>>> features.shard=enable
>>> performance.readdir-ahead=on
>>> storage.owner-uid=9869
>>> storage.owner-gid=9869
>>> server.allow-insecure=on
>>> performance.quick-read=off
>>> performance.read-ahead=off
>>> performance.io-cache=off
>>> performance.stat-prefetch=off
>>> cluster.eager-lock=enable
>>> network.remote-dio=enable
>>> cluster.quorum-type=auto
>>> cluster.server-quorum-type=server
>>> cluster.self-heal-daemon=on
>>> nfs.disable=on
>>> performance.io-thread-count=64
>>> performance.cache-size=1GB
>>> brick-0=shchhv01-sto:-data-brick3-shchst01
>>> brick-1=shchhv02-sto:-data-brick3-shchst01
>>> brick-2=shchhv03-sto:-data-brick3-shchst01
>>> brick-3=shchhv01-sto:-data-brick1-shchst01
>>> brick-4=shchhv02-sto:-data-brick1-shchst01
>>> brick-5=shchhv03-sto:-data-brick1-shchst01
>>> brick-6=shchhv02-sto:-data-brick2-shchst01
>>> brick-7=shchhv03-sto:-data-brick2-shchst01
>>> brick-8=shchhv04-sto:-data-brick2-shchst01
>>> brick-9=shchhv02-sto:-data-brick4-shchst01
>>> brick-10=shchhv03-sto:-data-brick4-shchst01
>>> brick-11=shchhv04-sto:-data-brick4-shchst01
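The difference that stands out between the two dumps above is the extra
tier-enabled=0 line on the upgraded node (the option lines are also listed in
a different order, which makes a plain diff noisy). A quick way to compare the
stores directly between two peers is something like the following, run from
shchhv01-sto; this is just a sketch assuming bash (for the process
substitution), root SSH between the peers, and the shchst01 volume name used
in this sandbox:

  # sort both copies so only real content differences show up
  diff <(sort /var/lib/glusterd/vols/shchst01/info) \
       <(ssh root@shchhv02-sto sort /var/lib/glusterd/vols/shchst01/info)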
>>>
>>> NOTE
>>>
>>> [root@shchhv01 shchst01]# gluster volume get shchst01 cluster.op-version
>>> Warning: Support to get global option value using `volume get <volname>`
>>> will be deprecated from next release. Consider using `volume get all`
>>> instead for global options
>>> Option                                   Value
>>> ------                                   -----
>>> cluster.op-version                       30800
>>>
>>> [root@shchhv02 shchst01]# gluster volume get shchst01 cluster.op-version
>>> Option                                   Value
>>> ------                                   -----
>>> cluster.op-version                       30800
>>>
>>> -----Original Message-----
>>> From: gluster-users-bounces@xxxxxxxxxxx
>>> [mailto:gluster-users-bounces@xxxxxxxxxxx] On Behalf Of Ziemowit Pierzycki
>>> Sent: Tuesday, December 19, 2017 3:56 PM
>>> To: gluster-users <gluster-users@xxxxxxxxxxx>
>>> Subject: Re: Upgrading from Gluster 3.8 to 3.12
>>>
>>> I have not done the upgrade yet. Since this is a production cluster I need
>>> to make sure it stays up, or schedule some downtime if it doesn't.
>>> Thanks.
>>>
>>> On Tue, Dec 19, 2017 at 10:11 AM, Atin Mukherjee <amukherj@xxxxxxxxxx> wrote:
>>>>
>>>> On Tue, Dec 19, 2017 at 1:10 AM, Ziemowit Pierzycki
>>>> <ziemowit@xxxxxxxxxxxxx> wrote:
>>>>>
>>>>> Hi,
>>>>>
>>>>> I have a cluster of 10 servers, all running Fedora 24 along with
>>>>> Gluster 3.8. I'm planning on doing rolling upgrades to Fedora 27
>>>>> with Gluster 3.12. I saw the documentation and did some testing, but
>>>>> I would like to run my plan through some (more?) educated minds.
>>>>>
>>>>> The current setup is:
>>>>>
>>>>> Volume Name: vol0
>>>>> Distributed-Replicate
>>>>> Number of Bricks: 2 x (2 + 1) = 6
>>>>> Bricks:
>>>>> Brick1: glt01:/vol/vol0
>>>>> Brick2: glt02:/vol/vol0
>>>>> Brick3: glt05:/vol/vol0 (arbiter)
>>>>> Brick4: glt03:/vol/vol0
>>>>> Brick5: glt04:/vol/vol0
>>>>> Brick6: glt06:/vol/vol0 (arbiter)
>>>>>
>>>>> Volume Name: vol1
>>>>> Distributed-Replicate
>>>>> Number of Bricks: 2 x (2 + 1) = 6
>>>>> Bricks:
>>>>> Brick1: glt07:/vol/vol1
>>>>> Brick2: glt08:/vol/vol1
>>>>> Brick3: glt05:/vol/vol1 (arbiter)
>>>>> Brick4: glt09:/vol/vol1
>>>>> Brick5: glt10:/vol/vol1
>>>>> Brick6: glt06:/vol/vol1 (arbiter)
>>>>>
>>>>> After performing the upgrade, because of differences in checksums, the
>>>>> upgraded nodes will become:
>>>>>
>>>>> State: Peer Rejected (Connected)
>>>>
>>>> Have you upgraded all the nodes? If yes, have you bumped up the
>>>> cluster.op-version after upgrading all the nodes? Please follow:
>>>> http://docs.gluster.org/en/latest/Upgrade-Guide/op_version/ for more
>>>> details on how to bump up the cluster.op-version. In case you have
>>>> done all of these and you're seeing a checksum issue then I'm afraid
>>>> you have hit a bug. I'd need further details like the checksum
>>>> mismatch error from the glusterd.log file along with the exact
>>>> volume's info file from /var/lib/glusterd/vols/<volname>/info from
>>>> both of the peers to debug this further.
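To put Atin's op-version advice in concrete terms, the check-and-bump step
looks roughly like the commands below, run from any one node once every peer
is on the new binaries. This is only a sketch: cluster.max-op-version is
available on newer releases (3.10 onwards), and the exact value to set should
be taken from that query or from the op-version page linked above rather than
from this example.

  # current cluster-wide op-version, and the highest one the installed
  # binaries support
  gluster volume get all cluster.op-version
  gluster volume get all cluster.max-op-version

  # raise the cluster op-version to that maximum (a 312xx value for a
  # 3.12.x cluster; substitute the number reported above)
  gluster volume set all cluster.op-version <max-op-version>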
>>>>> If I start doing the upgrades one at a time, with nodes glt10 to
>>>>> glt01 except for the arbiters glt05 and glt06, and then upgrading the
>>>>> arbiters last, everything should remain online at all times through
>>>>> the process. Correct?
>>>>>
>>>>> Thanks.

_______________________________________________
Gluster-users mailing list
Gluster-users@xxxxxxxxxxx
http://lists.gluster.org/mailman/listinfo/gluster-users