Re: Upgrade 10.4 -> 11.1 making problems

I don't want to hijack the thread, and in my case setting the logs to debug would fill my /var partitions in no time. Maybe the OP can.

Diego

On 18/01/2024 22:58, Strahil Nikolov wrote:
Are you able to set the logs to debug level?
It might provide a clue about what is going on.
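For reference, log levels can be raised per volume from any server node; a minimal sketch, assuming the volume name "workdata" from the logs below (remember to revert afterwards, since DEBUG output grows quickly):

```shell
# Raise brick- and client-side log verbosity for one volume.
gluster volume set workdata diagnostics.brick-log-level DEBUG
gluster volume set workdata diagnostics.client-log-level DEBUG

# Reproduce the problem, then inspect e.g. /var/log/glusterfs/glustershd.log.

# Revert to the defaults so /var does not fill up.
gluster volume set workdata diagnostics.brick-log-level INFO
gluster volume set workdata diagnostics.client-log-level INFO
```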

Best Regards,
Strahil Nikolov

    On Thu, Jan 18, 2024 at 13:08, Diego Zuccato
    <diego.zuccato@xxxxxxxx> wrote:
    That's the same kind of errors I keep seeing on my 2 clusters,
    regenerated some months ago. It seems like a pseudo-split-brain that
    should be impossible on a replica 3 cluster, but it keeps happening.
    Sadly, I'm going to ditch Gluster ASAP.

    Diego

    On 18/01/2024 07:11, Hu Bert wrote:
     > Good morning,
     > heal is still not running. Pending heals now sum up to 60K per brick.
     > Heal started instantly (e.g. after a server reboot) with version
     > 10.4, but doesn't with version 11. What could be wrong?
     >
     > I only see these errors on one of the "good" servers in glustershd.log:
     >
     > [2024-01-18 06:08:57.328480 +0000] W [MSGID: 114031]
     > [client-rpc-fops_v2.c:2561:client4_0_lookup_cbk] 0-workdata-client-0:
     > remote operation failed.
     > [{path=<gfid:cb39a1e4-2a4c-4727-861d-3ed9ef00681b>},
     > {gfid=cb39a1e4-2a4c-4727-861d-3ed9ef00681b}, {errno=2},
     > {error=No such file or directory}]
     > [2024-01-18 06:08:57.594051 +0000] W [MSGID: 114031]
     > [client-rpc-fops_v2.c:2561:client4_0_lookup_cbk] 0-workdata-client-1:
     > remote operation failed.
     > [{path=<gfid:3e9b178c-ae1f-4d85-ae47-fc539d94dd11>},
     > {gfid=3e9b178c-ae1f-4d85-ae47-fc539d94dd11}, {errno=2},
     > {error=No such file or directory}]
     >
     > About 7K today. Any ideas? Someone?
     >
     >
     > Best regards,
     > Hubert
     >
     > On Wed, 17 Jan 2024 at 11:24, Hu Bert <revirii@xxxxxxxxxxxxxx> wrote:
     >>
     >> Ok, finally managed to get all servers, volumes etc. running, but it
     >> took a couple of restarts, cksum checks etc.
     >>
     >> One problem: a volume doesn't heal automatically, or doesn't heal at all.
     >>
     >> gluster volume status
     >> Status of volume: workdata
     >> Gluster process                             TCP Port  RDMA Port  Online  Pid
     >> ------------------------------------------------------------------------------
     >> Brick glusterpub1:/gluster/md3/workdata     58832     0          Y       3436
     >> Brick glusterpub2:/gluster/md3/workdata     59315     0          Y       1526
     >> Brick glusterpub3:/gluster/md3/workdata     56917     0          Y       1952
     >> Brick glusterpub1:/gluster/md4/workdata     59688     0          Y       3755
     >> Brick glusterpub2:/gluster/md4/workdata     60271     0          Y       2271
     >> Brick glusterpub3:/gluster/md4/workdata     49461     0          Y       2399
     >> Brick glusterpub1:/gluster/md5/workdata     54651     0          Y       4208
     >> Brick glusterpub2:/gluster/md5/workdata     49685     0          Y       2751
     >> Brick glusterpub3:/gluster/md5/workdata     59202     0          Y       2803
     >> Brick glusterpub1:/gluster/md6/workdata     55829     0          Y       4583
     >> Brick glusterpub2:/gluster/md6/workdata     50455     0          Y       3296
     >> Brick glusterpub3:/gluster/md6/workdata     50262     0          Y       3237
     >> Brick glusterpub1:/gluster/md7/workdata     52238     0          Y       5014
     >> Brick glusterpub2:/gluster/md7/workdata     52474     0          Y       3673
     >> Brick glusterpub3:/gluster/md7/workdata     57966     0          Y       3653
     >> Self-heal Daemon on localhost               N/A       N/A        Y       4141
     >> Self-heal Daemon on glusterpub1             N/A       N/A        Y       5570
     >> Self-heal Daemon on glusterpub2             N/A       N/A        Y       4139
     >>
     >> "gluster volume heal workdata info" lists a lot of files per brick.
     >> "gluster volume heal workdata statistics heal-count" shows thousands
     >> of files per brick.
     >> "gluster volume heal workdata enable" has no effect.
     >>
     >> gluster volume heal workdata full
     >> Launching heal operation to perform full self heal on volume workdata
     >> has been successful
     >> Use heal info commands to check status.
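A few commands that are useful for watching whether heal is making any progress at all; a sketch, again assuming the volume name "workdata":

```shell
# Compact overview instead of the full per-file list.
gluster volume heal workdata info summary

# Per-brick counters; run repeatedly to see whether the numbers move.
gluster volume heal workdata statistics heal-count

# Confirm the self-heal daemon is actually enabled for the volume.
gluster volume get workdata cluster.self-heal-daemon
```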
     >>
     >> -> not doing anything at all. And nothing is happening on the 2 "good"
     >> servers in e.g. glustershd.log. Heal was working as expected with
     >> version 10.4, but here... silence. Does someone have an idea?
     >>
     >>
     >> Best regards,
     >> Hubert
     >>
     >> On Tue, 16 Jan 2024 at 13:44, Gilberto Ferreira
     >> <gilberto.nunes32@xxxxxxxxx> wrote:
     >>>
     >>> Ah! Indeed! You need to perform an upgrade on the clients as well.
     >>>
     >>>
     >>> On Tue, 16 Jan 2024 at 03:12, Hu Bert <revirii@xxxxxxxxxxxxxx> wrote:
     >>>>
     >>>> morning to those still reading :-)
     >>>>
     >>>> I found this:
     >>>> https://docs.gluster.org/en/main/Troubleshooting/troubleshooting-glusterd/#common-issues-and-how-to-resolve-them
     >>>>
     >>>> There's a paragraph about "peer rejected" with the same error message,
     >>>> telling me: "Update the cluster.op-version" - I had only updated the
     >>>> server nodes, but not the clients. So upgrading the cluster.op-version
     >>>> wasn't possible at this point. So... upgrading the clients to version
     >>>> 11.1 and then bumping the op-version should solve the problem?
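The op-version check and bump can be done cluster-wide; a sketch of the usual sequence (the target number must be taken from the reported maximum, not guessed - 110000 below is only an example):

```shell
# Current cluster-wide op-version and the highest one all peers support.
gluster volume get all cluster.op-version
gluster volume get all cluster.max-op-version

# Once servers AND clients run 11.1, raise the op-version to the value
# reported by cluster.max-op-version above, e.g.:
gluster volume set all cluster.op-version 110000
```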
     >>>>
     >>>>
     >>>> Thx,
     >>>> Hubert
     >>>>
     >>>> On Mon, 15 Jan 2024 at 09:16, Hu Bert <revirii@xxxxxxxxxxxxxx> wrote:
     >>>>>
     >>>>> Hi,
     >>>>> I just upgraded some gluster servers from version 10.4 to version
     >>>>> 11.1 (Debian bullseye & bookworm). When only installing the
     >>>>> packages: good - servers, volumes etc. work as expected.
     >>>>>
     >>>>> But one needs to test whether the systems still work after a daemon
     >>>>> and/or server restart. Well, I did a reboot, and after that the
     >>>>> rebooted/restarted system is "out". Log messages from a working node:
     >>>>>
     >>>>> [2024-01-15 08:02:21.585694 +0000] I [MSGID: 106163]
     >>>>> [glusterd-handshake.c:1501:__glusterd_mgmt_hndsk_versions_ack]
     >>>>> 0-management: using the op-version 100000
     >>>>> [2024-01-15 08:02:21.589601 +0000] I [MSGID: 106490]
     >>>>> [glusterd-handler.c:2546:__glusterd_handle_incoming_friend_req]
     >>>>> 0-glusterd: Received probe from uuid:
     >>>>> b71401c3-512a-47cb-ac18-473c4ba7776e
     >>>>> [2024-01-15 08:02:23.608349 +0000] E [MSGID: 106010]
     >>>>> [glusterd-utils.c:3824:glusterd_compare_friend_volume] 0-management:
     >>>>> Version of Cksums sourceimages differ. local cksum = 2204642525,
     >>>>> remote cksum = 1931483801 on peer gluster190
     >>>>> [2024-01-15 08:02:23.608584 +0000] I [MSGID: 106493]
     >>>>> [glusterd-handler.c:3819:glusterd_xfer_friend_add_resp] 0-glusterd:
     >>>>> Responded to gluster190 (0), ret: 0, op_ret: -1
     >>>>> [2024-01-15 08:02:23.613553 +0000] I [MSGID: 106493]
     >>>>> [glusterd-rpc-ops.c:467:__glusterd_friend_add_cbk] 0-glusterd:
     >>>>> Received RJT from uuid: b71401c3-512a-47cb-ac18-473c4ba7776e, host:
     >>>>> gluster190, port: 0
     >>>>>
     >>>>> peer status from rebooted node:
     >>>>>
     >>>>> root@gluster190 ~ # gluster peer status
     >>>>> Number of Peers: 2
     >>>>>
     >>>>> Hostname: gluster189
     >>>>> Uuid: 50dc8288-aa49-4ea8-9c6c-9a9a926c67a7
     >>>>> State: Peer Rejected (Connected)
     >>>>>
     >>>>> Hostname: gluster188
     >>>>> Uuid: e15a33fe-e2f7-47cf-ac53-a3b34136555d
     >>>>> State: Peer Rejected (Connected)
     >>>>>
     >>>>> So the rebooted gluster190 is not accepted anymore, and thus does
     >>>>> not appear in "gluster volume status". I then followed this guide:
     >>>>>
     >>>>> https://gluster-documentations.readthedocs.io/en/latest/Administrator%20Guide/Resolving%20Peer%20Rejected/
     >>>>>
     >>>>> Remove everything under /var/lib/glusterd/ (except glusterd.info)
     >>>>> and restart the glusterd service etc. Data gets copied from the
     >>>>> other nodes, 'gluster peer status' is ok again - but the volume
     >>>>> info is missing, /var/lib/glusterd/vols is empty. When syncing this
     >>>>> dir from another node, the volume is available again, heals start etc.
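The guide's procedure roughly condenses to the following; the hostnames are those from this thread, and the find invocation is just one way to clear the directory while keeping glusterd.info (take a backup of /var/lib/glusterd first):

```shell
# On the rejected node (gluster190 here).
systemctl stop glusterd
cd /var/lib/glusterd
find . -mindepth 1 ! -name glusterd.info -delete   # keep only glusterd.info
systemctl start glusterd

# Probe a healthy peer, then restart glusterd once more so the volume
# definitions under /var/lib/glusterd/vols get re-synced.
gluster peer probe gluster189
systemctl restart glusterd
gluster peer status
```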
     >>>>>
     >>>>> Well, and just to be sure that everything works as it should, I
     >>>>> rebooted that node again - the rebooted node is kicked out again,
     >>>>> and you have to restart the whole procedure to bring it back.
     >>>>>
     >>>>> Sorry, but did I miss anything? Has someone experienced similar
     >>>>> problems? I'll probably downgrade to 10.4 again; that version was
     >>>>> working...
     >>>>>
     >>>>>
     >>>>> Thx,
     >>>>> Hubert



--
Diego Zuccato
DIFA - Dip. di Fisica e Astronomia
Servizi Informatici
Alma Mater Studiorum - Università di Bologna
V.le Berti-Pichat 6/2 - 40127 Bologna - Italy
tel.: +39 051 20 95786
________



Community Meeting Calendar:

Schedule -
Every 2nd and 4th Tuesday at 14:30 IST / 09:00 UTC
Bridge: https://meet.google.com/cpu-eiue-hvk
Gluster-users mailing list
Gluster-users@xxxxxxxxxxx
https://lists.gluster.org/mailman/listinfo/gluster-users



