Unable to upgrade nodes because of cksums mismatch

Michael Böhm <dudleyperkins@xxxxxxxxx> · Mon, 27 Dec 2021 13:55:10 +0100

Hey guys,
I have a problem upgrading our nodes from 8.3 to 10.0. I just upgraded the first node and ran into the "cksums mismatch" problem: on the upgraded v10 node the checksums for all volumes are different from those on the remaining v8 nodes, so the node starts in a peer rejected state. I can only resolve this by following the steps described here:
https://staged-gluster-docs.readthedocs.io/en/release3.7.0beta1/Administrator%20Guide/Resolving%20Peer%20Rejected/
(stop glusterd, delete everything in /var/lib/glusterd/ except glusterd.info, start glusterd, probe a v8 peer, restart glusterd again)
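
For reference, those steps can be sketched roughly as follows. This is only a sketch under assumptions: glusterd is managed by systemd, the state directory is /var/lib/glusterd, and the helper names `clean_glusterd_state` and `rejoin_via_peer` are hypothetical, not part of Gluster. The cleanup helper takes the directory as an argument so it can be tried on a scratch copy first.

```shell
#!/bin/sh
# Sketch only: helper names are hypothetical, not part of Gluster.

# Delete everything directly under the given glusterd state directory
# except glusterd.info (which holds the node's UUID).
clean_glusterd_state() {
    dir="$1"
    [ -n "$dir" ] && [ -d "$dir" ] || { echo "no such directory: $dir" >&2; return 1; }
    find "$dir" -mindepth 1 -maxdepth 1 ! -name glusterd.info -exec rm -rf -- {} +
}

# Full sequence for a rejected node; $1 is a healthy peer (a v8 node here).
# Assumes glusterd runs under systemd; stops at the first failing step.
rejoin_via_peer() {
    peer="$1"
    systemctl stop glusterd &&
    clean_glusterd_state /var/lib/glusterd &&
    systemctl start glusterd &&
    gluster peer probe "$peer" &&
    systemctl restart glusterd &&
    gluster peer status
}
```

Nothing runs when the file is sourced. On the rejected node the sequence would be started with something like `rejoin_via_peer <hostname-of-a-v8-node>`.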

After that the cluster seems healthy again, self-healing starts and everything looks fine, except that the newly created cksums are still different from those on the other nodes. That means the healthy state only lasts until I reboot the node, at which point it all starts over: the node comes up as peer rejected.
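
A minimal way to see which volumes differ is to dump the per-volume cksum files on each node and compare the output. The sketch below assumes the checksums live in /var/lib/glusterd/vols/<volname>/cksum; `print_vol_cksums` is a hypothetical helper name.

```shell
#!/bin/sh
# Sketch only: print "<volume> <cksum file contents>" for each volume
# found under a glusterd state directory (default /var/lib/glusterd).
# print_vol_cksums is a hypothetical helper name, not a Gluster command.
print_vol_cksums() {
    base="${1:-/var/lib/glusterd}"
    for f in "$base"/vols/*/cksum; do
        # Skip the unexpanded glob when no volumes exist.
        [ -f "$f" ] || continue
        vol=$(basename "$(dirname "$f")")
        printf '%s %s\n' "$vol" "$(cat "$f")"
    done
}
```

Running `print_vol_cksums > cksums.$(hostname)` on the v10 node and on one v8 node, then comparing the two files with `diff`, shows exactly which volumes disagree.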

I've read about the problem here:
https://github.com/gluster/glusterfs/issues/1332 (although according to that issue the problem should only occur when upgrading from a version earlier than v7)
and also here on the mailing list:
https://lists.gluster.org/pipermail/gluster-users/2021-November/039679.html (I think I have the same problem, but unfortunately no solution was given there)

The suggested solutions seem to require upgrading all nodes, with the problem resolved once op.version is finally upgraded. But I don't think this approach can be done online, and there's not really a way for me to do it offline.

Why is this happening now and not when I upgraded from pre-7 to 7? All my nodes are on 8.3 and op.version is 8000.

One thing I might have done "wrong": when I upgraded to v8 I didn't set "gluster volume set <volname> fips-mode-rchecksum on" on the volumes. I think I just overlooked it in the docs. The option is only set on the 2 volumes I created after upgrading to v8. But even on those 2 the cksums differ, so I guess it wouldn't help a lot if I set the option on all the other volumes?
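
To see which volumes currently have the option set, something like the following could be run on any node. It is a sketch that assumes the full option key is `storage.fips-mode-rchecksum` (the text above uses the short form) and that volume names contain no whitespace; `list_rchecksum_settings` is a hypothetical helper name.

```shell
#!/bin/sh
# Sketch only: show the fips-mode-rchecksum setting of every volume.
# list_rchecksum_settings is a hypothetical helper name; the full key
# "storage.fips-mode-rchecksum" is an assumption.
list_rchecksum_settings() {
    for vol in $(gluster volume list); do
        echo "== $vol"
        gluster volume get "$vol" storage.fips-mode-rchecksum
    done
}
```

It only reads settings and changes nothing on the cluster.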

I really don't know what to do now. I kind of understand the problem, but I don't know why this is happening on a cluster that is entirely v8. I can't take all 9 nodes down, upgrade them all to v10 and rely on everything being fine after the final op.version upgrade.

Can someone point me in a safe direction?

Regards

Mika
