Re: Upgrade 10.4 -> 11.1 making problems

Strahil Nikolov <hunter86_bg@xxxxxxxxx> · Tue, 30 Jan 2024 06:14:37 +0000 (UTC)

This is your problem: the bad server has only 3 clients connected.
I remember there is another gluster volume command that lists the IPs of the clients. Find it and run it to see which clients are actually OK (those 3) and which are not (the remaining 17).

Then try to remount those 17 clients. If the situation persists, work with your network team to identify why those 17 clients can't reach the brick.
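
A minimal sketch of that comparison, assuming the output of `gluster volume status workdata clients` has been saved as text and that clients are listed as IP:port at the start of a line, as in the output quoted in this thread. The function names are hypothetical; the code only reports which client IPs are seen on some bricks but not on others.

```python
import re
from collections import defaultdict

BRICK_RE = re.compile(r"Brick\s*:\s*(\S+)")
# Assumption: each client row starts with "IPv4:port" (as in this thread).
CLIENT_RE = re.compile(r"(\d{1,3}(?:\.\d{1,3}){3}):\d+\b")


def clients_per_brick(status_output):
    """Map each brick to the set of client IPs listed under it."""
    bricks = defaultdict(set)
    current = None
    for raw in status_output.splitlines():
        line = raw.strip()
        brick = BRICK_RE.match(line)
        if brick:
            current = brick.group(1)
            bricks[current]  # register the brick even if it has no clients
            continue
        client = CLIENT_RE.match(line)
        if client and current is not None:
            bricks[current].add(client.group(1))
    return dict(bricks)


def missing_clients(bricks):
    """For each brick, list the client IPs seen elsewhere but not on it."""
    all_clients = set().union(*bricks.values()) if bricks else set()
    return {brick: all_clients - seen for brick, seen in bricks.items()}
```

Feeding it the saved command output and printing `missing_clients(clients_per_brick(text))` gives, per brick, the IPs that would be candidates for a remount.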

Do you have self-heal enabled?
cluster.data-self-heal
cluster.entry-self-heal
cluster.metadata-self-heal
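
A small sketch for checking these three options in one go, assuming `gluster volume get VOLNAME OPTION` is available on the server and prints a two-column "Option / Value" table. The helper names are hypothetical, and the table layout is an assumption about the CLI output, so adjust the parsing if your version prints something different.

```python
import subprocess

SELF_HEAL_OPTIONS = (
    "cluster.data-self-heal",
    "cluster.entry-self-heal",
    "cluster.metadata-self-heal",
)


def parse_volume_get(output):
    """Parse 'Option  Value' rows, skipping the header and dashed rule."""
    values = {}
    for line in output.splitlines():
        parts = line.split()
        if len(parts) < 2:
            continue
        if parts[0] == "Option" or parts[0].strip("-") == "":
            continue
        values[parts[0]] = " ".join(parts[1:])
    return values


def read_options(volume, options=SELF_HEAL_OPTIONS):
    """Query each option with the gluster CLI (must run on a gluster node)."""
    result = {}
    for option in options:
        completed = subprocess.run(
            ["gluster", "volume", "get", volume, option],
            capture_output=True, text=True, check=True,
        )
        result.update(parse_volume_get(completed.stdout))
    return result
```

On one of the servers, `read_options("workdata")` would return a dict with the current value of each of the three options.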

Best Regards,
Strahil Nikolov
   On Mon, Jan 29, 2024 at 10:26, Hu Bert
<revirii@xxxxxxxxxxxxxx> wrote:

  Hi,
not sure what you mean by "clients" - do you mean the clients that
mount the volume?

gluster volume status workdata clients
----------------------------------------------
Brick : glusterpub2:/gluster/md3/workdata
Clients connected : 20
Hostname                BytesRead   BytesWritten   OpVersion
--------                ---------   ------------   ---------
192.168.0.222:49140      43698212       41152108      110000
[...shortened...]
192.168.0.126:49123    8362352021    16445401205      110000
----------------------------------------------
Brick : glusterpub3:/gluster/md3/workdata
Clients connected : 3
Hostname                BytesRead   BytesWritten   OpVersion
--------                ---------   ------------   ---------
192.168.0.44:49150     5855740279    63649538575      110000
192.168.0.44:49137      308958200      319216608      110000
192.168.0.126:49120    7524915770    15489813449      110000

192.168.0.44 (glusterpub3) is the "bad" server. Not sure what you mean
by "old" - probably not the age of the server, but rather the gluster
version. op-version is 110000 on all servers+clients, upgraded from
10.4 -> 11.1

"Have you checked if a client is not allowed to update all 3 copies ?"
-> are there special log messages for that?
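
One way to look for hints, sketched under assumptions: tally the warning and error entries in glustershd.log (or a client mount log) per translator name and MSGID, so it becomes visible which `workdata-client-N` connection produces most of the failures. The regular expression only covers log lines of the shape quoted in this thread, one entry per line; the function name is hypothetical.

```python
import re
from collections import Counter

# Matches entries such as:
# [2024-01-20 07:23:58.562151 +0000] W [MSGID: 114031] [file.c:123:func] 0-workdata-client-11: ...
ENTRY_RE = re.compile(
    r"^\[[^\]]+\] ([A-Z]) \[MSGID: (\d+)\] \[[^\]]+\] \d+-([\w-]+): "
)


def tally_entries(lines, levels=("E", "W")):
    """Count log entries per (translator, MSGID) for the given log levels."""
    counts = Counter()
    for line in lines:
        match = ENTRY_RE.match(line)
        if match and match.group(1) in levels:
            counts[(match.group(3), match.group(2))] += 1
    return counts
```

Called with an open log file, e.g. `tally_entries(open(path))`, the `most_common()` entries of the result show which translators log the most warnings and errors.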

"If it's only 1 system, you can remove the brick, reinitialize it and
then bring it back for a full sync."
-> https://docs.gluster.org/en/v3/Administrator%20Guide/Managing%20Volumes/#replace-brick
-> Replacing bricks in Replicate/Distributed Replicate volumes

this part, right? Well, can't do this right now, as there are ~33TB of
data (many small files) to copy, that would slow down the servers /
the volume. But if the replacement is running i could do it
afterwards, just to see what happens.

Hubert

On Mon, 29 Jan 2024 at 08:21, Strahil Nikolov
<hunter86_bg@xxxxxxxxx> wrote:
>
> 2800 is too much. Most probably you are affected by a bug. How old are the clients ? Is only 1 server affected ?
> Have you checked if a client is not allowed to update all 3 copies ?
>
> If it's only 1 system, you can remove the brick, reinitialize it and then bring it back for a full sync.
>
> Best Regards,
> Strahil Nikolov
>
> On Mon, Jan 29, 2024 at 8:44, Hu Bert
> <revirii@xxxxxxxxxxxxxx> wrote:
> Morning,
> a few bad apples - but which ones? Checked glustershd.log on the "bad"
> server and counted today's "gfid mismatch" entries (2800 in total):
>
>     44 <gfid:faeea007-2f41-4a72-959f-e9e14e6a9ea4>/212>,
>     44 <gfid:faeea007-2f41-4a72-959f-e9e14e6a9ea4>/174>,
>     44 <gfid:d5c6d7b9-f217-4cc9-a664-448d034e74c2>/94037803>,
>     44 <gfid:d263ecc2-9c21-455c-9ba9-5a999c03adce>/94066216>,
>     44 <gfid:cbfd5d46-d580-4845-a544-e46fd82c1758>/249771609>,
>     44 <gfid:aecf217a-0797-43d1-9481-422a8ac8a5d0>/64235523>,
>     44 <gfid:a701d47b-b3fb-4e7e-bbfb-bc3e19632867>/185>,
>
> etc. But as i said, these are pretty new and didn't appear when the
> volume/servers started misbehaving. Are there scripts/snippets
> available for handling this?
>
> Healing would be very painful for the running system (still connected,
> but not very long anymore), as there surely are 4-5 million entries to
> be healed. I can't do this now - maybe, when the replacement is in
> productive state, one could give it a try.
>
> Thx,
> Hubert
>
> On Sun, 28 Jan 2024 at 23:12, Strahil Nikolov
> <hunter86_bg@xxxxxxxxx> wrote:
> >
> > For a gfid mismatch, manual effort is needed, but you can script it.
> > I think a few bad "apples" can break the healing, and if you fix them the healing might recover.
> >
> > Also, check why the client is not updating all copies. Most probably you have a client that is not able to connect to a brick.
> >
> > gluster volume status VOLUME_NAME clients
> >
> > Best Regards,
> > Strahil Nikolov
> >
> > On Sun, Jan 28, 2024 at 20:55, Hu Bert
> > <revirii@xxxxxxxxxxxxxx> wrote:
> > Hi Strahil,
> > there's no arbiter: 3 servers with 5 bricks each.
> >
> > Volume Name: workdata
> > Type: Distributed-Replicate
> > Volume ID: 7d1e23e5-0308-4443-a832-d36f85ff7959
> > Status: Started
> > Snapshot Count: 0
> > Number of Bricks: 5 x 3 = 15
> >
> > The "problem" is: the number of files/entries to-be-healed has
> > continuously grown since the beginning, and now we're talking about
> > way too many files to do this manually. Last time i checked: 700K per
> > brick, should be >900K at the moment. The command 'gluster volume heal
> > workdata statistics heal-count' is unable to finish. Doesn't look that
> > good :D
> >
> > Interesting, the glustershd.log on the "bad" server now shows errors like these:
> >
> > [2024-01-28 18:48:33.734053 +0000] E [MSGID: 108008]
> > [afr-self-heal-common.c:399:afr_gfid_split_brain_source]
> > 0-workdata-replicate-3: Gfid mismatch detected for
> > <gfid:70ab3d57-bd82-4932-86bf-d613db32c1ab>/803620716>,
> > 82d7939a-8919-40ea-9459-7b8af23d3b72 on workdata-client-11 and
> > bb9399a3-0a5c-4cd1-b2b1-3ee787ec835a on workdata-client-9
> >
> > Shouldn't the heals happen on the 2 "good" servers?
> >
> > Anyway... we're currently preparing a different solution for our data
> > and we'll throw away this gluster volume - no critical data will be
> > lost, as these are derived from source data (on a different volume on
> > different servers). Will be a hard time (calculating tons of data),
> > but the chosen solution should have a way better performance.
> >
> > Well... thx to all for your efforts, really appreciate that :-)
> >
> >
> > Hubert
> >
> > On Sun, 28 Jan 2024 at 08:35, Strahil Nikolov
> > <hunter86_bg@xxxxxxxxx> wrote:
> > >
> > > What about the arbiter node ?
> > > Actually, check on all nodes and script it - you might need it in the future.
> > >
> > > The simplest way to resolve it is to make the file disappear (rename it to something else and then rename it back). Another easy trick is to read the whole file: dd if=file of=/dev/null status=progress
> > >
> > > Best Regards,
> > > Strahil Nikolov
> > >
> > > On Sat, Jan 27, 2024 at 8:24, Hu Bert
> > > <revirii@xxxxxxxxxxxxxx> wrote:
> > > Morning,
> > >
> > > gfid 1:
> > > getfattr -d -e hex -m. /gluster/md{3,4,5,6,7}/workdata/.glusterfs/fa/f5/faf59566-10f5-4ddd-8b0c-a87bc6a334fb
> > >
> > > glusterpub1 (good one):
> > > getfattr: Removing leading '/' from absolute path names
> > > # file: gluster/md6/workdata/.glusterfs/fa/f5/faf59566-10f5-4ddd-8b0c-a87bc6a334fb
> > > trusted.afr.dirty=0x000000000000000000000000
> > > trusted.afr.workdata-client-11=0x000000020000000100000000
> > > trusted.gfid=0xfaf5956610f54ddd8b0ca87bc6a334fb
> > > trusted.gfid2path.c2845024cc9b402e=0x38633139626234612d396236382d343532652d623434652d3664616331666434616465652f31323878313238732e6a7067
> > > trusted.glusterfs.mdata=0x0100000000000000000000000065aaecff000000002695ebb70000000065aaecff000000002695ebb70000000065aaecff000000002533f110
> > >
> > > glusterpub3 (bad one):
> > > getfattr: /gluster/md6/workdata/.glusterfs/fa/f5/faf59566-10f5-4ddd-8b0c-a87bc6a334fb: No such file or directory
> > >
> > > gfid 2:
> > > getfattr -d -e hex -m. /gluster/md{3,4,5,6,7}/workdata/.glusterfs/60/46/60465723-5dc0-4ebe-aced-9f2c12e52642
> > >
> > > glusterpub1 (good one):
> > > getfattr: Removing leading '/' from absolute path names
> > > # file: gluster/md5/workdata/.glusterfs/60/46/60465723-5dc0-4ebe-aced-9f2c12e52642
> > > trusted.afr.dirty=0x000000000000000000000000
> > > trusted.afr.workdata-client-8=0x000000020000000100000000
> > > trusted.gfid=0x604657235dc04ebeaced9f2c12e52642
> > > trusted.gfid2path.ac4669e3c4faf926=0x33366463366137392d666135642d343238652d613738642d6234376230616662316562642f31323878313238732e6a7067
> > > trusted.glusterfs.mdata=0x0100000000000000000000000065aaecfe000000000c5403bd0000000065aaecfe000000000c5403bd0000000065aaecfe000000000ad61ee4
> > >
> > > glusterpub3 (bad one):
> > > getfattr: /gluster/md5/workdata/.glusterfs/60/46/60465723-5dc0-4ebe-aced-9f2c12e52642: No such file or directory
> > >
> > > thx,
> > > Hubert
> > >
> > > On Sat, 27 Jan 2024 at 06:13, Strahil Nikolov
> > > <hunter86_bg@xxxxxxxxx> wrote:
> > > >
> > > > You don't need to mount it.
> > > > Like this :
> > > > # getfattr -d -e hex -m. /path/to/brick/.glusterfs/00/46/00462be8-3e61-4931-8bda-dae1645c639e
> > > > # file: 00/46/00462be8-3e61-4931-8bda-dae1645c639e
> > > > trusted.gfid=0x00462be83e6149318bdadae1645c639e
> > > > trusted.gfid2path.05fcbdafdeea18ab=0x30326333373930632d386637622d346436652d393464362d3936393132313930643131312f66696c656c6f636b696e672e7079
> > > > trusted.glusterfs.mdata=0x010000000000000000000000006170340c0000000025b6a745000000006170340c0000000020efb577000000006170340c0000000020d42b07
> > > > trusted.glusterfs.shard.block-size=0x0000000004000000
> > > > trusted.glusterfs.shard.file-size=0x00000000000000cd000000000000000000000000000000010000000000000000
> > > >
> > > >
> > > > Best Regards,
> > > > Strahil Nikolov
> > > >
> > > >
> > > >
> > > > On Thursday, 25 January 2024 at 09:42:46 GMT+2, Hu Bert <revirii@xxxxxxxxxxxxxx> wrote:
> > > >
> > > >
> > > >
> > > >
> > > >
> > > > Good morning,
> > > >
> > > > hope i got it right... using:
> > > > https://access.redhat.com/documentation/de-de/red_hat_gluster_storage/3.1/html/administration_guide/ch27s02
> > > >
> > > > mount -t glusterfs -o aux-gfid-mount glusterpub1:/workdata /mnt/workdata
> > > >
> > > > gfid 1:
> > > > getfattr -n trusted.glusterfs.pathinfo -e text /mnt/workdata/.gfid/faf59566-10f5-4ddd-8b0c-a87bc6a334fb
> > > > getfattr: Removing leading '/' from absolute path names
> > > > # file: mnt/workdata/.gfid/faf59566-10f5-4ddd-8b0c-a87bc6a334fb
> > > > trusted.glusterfs.pathinfo="(<DISTRIBUTE:workdata-dht>
> > > > (<REPLICATE:workdata-replicate-3>
> > > > <POSIX(/gluster/md6/workdata):glusterpub1:/gluster/md6/workdata/images/133/283/13328349/128x128s.jpg>
> > > > <POSIX(/gluster/md6/workdata):glusterpub2:/gluster/md6/workdata/images/133/283/13328349/128x128s.jpg>))"
> > > >
> > > > gfid 2:
> > > > getfattr -n trusted.glusterfs.pathinfo -e text /mnt/workdata/.gfid/60465723-5dc0-4ebe-aced-9f2c12e52642
> > > > getfattr: Removing leading '/' from absolute path names
> > > > # file: mnt/workdata/.gfid/60465723-5dc0-4ebe-aced-9f2c12e52642
> > > > trusted.glusterfs.pathinfo="(<DISTRIBUTE:workdata-dht>
> > > > (<REPLICATE:workdata-replicate-2>
> > > > <POSIX(/gluster/md5/workdata):glusterpub2:/gluster/md5/workdata/.glusterfs/60/46/60465723-5dc0-4ebe-aced-9f2c12e52642>
> > > > <POSIX(/gluster/md5/workdata):glusterpub1:/gluster/md5/workdata/.glusterfs/60/46/60465723-5dc0-4ebe-aced-9f2c12e52642>))"
> > > >
> > > > glusterpub1 + glusterpub2 are the good ones, glusterpub3 is the
> > > > misbehaving (not healing) one.
> > > >
> > > > The file with gfid 1 is available under
> > > > /gluster/md6/workdata/images/133/283/13328349/ on glusterpub1+2
> > > > bricks, but missing on glusterpub3 brick.
> > > >
> > > > gfid 2: /gluster/md5/workdata/.glusterfs/60/46/60465723-5dc0-4ebe-aced-9f2c12e52642
> > > > is present on glusterpub1+2, but not on glusterpub3.
> > > >
> > > >
> > > > Thx,
> > > > Hubert
> > > >
> > > > On Wed, 24 Jan 2024 at 17:36, Strahil Nikolov
> > > > <hunter86_bg@xxxxxxxxx> wrote:
> > > >
> > > > >
> > > > > Hi,
> > > > >
> > > > > Can you find and check the files with gfids:
> > > > > 60465723-5dc0-4ebe-aced-9f2c12e52642
> > > > > faf59566-10f5-4ddd-8b0c-a87bc6a334fb
> > > > >
> > > > > Use 'getfattr -d -e hex -m. ' command from https://docs.gluster.org/en/main/Troubleshooting/resolving-splitbrain/#analysis-of-the-output .
> > > > >
> > > > > Best Regards,
> > > > > Strahil Nikolov
> > > > >
> > > > > On Sat, Jan 20, 2024 at 9:44, Hu Bert
> > > > > <revirii@xxxxxxxxxxxxxx> wrote:
> > > > > Good morning,
> > > > >
> > > > > thx Gilberto, did the first three (set to WARNING), but the last one
> > > > > doesn't work. Anyway, with setting these three some new messages
> > > > > appear:
> > > > >
> > > > > [2024-01-20 07:23:58.561106 +0000] W [MSGID: 114061]
> > > > > [client-common.c:796:client_pre_lk_v2] 0-workdata-client-11: remote_fd
> > > > > is -1. EBADFD [{gfid=faf59566-10f5-4ddd-8b0c-a87bc6a334fb},
> > > > > {errno=77}, {error=File descriptor in bad state}]
> > > > > [2024-01-20 07:23:58.561177 +0000] E [MSGID: 108028]
> > > > > [afr-open.c:361:afr_is_reopen_allowed_cbk] 0-workdata-replicate-3:
> > > > > Failed getlk for faf59566-10f5-4ddd-8b0c-a87bc6a334fb [File descriptor
> > > > > in bad state]
> > > > > [2024-01-20 07:23:58.562151 +0000] W [MSGID: 114031]
> > > > > [client-rpc-fops_v2.c:2561:client4_0_lookup_cbk] 0-workdata-client-11:
> > > > > remote operation failed.
> > > > > [{path=<gfid:faf59566-10f5-4ddd-8b0c-a87bc6a334fb>},
> > > > > {gfid=faf59566-10f5-4ddd-8b0c-a87bc6a334fb}, {errno=2}, {error=No such file or directory}]
> > > > > [2024-01-20 07:23:58.562296 +0000] W [MSGID: 114061]
> > > > > [client-common.c:530:client_pre_flush_v2] 0-workdata-client-11:
> > > > > remote_fd is -1. EBADFD [{gfid=faf59566-10f5-4ddd-8b0c-a87bc6a334fb},
> > > > > {errno=77}, {error=File descriptor in bad state}]
> > > > > [2024-01-20 07:23:58.860552 +0000] W [MSGID: 114061]
> > > > > [client-common.c:796:client_pre_lk_v2] 0-workdata-client-8: remote_fd
> > > > > is -1. EBADFD [{gfid=60465723-5dc0-4ebe-aced-9f2c12e52642},
> > > > > {errno=77}, {error=File descriptor in bad state}]
> > > > > [2024-01-20 07:23:58.860608 +0000] E [MSGID: 108028]
> > > > > [afr-open.c:361:afr_is_reopen_allowed_cbk] 0-workdata-replicate-2:
> > > > > Failed getlk for 60465723-5dc0-4ebe-aced-9f2c12e52642 [File descriptor
> > > > > in bad state]
> > > > > [2024-01-20 07:23:58.861520 +0000] W [MSGID: 114031]
> > > > > [client-rpc-fops_v2.c:2561:client4_0_lookup_cbk] 0-workdata-client-8:
> > > > > remote operation failed.
> > > > > [{path=<gfid:60465723-5dc0-4ebe-aced-9f2c12e52642>},
> > > > > {gfid=60465723-5dc0-4ebe-aced-9f2c12e52642}, {errno=2}, {error=No such file or directory}]
> > > > > [2024-01-20 07:23:58.861640 +0000] W [MSGID: 114061]
> > > > > [client-common.c:530:client_pre_flush_v2] 0-workdata-client-8:
> > > > > remote_fd is -1. EBADFD [{gfid=60465723-5dc0-4ebe-aced-9f2c12e52642},
> > > > > {errno=77}, {error=File descriptor in bad state}]
> > > > >
> > > > > Not many log entries appear, only a few. Has someone seen error
> > > > > messages like these? Setting diagnostics.brick-sys-log-level to DEBUG
> > > > > shows way more log entries, uploaded it to:
> > > > > https://file.io/spLhlcbMCzr8 - not sure if that helps.
> > > > >
> > > > >
> > > > > Thx,
> > > > > Hubert
> > > > >
> > > > > On Fri, 19 Jan 2024 at 16:24, Gilberto Ferreira
> > > > > <gilberto.nunes32@xxxxxxxxx> wrote:
> > > > >
> > > > > >
> > > > > > gluster volume set testvol diagnostics.brick-log-level WARNING
> > > > > > gluster volume set testvol diagnostics.brick-sys-log-level WARNING
> > > > > > gluster volume set testvol diagnostics.client-log-level ERROR
> > > > > > gluster --log-level=ERROR volume status
> > > > > >
> > > > > > ---
> > > > > > Gilberto Nunes Ferreira
> > > > > >
> > > > > >
> > > > > >
> > > > > >
> > > > > >
> > > > > >
> > > > > > On Fri, 19 Jan 2024 at 05:49, Hu Bert <revirii@xxxxxxxxxxxxxx> wrote:
> > > > > >>
> > > > > >> Hi Strahil,
> > > > > >> hm, don't get me wrong, it may sound a bit stupid, but... where do i
> > > > > >> set the log level? Using debian...
> > > > > >>
> > > > > >> https://access.redhat.com/documentation/de-de/red_hat_gluster_storage/3/html/administration_guide/configuring_the_log_level
> > > > > >>
> > > > > >> ls /etc/glusterfs/
> > > > > >> eventsconfig.json  glusterfs-georep-logrotate
> > > > > >> gluster-rsyslog-5.8.conf  group-db-workload      group-gluster-block
> > > > > >>  group-nl-cache  group-virt.example  logger.conf.example
> > > > > >> glusterd.vol      glusterfs-logrotate
> > > > > >> gluster-rsyslog-7.2.conf  group-distributed-virt  group-metadata-cache
> > > > > >>  group-samba    gsyncd.conf        thin-arbiter.vol
> > > > > >>
> > > > > >> checked: /etc/glusterfs/logger.conf.example
> > > > > >>
> > > > > >> # To enable enhanced logging capabilities,
> > > > > >> #
> > > > > >> # 1. rename this file to /etc/glusterfs/logger.conf
> > > > > >> #
> > > > > >> # 2. rename /etc/rsyslog.d/gluster.conf.example to
> > > > > >> #    /etc/rsyslog.d/gluster.conf
> > > > > >> #
> > > > > >> # This change requires restart of all gluster services/volumes and
> > > > > >> # rsyslog.
> > > > > >>
> > > > > >> tried (to test): /etc/glusterfs/logger.conf with " LOG_LEVEL='WARNING' "
> > > > > >>
> > > > > >> restart glusterd on that node, but this doesn't work, log-level stays
> > > > > >> on INFO. /etc/rsyslog.d/gluster.conf.example does not exist. Probably
> > > > > >> /etc/rsyslog.conf on debian. But first it would be better to know
> > > > > >> where to set the log-level for glusterd.
> > > > > >>
> > > > > >> Depending on how much the DEBUG log-level talks ;-) i could assign up
> > > > > >> to 100G to /var
> > > > > >>
> > > > > >>
> > > > > >> Thx & best regards,
> > > > > >> Hubert
> > > > > >>
> > > > > >>
> > > > > > >> On Thu, 18 Jan 2024 at 22:58, Strahil Nikolov
> > > > > > >> <hunter86_bg@xxxxxxxxx> wrote:
> > > > > >> >
> > > > > >> > Are you able to set the logs to debug level ?
> > > > > > >> > It might provide a clue about what is going on.
> > > > > >> >
> > > > > >> > Best Regards,
> > > > > >> > Strahil Nikolov
> > > > > >> >
> > > > > >> > On Thu, Jan 18, 2024 at 13:08, Diego Zuccato
> > > > > >> > <diego.zuccato@xxxxxxxx> wrote:
> > > > > >> > That's the same kind of errors I keep seeing on my 2 clusters,
> > > > > >> > regenerated some months ago. Seems a pseudo-split-brain that should be
> > > > > >> > impossible on a replica 3 cluster but keeps happening.
> > > > > >> > Sadly going to ditch Gluster ASAP.
> > > > > >> >
> > > > > >> > Diego
> > > > > >> >
> > > > > > >> > On 18/01/2024 07:11, Hu Bert wrote:
> > > > > >> > > Good morning,
> > > > > >> > > heal still not running. Pending heals now sum up to 60K per brick.
> > > > > >> > > Heal was starting instantly e.g. after server reboot with version
> > > > > >> > > 10.4, but doesn't with version 11. What could be wrong?
> > > > > >> > >
> > > > > >> > > I only see these errors on one of the "good" servers in glustershd.log:
> > > > > >> > >
> > > > > >> > > [2024-01-18 06:08:57.328480 +0000] W [MSGID: 114031]
> > > > > >> > > [client-rpc-fops_v2.c:2561:client4_0_lookup_cbk] 0-workdata-client-0:
> > > > > >> > > remote operation failed.
> > > > > >> > > [{path=<gfid:cb39a1e4-2a4c-4727-861d-3ed9ef00681b>},
> > > > > > >> > > {gfid=cb39a1e4-2a4c-4727-861d-3ed9ef00681b}, {errno=2}, {error=No such file or directory}]
> > > > > >> > > [2024-01-18 06:08:57.594051 +0000] W [MSGID: 114031]
> > > > > >> > > [client-rpc-fops_v2.c:2561:client4_0_lookup_cbk] 0-workdata-client-1:
> > > > > >> > > remote operation failed.
> > > > > >> > > [{path=<gfid:3e9b178c-ae1f-4d85-ae47-fc539d94dd11>},
> > > > > > >> > > {gfid=3e9b178c-ae1f-4d85-ae47-fc539d94dd11}, {errno=2}, {error=No such file or directory}]
> > > > > >> > >
> > > > > >> > > About 7K today. Any ideas? Someone?
> > > > > >> > >
> > > > > >> > >
> > > > > >> > > Best regards,
> > > > > >> > > Hubert
> > > > > >> > >
> > > > > > >> > > On Wed, 17 Jan 2024 at 11:24, Hu Bert <revirii@xxxxxxxxxxxxxx> wrote:
> > > > > >> > >>
> > > > > > >> > >> ok, finally managed to get all servers, volumes etc. running, but it took
> > > > > > >> > >> a couple of restarts, cksum checks etc.
> > > > > >> > >>
> > > > > >> > >> One problem: a volume doesn't heal automatically or doesn't heal at all.
> > > > > >> > >>
> > > > > >> > >> gluster volume status
> > > > > >> > >> Status of volume: workdata
> > > > > >> > >> Gluster process                            TCP Port  RDMA Port  Online  Pid
> > > > > >> > >> ------------------------------------------------------------------------------
> > > > > >> > >> Brick glusterpub1:/gluster/md3/workdata    58832    0          Y      3436
> > > > > >> > >> Brick glusterpub2:/gluster/md3/workdata    59315    0          Y      1526
> > > > > >> > >> Brick glusterpub3:/gluster/md3/workdata    56917    0          Y      1952
> > > > > >> > >> Brick glusterpub1:/gluster/md4/workdata    59688    0          Y      3755
> > > > > >> > >> Brick glusterpub2:/gluster/md4/workdata    60271    0          Y      2271
> > > > > >> > >> Brick glusterpub3:/gluster/md4/workdata    49461    0          Y      2399
> > > > > >> > >> Brick glusterpub1:/gluster/md5/workdata    54651    0          Y      4208
> > > > > >> > >> Brick glusterpub2:/gluster/md5/workdata    49685    0          Y      2751
> > > > > >> > >> Brick glusterpub3:/gluster/md5/workdata    59202    0          Y      2803
> > > > > >> > >> Brick glusterpub1:/gluster/md6/workdata    55829    0          Y      4583
> > > > > >> > >> Brick glusterpub2:/gluster/md6/workdata    50455    0          Y      3296
> > > > > >> > >> Brick glusterpub3:/gluster/md6/workdata    50262    0          Y      3237
> > > > > >> > >> Brick glusterpub1:/gluster/md7/workdata    52238    0          Y      5014
> > > > > >> > >> Brick glusterpub2:/gluster/md7/workdata    52474    0          Y      3673
> > > > > >> > >> Brick glusterpub3:/gluster/md7/workdata    57966    0          Y      3653
> > > > > >> > >> Self-heal Daemon on localhost              N/A      N/A        Y      4141
> > > > > >> > >> Self-heal Daemon on glusterpub1            N/A      N/A        Y      5570
> > > > > >> > >> Self-heal Daemon on glusterpub2            N/A      N/A        Y      4139
> > > > > >> > >>
> > > > > >> > >> "gluster volume heal workdata info" lists a lot of files per brick.
> > > > > >> > >> "gluster volume heal workdata statistics heal-count" shows thousands
> > > > > >> > >> of files per brick.
> > > > > >> > >> "gluster volume heal workdata enable" has no effect.
> > > > > >> > >>
> > > > > >> > >> gluster volume heal workdata full
> > > > > >> > >> Launching heal operation to perform full self heal on volume workdata
> > > > > >> > >> has been successful
> > > > > >> > >> Use heal info commands to check status.
> > > > > >> > >>
> > > > > >> > >> -> not doing anything at all. And nothing happening on the 2 "good"
> > > > > >> > >> servers in e.g. glustershd.log. Heal was working as expected on
> > > > > >> > >> version 10.4, but here... silence. Someone has an idea?
> > > > > >> > >>
> > > > > >> > >>
> > > > > >> > >> Best regards,
> > > > > >> > >> Hubert
> > > > > >> > >>
> > > > > > >> > >> On Tue, 16 Jan 2024 at 13:44, Gilberto Ferreira
> > > > > > >> > >> <gilberto.nunes32@xxxxxxxxx> wrote:
> > > > > >> > >>>
> > > > > >> > >>> Ah! Indeed! You need to perform an upgrade in the clients as well.
> > > > > >> > >>>
> > > > > >> > >>>
> > > > > >> > >>>
> > > > > >> > >>>
> > > > > >> > >>>
> > > > > >> > >>>
> > > > > >> > >>>
> > > > > >> > >>>
> > > > > > >> > >>> On Tue, 16 Jan 2024 at 03:12, Hu Bert <revirii@xxxxxxxxxxxxxx> wrote:
> > > > > >> > >>>>
> > > > > >> > >>>> morning to those still reading :-)
> > > > > >> > >>>>
> > > > > >> > >>>> i found this: https://docs.gluster.org/en/main/Troubleshooting/troubleshooting-glusterd/#common-issues-and-how-to-resolve-them
> > > > > >> > >>>>
> > > > > >> > >>>> there's a paragraph about "peer rejected" with the same error message,
> > > > > >> > >>>> telling me: "Update the cluster.op-version" - i had only updated the
> > > > > >> > >>>> server nodes, but not the clients. So upgrading the cluster.op-version
> > > > > >> > >>>> wasn't possible at this time. So... upgrading the clients to version
> > > > > >> > >>>> 11.1 and then the op-version should solve the problem?
> > > > > >> > >>>>
> > > > > >> > >>>>
> > > > > >> > >>>> Thx,
> > > > > >> > >>>> Hubert
> > > > > >> > >>>>
> > > > > > >> > >>>> On Mon, 15 Jan 2024 at 09:16, Hu Bert <revirii@xxxxxxxxxxxxxx> wrote:
> > > > > >> > >>>>>
> > > > > >> > >>>>> Hi,
> > > > > >> > >>>>> just upgraded some gluster servers from version 10.4 to version 11.1.
> > > > > >> > >>>>> Debian bullseye & bookworm. When only installing the packages: good,
> > > > > >> > >>>>> servers, volumes etc. work as expected.
> > > > > >> > >>>>>
> > > > > >> > >>>>> But one needs to test if the systems work after a daemon and/or server
> > > > > >> > >>>>> restart. Well, did a reboot, and after that the rebooted/restarted
> > > > > >> > >>>>> system is "out". Log message from working node:
> > > > > >> > >>>>>
> > > > > >> > >>>>> [2024-01-15 08:02:21.585694 +0000] I [MSGID: 106163]
> > > > > >> > >>>>> [glusterd-handshake.c:1501:__glusterd_mgmt_hndsk_versions_ack]
> > > > > >> > >>>>> 0-management: using the op-version 100000
> > > > > >> > >>>>> [2024-01-15 08:02:21.589601 +0000] I [MSGID: 106490]
> > > > > >> > >>>>> [glusterd-handler.c:2546:__glusterd_handle_incoming_friend_req]
> > > > > >> > >>>>> 0-glusterd: Received probe from uuid:
> > > > > >> > >>>>> b71401c3-512a-47cb-ac18-473c4ba7776e
> > > > > >> > >>>>> [2024-01-15 08:02:23.608349 +0000] E [MSGID: 106010]
> > > > > >> > >>>>> [glusterd-utils.c:3824:glusterd_compare_friend_volume] 0-management:
> > > > > >> > >>>>> Version of Cksums sourceimages differ. local cksum = 2204642525,
> > > > > >> > >>>>> remote cksum = 1931483801 on peer gluster190
> > > > > >> > >>>>> [2024-01-15 08:02:23.608584 +0000] I [MSGID: 106493]
> > > > > >> > >>>>> [glusterd-handler.c:3819:glusterd_xfer_friend_add_resp] 0-glusterd:
> > > > > >> > >>>>> Responded to gluster190 (0), ret: 0, op_ret: -1
> > > > > >> > >>>>> [2024-01-15 08:02:23.613553 +0000] I [MSGID: 106493]
> > > > > >> > >>>>> [glusterd-rpc-ops.c:467:__glusterd_friend_add_cbk] 0-glusterd:
> > > > > >> > >>>>> Received RJT from uuid: b71401c3-512a-47cb-ac18-473c4ba7776e, host:
> > > > > >> > >>>>> gluster190, port: 0
> > > > > >> > >>>>>
> > > > > >> > >>>>> peer status from rebooted node:
> > > > > >> > >>>>>
> > > > > >> > >>>>> root@gluster190 ~ # gluster peer status
> > > > > >> > >>>>> Number of Peers: 2
> > > > > >> > >>>>>
> > > > > >> > >>>>> Hostname: gluster189
> > > > > >> > >>>>> Uuid: 50dc8288-aa49-4ea8-9c6c-9a9a926c67a7
> > > > > >> > >>>>> State: Peer Rejected (Connected)
> > > > > >> > >>>>>
> > > > > >> > >>>>> Hostname: gluster188
> > > > > >> > >>>>> Uuid: e15a33fe-e2f7-47cf-ac53-a3b34136555d
> > > > > >> > >>>>> State: Peer Rejected (Connected)
> > > > > >> > >>>>>
> > > > > >> > >>>>> So the rebooted gluster190 is not accepted anymore. And thus does not
> > > > > >> > >>>>> appear in "gluster volume status". I then followed this guide:
> > > > > >> > >>>>>
> > > > > >> > >>>>> https://gluster-documentations.readthedocs.io/en/latest/Administrator%20Guide/Resolving%20Peer%20Rejected/
> > > > > >> > >>>>>
> > > > > >> > >>>>> Remove everything under /var/lib/glusterd/ (except glusterd.info) and
> > > > > >> > >>>>> restart glusterd service etc. Data get copied from other nodes,
> > > > > >> > >>>>> 'gluster peer status' is ok again - but the volume info is missing,
> > > > > >> > >>>>> /var/lib/glusterd/vols is empty. When syncing this dir from another
> > > > > >> > >>>>> node, the volume then is available again, heals start etc.
> > > > > >> > >>>>>
> > > > > >> > >>>>> Well, and just to be sure that everything's working as it should,
> > > > > >> > >>>>> rebooted that node again - the rebooted node is kicked out again, and
> > > > > >> > >>>>> you have to restart bringing it back again.
> > > > > >> > >>>>>
> > > > > >> > >>>>> Sry, but did i miss anything? Has someone experienced similar
> > > > > >> > >>>>> problems? I'll probably downgrade to 10.4 again, that version was
> > > > > >> > >>>>> working...
> > > > > >> > >>>>>
> > > > > >> > >>>>>
> > > > > >> > >>>>> Thx,
> > > > > >> > >>>>> Hubert
> > > > > >> > >>>> ________
> > > > > >> > >>>>
> > > > > >> > >>>>
> > > > > >> > >>>>
> > > > > >> > >>>> Community Meeting Calendar:
> > > > > >> > >>>>
> > > > > >> > >>>> Schedule -
> > > > > >> > >>>> Every 2nd and 4th Tuesday at 14:30 IST / 09:00 UTC
> > > > > >> > >>>> Bridge: https://meet.google.com/cpu-eiue-hvk
> > > > > >> > >>>> Gluster-users mailing list
> > > > > >> > >>>> Gluster-users@xxxxxxxxxxx
> > > > > >> > >>>> https://lists.gluster.org/mailman/listinfo/gluster-users
> > > > > >> >
> > > > > >> > --
> > > > > >> > Diego Zuccato
> > > > > >> > DIFA - Dip. di Fisica e Astronomia
> > > > > >> > Servizi Informatici
> > > > > >> > Alma Mater Studiorum - Università di Bologna
> > > > > >> > V.le Berti-Pichat 6/2 - 40127 Bologna - Italy
> > > > > >> > tel.: +39 051 20 95786
> > > > > >> >
