On Mon, May 17, 2021 at 4:22 PM Marco Fais <evilmf@xxxxxxxxx> wrote:
> Hi,
>
> I am having significant issues with glustershd with releases 8.4 and 9.1.
>
> My oVirt clusters are using Gluster storage backends, and were running fine with Gluster 7.x (shipped with earlier versions of oVirt Node 4.4.x). Recently the oVirt project moved to Gluster 8.4 for the nodes, and hence I have moved to this release when upgrading my clusters.
>
> Since then I am having issues whenever one of the nodes is brought down; when the nodes come back online the bricks are typically back up and working, but some (random) glustershd processes on the various nodes seem to have issues connecting to some of them.
When the issue happens, can you check whether the TCP port number of the brick (glusterfsd) processes displayed in `gluster volume status` matches the actual port numbers in use (i.e. the --brick-port argument) when you run `ps aux | grep glusterfsd`?

If they don't match, then glusterd has incorrect brick port information in its memory and is serving it to glustershd. Restarting glusterd, rather than killing the bricks and running `volume start force`, should fix it, although we still need to find out why glusterd serves incorrect port numbers.
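As a rough way to compare the two (a minimal sketch, assuming a volume named "myvol"; substitute your actual volume name):

    # Port that glusterd advertises for each brick
    gluster volume status myvol | grep '^Brick'

    # Port the running brick process was actually started with
    ps aux | grep '[g]lusterfsd' | grep -o 'brick-port [0-9]*'

If the two numbers disagree for a brick, `systemctl restart glusterd` on that node should refresh glusterd's port map.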
If they do match, then can you take a statedump of glustershd to check whether it is indeed disconnected from the bricks? You will need to verify that 'connected=1' appears in the statedump. See the "Self-heal is stuck/ not getting completed." section in https://docs.gluster.org/en/latest/Troubleshooting/troubleshooting-afr/. A statedump can be taken with `kill -SIGUSR1 $pid-of-glustershd`; it will be generated in the /var/run/gluster/ directory.
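Roughly like this (the <pid> below is a placeholder for the glustershd pid on that node; statedump files are normally named glusterdump.<pid>.dump.<timestamp>):

    # Find the glustershd process (it runs as a glusterfs client process)
    ps aux | grep '[g]lustershd'

    # Trigger the statedump, replacing <pid> with the pid found above
    kill -SIGUSR1 <pid>

    # Each brick's client xlator section in the dump should show connected=1
    grep 'connected' /var/run/gluster/glusterdump.<pid>.dump.*

A connected=0 entry for a brick would indicate that glustershd has not re-established its connection to that brick after the node came back up.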
Regards,
Ravi