Re: GlusterFS 9.3 - Replicate Volume (2 Bricks / 1 Arbiter) - Self-healing does not always work

Can you find the actual file on the brick via https://docs.gluster.org/en/latest/Troubleshooting/gfid-to-path/? Usually I use Method 2 in such cases.
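
For reference, this is roughly what the brick-side lookup looks like (just a sketch, assuming /data/glusterfs is the brick path and using the GFID from your heal info output):

# The GFID file sits under .glusterfs/<first 2 hex chars>/<next 2 hex chars>/<gfid> on the brick:
ls -li /data/glusterfs/.glusterfs/26/c5/26c5396c-86ff-408d-9cda-106acd2b0768

# For a regular file that entry is a hardlink, so the real path can be found via the inode:
find /data/glusterfs -samefile /data/glusterfs/.glusterfs/26/c5/26c5396c-86ff-408d-9cda-106acd2b0768

(For a directory the .glusterfs entry is a symlink instead; readlink on it shows the parent GFID and the directory name.)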

Then check the extended attributes on all bricks (including the arbiter):

getfattr -d -e hex -m . /gluster/brick/path/to/file/or/dir
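
On a healthy replica the trusted.afr.* changelog attributes should be all zeros; non-zero values mean pending heals towards the corresponding brick. Just as an illustration (the attribute names follow the volume name, so in your case they would look something like this):

trusted.afr.glusterfs-1-volume-client-1=0x000000020000000000000000
# first 8 hex digits = pending data operations, next 8 = metadata, last 8 = entries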

Also, check glustershd.log on 192.168.1.51 & 192.168.1.40 for clues.
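
(On a default install that log is /var/log/glusterfs/glustershd.log on each node; grepping for the GFID usually narrows it down, e.g. grep 26c5396c /var/log/glusterfs/glustershd.log)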

Best Regards,
Strahil Nikolov

On Fri, Oct 29, 2021 at 9:58, Thorsten Walk
<darkiop@xxxxxxxxx> wrote:
Hello GlusterFS Community,

I am using GlusterFS version 9.3 on two Intel NUCs, with a Raspberry Pi as arbiter, for a replicate volume. It serves as shared storage for a Proxmox cluster.

I use version 9.3 because I could not find a more recent ARM package for the RPi (Debian 11).

The partitions for the volume:

NUC1
nvme0n1                      259:0    0 465.8G  0 disk
└─vg_glusterfs-lv_glusterfs  253:18   0 465.8G  0 lvm  /data/glusterfs

NUC2
nvme0n1                      259:0    0 465.8G  0 disk
└─vg_glusterfs-lv_glusterfs  253:14   0 465.8G  0 lvm  /data/glusterfs

RPI
sda           8:0    1 29,8G  0 disk
└─sda1        8:1    1 29,8G  0 part /data/glusterfs


The volume was created with:

mkfs.xfs -f -i size=512 -n size=8192 -d su=128K,sw=10 -L GlusterFS /dev/vg_glusterfs/lv_glusterfs

gluster volume create glusterfs-1-volume transport tcp replica 3 arbiter 1 192.168.1.50:/data/glusterfs 192.168.1.51:/data/glusterfs 192.168.1.40:/data/glusterfs force


After some time, the volume always ends up with files that cannot be healed (in the example below: <gfid:26c5396c-86ff-408d-9cda-106acd2b0768>).

Currently the GlusterFS volume is in test mode with only 1-2 VMs running on it. So far there are no negative effects: replication and self-heal basically work, but every now and then an entry remains that cannot be healed.

Does anyone have an idea how to prevent or heal this? I have already completely rebuilt the volume, including the partitions and glusterd, to rule out leftovers from a previous setup.

If you need more information, please let me know.

Thanks a lot!


================

And here is some more info about the volume and the healing attempts:


>$ gstatus -ab
Cluster:
         Status: Healthy                 GlusterFS: 9.3
         Nodes: 3/3                      Volumes: 1/1

Volumes:

glusterfs-1-volume
                Replicate          Started (UP) - 3/3 Bricks Up  - (Arbiter Volume)
                                   Capacity: (1.82% used) 8.00 GiB/466.00 GiB (used/total)
                                   Self-Heal:
                                      192.168.1.50:/data/glusterfs (1 File(s) to heal).
                                   Bricks:
                                      Distribute Group 1:
                                         192.168.1.50:/data/glusterfs   (Online)
                                         192.168.1.51:/data/glusterfs   (Online)
                                         192.168.1.40:/data/glusterfs   (Online)
           


>$ gluster volume info
Volume Name: glusterfs-1-volume
Type: Replicate
Volume ID: f70d9b2c-b30d-4a36-b8ff-249c09c8b45d
Status: Started
Snapshot Count: 0
Number of Bricks: 1 x (2 + 1) = 3
Transport-type: tcp
Bricks:
Brick1: 192.168.1.50:/data/glusterfs
Brick2: 192.168.1.51:/data/glusterfs
Brick3: 192.168.1.40:/data/glusterfs (arbiter)
Options Reconfigured:
cluster.lookup-optimize: off
server.keepalive-count: 5
server.keepalive-interval: 2
server.keepalive-time: 10
server.tcp-user-timeout: 20
network.ping-timeout: 20
server.event-threads: 4
client.event-threads: 4
cluster.choose-local: off
user.cifs: off
features.shard: on
cluster.shd-wait-qlength: 10000
cluster.shd-max-threads: 8
cluster.locking-scheme: granular
cluster.data-self-heal-algorithm: full
cluster.server-quorum-type: server
cluster.quorum-type: auto
cluster.eager-lock: enable
performance.strict-o-direct: on
network.remote-dio: disable
performance.low-prio-threads: 32
performance.io-cache: off
performance.read-ahead: off
performance.quick-read: off
cluster.granular-entry-heal: on
storage.fips-mode-rchecksum: on
transport.address-family: inet
nfs.disable: on
performance.client-io-threads: on



>$ gluster volume heal glusterfs-1-volume
Launching heal operation to perform index self heal on volume glusterfs-1-volume has been successful
Use heal info commands to check status.



>$ gluster volume heal glusterfs-1-volume info
Brick 192.168.1.50:/data/glusterfs
<gfid:26c5396c-86ff-408d-9cda-106acd2b0768>
Status: Connected
Number of entries: 1

Brick 192.168.1.51:/data/glusterfs
Status: Connected
Number of entries: 0

Brick 192.168.1.40:/data/glusterfs
Status: Connected
Number of entries: 0



>$ gluster volume heal glusterfs-1-volume info split-brain
Brick 192.168.1.50:/data/glusterfs
Status: Connected
Number of entries in split-brain: 0

Brick 192.168.1.51:/data/glusterfs
Status: Connected
Number of entries in split-brain: 0

Brick 192.168.1.40:/data/glusterfs
Status: Connected
Number of entries in split-brain: 0
________



Community Meeting Calendar:

Schedule -
Every 2nd and 4th Tuesday at 14:30 IST / 09:00 UTC
Bridge: https://meet.google.com/cpu-eiue-hvk
Gluster-users mailing list
Gluster-users@xxxxxxxxxxx
https://lists.gluster.org/mailman/listinfo/gluster-users
