brick process crashes on "Structure needs cleaning"

Olaf Buitelaar <olaf.buitelaar@xxxxxxxxx> · Mon, 22 Feb 2021 12:52:54 +0100

Dear Users,
Somehow the brick processes seem to crash on XFS filesystem errors. It seems to depend on the way the gluster process is started. When this happens, gluster also sends a message to the console saying that the process will go down; however, it doesn't really seem to go down:

M [MSGID: 113075] [posix-helpers.c:2185:posix_health_check_thread_proc] 0-ovirt-engine-posix: health-check failed, going down
 M [MSGID: 113075] [posix-helpers.c:2203:posix_health_check_thread_proc] 0-ovirt-engine-posix: still alive! -> SIGTERM

In the brick log, a message like this is logged:
[posix-helpers.c:2111:posix_fs_health_check] 0-ovirt-data-posix: aio_read_cmp_buf() on /data5/gfs/bricks/brick1/ovirt-data/.glusterfs/health_check returned ret is -1 error is Structure needs cleaning

Or like this:
 W [MSGID: 113075] [posix-helpers.c:2111:posix_fs_health_check] 0-ovirt-mon-2-posix: aio_read_buf() on /data0/gfs/bricks/bricka/ovirt-mon-2/.glusterfs/health_check returned ret is -1 error is Success
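
For reference, the two error strings in these log lines look like plain errno text. On Linux, "Structure needs cleaning" is the message for EUCLEAN (117), which XFS uses to signal on-disk corruption, and "Success" is how glibc renders errno 0, which would suggest the check failed without errno being set. A small sketch to confirm the mapping on a given host; the helper name `describe` is made up for illustration, and the fallback value 117 assumes Linux:

```python
import errno
import os

# EUCLEAN only exists on Linux; fall back to its Linux value (117) elsewhere.
EUCLEAN = getattr(errno, "EUCLEAN", 117)


def describe(err):
    """Return 'NAME (number): message' for an errno value (illustrative helper)."""
    default = "no error" if err == 0 else "unknown"
    name = errno.errorcode.get(err, default)
    return f"{name} ({err}): {os.strerror(err)}"


if __name__ == "__main__":
    # The exact message text depends on the platform's C library.
    print(describe(EUCLEAN))
    print(describe(0))
```
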

When I check the actual file, it just seems to contain a timestamp:
cat /data0/gfs/bricks/bricka/ovirt-mon-2/.glusterfs/health_check
2021-01-28 09:08:01⏎

I also don't see any errors in dmesg about problems accessing it.
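
To try to reproduce the failure outside gluster, a write-then-read-back probe along the following lines might help. This is only a rough sketch, assuming the health check does something similar to writing a timestamp and reading it back. It is not Gluster's code: the log lines mention aio_read_cmp_buf() and aio_read_buf(), so the real check appears to use AIO, while this sketch uses ordinary synchronous calls. The function name `probe` and the file name `probe_check` are made up for illustration.

```python
import os
import tempfile
import time
from typing import Tuple


def probe(path: str) -> Tuple[bool, str]:
    """Write a timestamp to path, read it back, and compare.

    Returns (ok, detail). Hypothetical helper, not part of Gluster.
    """
    stamp = time.strftime("%Y-%m-%d %H:%M:%S").encode()
    try:
        fd = os.open(path, os.O_CREAT | os.O_TRUNC | os.O_RDWR, 0o644)
        try:
            os.write(fd, stamp)
            os.fsync(fd)                    # push the write to the device
            os.lseek(fd, 0, os.SEEK_SET)
            data = os.read(fd, len(stamp))  # may be served from the page cache
        finally:
            os.close(fd)
    except OSError as exc:
        # A corrupted XFS structure would be expected to surface here as EUCLEAN.
        return False, f"errno {exc.errno}: {exc.strerror}"
    if data != stamp:
        return False, f"mismatch: wrote {stamp!r}, read {data!r}"
    return True, stamp.decode()


if __name__ == "__main__":
    # Demo on a temporary directory; point this at a scratch file on the
    # brick's filesystem (not the real health_check file) to probe the device.
    with tempfile.TemporaryDirectory() as tmp:
        print(probe(os.path.join(tmp, "probe_check")))
```

Running this in a loop against a scratch file on the affected filesystem, with the brick both running and stopped, could show whether the error is visible to other processes or only to the brick process.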

When I unmount the filesystem and run xfs_repair on it, no errors or issues are reported. Also, when I mount the filesystem again, it's reported as a clean mount:
[2478552.169540] XFS (dm-23): Mounting V5 Filesystem
[2478552.180645] XFS (dm-23): Ending clean mount

When I kill the brick process and start it with "gluster v start x force", the issue seems much less likely to occur. But when the brick is started from a fresh reboot, or when I kill the process and let glusterd start it (e.g. service glusterd start), the error seems to arise after a couple of minutes.

I am making use of LVM cache (in writethrough mode); maybe that's related. Also, the disks themselves are backed by a hardware RAID controller, and I did inspect all disks for SMART errors.

Does anybody have experience with this, or a clue about what might be causing it?

Thanks, Olaf

________

Community Meeting Calendar:

Schedule -
Every 2nd and 4th Tuesday at 14:30 IST / 09:00 UTC
Bridge: https://meet.google.com/cpu-eiue-hvk
Gluster-users mailing list
Gluster-users@xxxxxxxxxxx
https://lists.gluster.org/mailman/listinfo/gluster-users