Re: glusterfs health-check failed, (brick) going down

Olaf Buitelaar <olaf.buitelaar@xxxxxxxxx> · Thu, 8 Jul 2021 15:29:06 +0200

Hi Jiri,
your problem looks pretty similar to mine, see: https://lists.gluster.org/pipermail/gluster-users/2021-February/039134.html
Any chance you also see the xfs errors in the brick logs?
For me the situation improved once I disabled brick multiplexing, but I don't see that in your volume configuration.
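
One way to look for such errors is to list what a brick logged shortly before each health-check failure. The following is only a minimal Python sketch: it assumes the brick log entries have the layout seen in the grep output quoted below (`[timestamp] LEVEL rest`), and the function names, the 60-second window and the sample "E" entry are hypothetical, made up purely for illustration.

```python
import re
from datetime import datetime, timedelta

# Assumed entry layout: "[YYYY-MM-DD HH:MM:SS.ffffff] L rest-of-line"
ENTRY = re.compile(
    r"^\[(?P<ts>\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}\.\d{1,6})\] "
    r"(?P<level>[A-Z]) (?P<rest>.*)$"
)

def parse(line):
    """Split one brick log line into (timestamp, level, rest), or None."""
    m = ENTRY.match(line)
    if not m:
        return None
    ts = datetime.strptime(m.group("ts"), "%Y-%m-%d %H:%M:%S.%f")
    return ts, m.group("level"), m.group("rest")

def context_before_failures(lines, window_seconds=60, levels=("W", "E", "C")):
    """For each health-check failure, collect the warning/error/critical
    entries logged in the preceding window_seconds."""
    entries = [e for e in map(parse, lines) if e is not None]
    window = timedelta(seconds=window_seconds)
    result = []
    for ts, _level, rest in entries:
        if "health-check failed, going down" in rest:
            before = [
                (t, lv, r)
                for t, lv, r in entries
                if lv in levels and ts - window <= t < ts
            ]
            result.append((ts, before))
    return result

sample = [
    # Hypothetical entry, invented for this demo only:
    "[2021-07-07 07:13:30.000000] E [example.c:1:example] 0-vms-posix: placeholder error",
    # Entry taken from the quoted log:
    "[2021-07-07 07:13:37.408184] M [MSGID: 113075] "
    "[posix-helpers.c:2214:posix_health_check_thread_proc] 0-vms-posix: "
    "health-check failed, going down",
]
for failed_at, before in context_before_failures(sample):
    print(failed_at, [rest for _t, _lv, rest in before])
```

Run against the lines of a single brick log (not the grep output, which prefixes each line with the file name), this would show whether anything was reported just before the brick went down.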

Cheers Olaf

On Thu, 8 Jul 2021 at 12:28, Jiří Sléžka <jiri.slezka@xxxxxx> wrote:
Hello gluster community,

I am new to this list, but I have been using glusterfs for a long time as our SDS solution for storing 80+ TiB of data. I'm also using glusterfs for a small 3-node HCI cluster with oVirt 4.4.6 and CentOS 8 (not Stream yet). The glusterfs version here is 8.5-2.el8.x86_64.

From time to time, a (I believe) random brick on a random host goes down because of the health check. It looks like this:

[root@ovirt-hci02 ~]# grep "posix_health_check" /var/log/glusterfs/bricks/*
/var/log/glusterfs/bricks/gluster_bricks-vms2-vms2.log:[2021-07-07 07:13:37.408184] M [MSGID: 113075] [posix-helpers.c:2214:posix_health_check_thread_proc] 0-vms-posix: health-check failed, going down
/var/log/glusterfs/bricks/gluster_bricks-vms2-vms2.log:[2021-07-07 07:13:37.408407] M [MSGID: 113075] [posix-helpers.c:2232:posix_health_check_thread_proc] 0-vms-posix: still alive! -> SIGTERM
/var/log/glusterfs/bricks/gluster_bricks-vms2-vms2.log:[2021-07-07 16:11:14.518971] M [MSGID: 113075] [posix-helpers.c:2214:posix_health_check_thread_proc] 0-vms-posix: health-check failed, going down
/var/log/glusterfs/bricks/gluster_bricks-vms2-vms2.log:[2021-07-07 16:11:14.519200] M [MSGID: 113075] [posix-helpers.c:2232:posix_health_check_thread_proc] 0-vms-posix: still alive! -> SIGTERM

On another host:

[root@ovirt-hci01 ~]# grep "posix_health_check" /var/log/glusterfs/bricks/*
/var/log/glusterfs/bricks/gluster_bricks-engine-engine.log:[2021-07-05 13:15:51.983327] M [MSGID: 113075] [posix-helpers.c:2214:posix_health_check_thread_proc] 0-engine-posix: health-check failed, going down
/var/log/glusterfs/bricks/gluster_bricks-engine-engine.log:[2021-07-05 13:15:51.983728] M [MSGID: 113075] [posix-helpers.c:2232:posix_health_check_thread_proc] 0-engine-posix: still alive! -> SIGTERM
/var/log/glusterfs/bricks/gluster_bricks-vms2-vms2.log:[2021-07-05 01:53:35.769129] M [MSGID: 113075] [posix-helpers.c:2214:posix_health_check_thread_proc] 0-vms-posix: health-check failed, going down
/var/log/glusterfs/bricks/gluster_bricks-vms2-vms2.log:[2021-07-05 01:53:35.769819] M [MSGID: 113075] [posix-helpers.c:2232:posix_health_check_thread_proc] 0-vms-posix: still alive! -> SIGTERM
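
To get an overview of how often each brick is affected, grep output like the above can be summarized per brick log. This is only a rough Python sketch, assuming every log entry arrives as a single line in the form grep prints it (`path:[timestamp] ...`); the function name and the regular expression are illustrative and not part of any gluster tooling.

```python
import re
from collections import defaultdict

# One grep hit: "<brick log path>:[<timestamp>] ... health-check failed, going down"
FAILURE = re.compile(
    r"^(?P<log>[^:]+\.log):\[(?P<ts>[^\]]+)\].*health-check failed, going down"
)

def summarize_failures(grep_lines):
    """Return {brick log path: [timestamps of health-check failures]}."""
    failures = defaultdict(list)
    for line in grep_lines:
        m = FAILURE.match(line)
        if m:  # the follow-up "still alive! -> SIGTERM" lines are skipped
            failures[m.group("log")].append(m.group("ts"))
    return dict(failures)

# Two of the lines from ovirt-hci02 quoted above.
sample = [
    "/var/log/glusterfs/bricks/gluster_bricks-vms2-vms2.log:[2021-07-07 07:13:37.408184] "
    "M [MSGID: 113075] [posix-helpers.c:2214:posix_health_check_thread_proc] "
    "0-vms-posix: health-check failed, going down",
    "/var/log/glusterfs/bricks/gluster_bricks-vms2-vms2.log:[2021-07-07 07:13:37.408407] "
    "M [MSGID: 113075] [posix-helpers.c:2232:posix_health_check_thread_proc] "
    "0-vms-posix: still alive! -> SIGTERM",
]
for log, stamps in summarize_failures(sample).items():
    print(log, len(stamps), stamps)
```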

I cannot link these errors to any storage/fs issue (in dmesg or /var/log/messages), and the brick devices look healthy (smartd).

I can force-start the brick with

gluster volume start vms|engine force

and after some healing everything works fine for a few days.

Has anybody else observed this behavior?

The vms volume has this structure (two bricks per host, each on a separate JBOD SSD disk); the engine volume has one brick on each host.

gluster volume info vms

Volume Name: vms
Type: Distributed-Replicate
Volume ID: 52032ec6-99d4-4210-8fb8-ffbd7a1e0bf7
Status: Started
Snapshot Count: 0
Number of Bricks: 2 x 3 = 6
Transport-type: tcp
Bricks:
Brick1: 10.0.4.11:/gluster_bricks/vms/vms
Brick2: 10.0.4.13:/gluster_bricks/vms/vms
Brick3: 10.0.4.12:/gluster_bricks/vms/vms
Brick4: 10.0.4.11:/gluster_bricks/vms2/vms2
Brick5: 10.0.4.13:/gluster_bricks/vms2/vms2
Brick6: 10.0.4.12:/gluster_bricks/vms2/vms2
Options Reconfigured:
cluster.granular-entry-heal: enable
performance.stat-prefetch: off
cluster.eager-lock: enable
performance.io-cache: off
performance.read-ahead: off
performance.quick-read: off
user.cifs: off
network.ping-timeout: 30
network.remote-dio: off
performance.strict-o-direct: on
performance.low-prio-threads: 32
features.shard: on
storage.owner-gid: 36
storage.owner-uid: 36
transport.address-family: inet
storage.fips-mode-rchecksum: on
nfs.disable: on
performance.client-io-threads: off
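
For readers less familiar with the "2 x 3 = 6" notation: it means two replica sets of three bricks each, with consecutive bricks in the list forming one set. A small Python sketch of that grouping follows; the function name is made up for illustration, and the brick list is copied from the output above.

```python
def replica_sets(bricks, replica_count):
    """Group an ordered brick list into consecutive replica sets."""
    if len(bricks) % replica_count != 0:
        raise ValueError("brick count must be a multiple of the replica count")
    return [
        bricks[i:i + replica_count]
        for i in range(0, len(bricks), replica_count)
    ]

bricks = [
    "10.0.4.11:/gluster_bricks/vms/vms",
    "10.0.4.13:/gluster_bricks/vms/vms",
    "10.0.4.12:/gluster_bricks/vms/vms",
    "10.0.4.11:/gluster_bricks/vms2/vms2",
    "10.0.4.13:/gluster_bricks/vms2/vms2",
    "10.0.4.12:/gluster_bricks/vms2/vms2",
]

for number, group in enumerate(replica_sets(bricks, 3), start=1):
    # Each set should span all three hosts, one brick per host.
    hosts = sorted({brick.split(":", 1)[0] for brick in group})
    print(f"replica set {number}: hosts {hosts}")
    for brick in group:
        print("   ", brick)
```

Under this layout the vms2 bricks make up the second replica set, so a vms2 brick going down on one host leaves that set with two of its three copies until the brick is restarted and healed.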

Cheers,

Jiri

________

Community Meeting Calendar:

Schedule -

Every 2nd and 4th Tuesday at 14:30 IST / 09:00 UTC

Bridge: https://meet.google.com/cpu-eiue-hvk

Gluster-users mailing list

Gluster-users@xxxxxxxxxxx

https://lists.gluster.org/mailman/listinfo/gluster-users
