Re: glusterfs health-check failed, (brick) going down

Hi Olaf,

thanks for the reply.

On 7/8/21 3:29 PM, Olaf Buitelaar wrote:
Hi Jiri,

your problem looks pretty similar to mine, see: https://lists.gluster.org/pipermail/gluster-users/2021-February/039134.html
Any chance you also see the XFS errors in the brick logs?

yes, I can see these log lines related to the "health-check failed" items:

[root@ovirt-hci02 ~]# grep "aio_read" /var/log/glusterfs/bricks/*
/var/log/glusterfs/bricks/gluster_bricks-vms2-vms2.log:[2021-07-07 07:13:37.408010] W [MSGID: 113075] [posix-helpers.c:2135:posix_fs_health_check] 0-vms-posix: aio_read_cmp_buf() on /gluster_bricks/vms2/vms2/.glusterfs/health_check returned ret is -1 error is Structure needs cleaning
/var/log/glusterfs/bricks/gluster_bricks-vms2-vms2.log:[2021-07-07 16:11:14.518844] W [MSGID: 113075] [posix-helpers.c:2135:posix_fs_health_check] 0-vms-posix: aio_read_cmp_buf() on /gluster_bricks/vms2/vms2/.glusterfs/health_check returned ret is -1 error is Structure needs cleaning

[root@ovirt-hci01 ~]# grep "aio_read" /var/log/glusterfs/bricks/*
/var/log/glusterfs/bricks/gluster_bricks-engine-engine.log:[2021-07-05 13:15:51.982938] W [MSGID: 113075] [posix-helpers.c:2135:posix_fs_health_check] 0-engine-posix: aio_read_cmp_buf() on /gluster_bricks/engine/engine/.glusterfs/health_check returned ret is -1 error is Structure needs cleaning
/var/log/glusterfs/bricks/gluster_bricks-vms2-vms2.log:[2021-07-05 01:53:35.768534] W [MSGID: 113075] [posix-helpers.c:2135:posix_fs_health_check] 0-vms-posix: aio_read_cmp_buf() on /gluster_bricks/vms2/vms2/.glusterfs/health_check returned ret is -1 error is Structure needs cleaning

It looks very similar to your issue, but in my case I don't use LVM cache and the brick disks are JBOD (although connected through a Broadcom / LSI MegaRAID SAS-3 3008 [Fury] (rev 02) controller).
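"Structure needs cleaning" is the EUCLEAN errno, which XFS reports when it detects on-disk corruption, so I plan to check the underlying filesystems roughly like this (just a sketch; /dev/sdX1 is a placeholder for the real brick device, and I assume the fs is mounted at /gluster_bricks/vms2):

# any XFS complaints in the kernel log around the failure time?
dmesg -T | grep -i xfs
journalctl -k | grep -i xfs

# try to read the file gluster's health check probes
dd if=/gluster_bricks/vms2/vms2/.glusterfs/health_check of=/dev/null bs=4k

# dry-run filesystem check (-n makes no modifications);
# the brick process must be stopped and the fs unmounted first
umount /gluster_bricks/vms2
xfs_repair -n /dev/sdX1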

For me the situation improved once I disabled brick multiplexing, but I don't see that in your volume configuration.
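For reference, brick multiplexing is a cluster-wide option; a quick sketch of how it can be checked and switched off (I believe it defaults to off on the 8.x series):

# show the current cluster-wide setting
gluster volume get all cluster.brick-multiplex

# disable it explicitly if it turns out to be enabled
gluster volume set all cluster.brick-multiplex off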

Probably important is your note...

When I kill the brick process and start it with "gluster v start x force" the
issue seems much more unlikely to occur, but when started from a fresh
reboot, or when killing the process and letting it be started by glusterd
(e.g. service glusterd start), the error seems to arise after a couple of
minutes.

...because in the oVirt list Jayme replied with this

https://lists.ovirt.org/archives/list/users@xxxxxxxxx/message/BZRONK53OGWSOPUSGQ76GIXUM7J6HHMJ/

and it looks to me like something you also observe.

Cheers, Jiri


Cheers Olaf

On Thu, 8 Jul 2021 at 12:28, Jiří Sléžka <jiri.slezka@xxxxxx> wrote:

    Hello gluster community,

    I am new to this list but have been using glusterfs for a long time as our SDS
    solution for storing 80+ TiB of data. I'm also using glusterfs for a small
    3-node HCI cluster with oVirt 4.4.6 and CentOS 8 (not Stream yet).
    The glusterfs version here is 8.5-2.el8.x86_64.

    From time to time (I believe) a random brick on a random host goes down
    because of the health-check. It looks like this:

    [root@ovirt-hci02 ~]# grep "posix_health_check" /var/log/glusterfs/bricks/*
    /var/log/glusterfs/bricks/gluster_bricks-vms2-vms2.log:[2021-07-07 07:13:37.408184] M [MSGID: 113075] [posix-helpers.c:2214:posix_health_check_thread_proc] 0-vms-posix: health-check failed, going down
    /var/log/glusterfs/bricks/gluster_bricks-vms2-vms2.log:[2021-07-07 07:13:37.408407] M [MSGID: 113075] [posix-helpers.c:2232:posix_health_check_thread_proc] 0-vms-posix: still alive! -> SIGTERM
    /var/log/glusterfs/bricks/gluster_bricks-vms2-vms2.log:[2021-07-07 16:11:14.518971] M [MSGID: 113075] [posix-helpers.c:2214:posix_health_check_thread_proc] 0-vms-posix: health-check failed, going down
    /var/log/glusterfs/bricks/gluster_bricks-vms2-vms2.log:[2021-07-07 16:11:14.519200] M [MSGID: 113075] [posix-helpers.c:2232:posix_health_check_thread_proc] 0-vms-posix: still alive! -> SIGTERM

    on other host

    [root@ovirt-hci01 ~]# grep "posix_health_check" /var/log/glusterfs/bricks/*
    /var/log/glusterfs/bricks/gluster_bricks-engine-engine.log:[2021-07-05 13:15:51.983327] M [MSGID: 113075] [posix-helpers.c:2214:posix_health_check_thread_proc] 0-engine-posix: health-check failed, going down
    /var/log/glusterfs/bricks/gluster_bricks-engine-engine.log:[2021-07-05 13:15:51.983728] M [MSGID: 113075] [posix-helpers.c:2232:posix_health_check_thread_proc] 0-engine-posix: still alive! -> SIGTERM
    /var/log/glusterfs/bricks/gluster_bricks-vms2-vms2.log:[2021-07-05 01:53:35.769129] M [MSGID: 113075] [posix-helpers.c:2214:posix_health_check_thread_proc] 0-vms-posix: health-check failed, going down
    /var/log/glusterfs/bricks/gluster_bricks-vms2-vms2.log:[2021-07-05 01:53:35.769819] M [MSGID: 113075] [posix-helpers.c:2232:posix_health_check_thread_proc] 0-vms-posix: still alive! -> SIGTERM

    I cannot link these errors to any storage/fs issue (in dmesg or
    /var/log/messages), and the brick devices look healthy (smartd).

    I can force-start the brick with

    gluster volume start vms|engine force

    and after some healing everything works fine for a few days.
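    (Roughly, the whole sequence looks like this; just a sketch, with the vms
    volume as the example:)

    # see which brick process is down
    gluster volume status vms

    # bring the missing brick back up
    gluster volume start vms force

    # watch the self-heal catch up until no entries remain
    gluster volume heal vms info summary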

    Did anybody observe this behavior?

    The vms volume has this structure (two bricks per host, each a separate
    JBOD SSD disk); the engine volume has one brick on each host...

    gluster volume info vms

    Volume Name: vms
    Type: Distributed-Replicate
    Volume ID: 52032ec6-99d4-4210-8fb8-ffbd7a1e0bf7
    Status: Started
    Snapshot Count: 0
    Number of Bricks: 2 x 3 = 6
    Transport-type: tcp
    Bricks:
    Brick1: 10.0.4.11:/gluster_bricks/vms/vms
    Brick2: 10.0.4.13:/gluster_bricks/vms/vms
    Brick3: 10.0.4.12:/gluster_bricks/vms/vms
    Brick4: 10.0.4.11:/gluster_bricks/vms2/vms2
    Brick5: 10.0.4.13:/gluster_bricks/vms2/vms2
    Brick6: 10.0.4.12:/gluster_bricks/vms2/vms2
    Options Reconfigured:
    cluster.granular-entry-heal: enable
    performance.stat-prefetch: off
    cluster.eager-lock: enable
    performance.io-cache: off
    performance.read-ahead: off
    performance.quick-read: off
    user.cifs: off
    network.ping-timeout: 30
    network.remote-dio: off
    performance.strict-o-direct: on
    performance.low-prio-threads: 32
    features.shard: on
    storage.owner-gid: 36
    storage.owner-uid: 36
    transport.address-family: inet
    storage.fips-mode-rchecksum: on
    nfs.disable: on
    performance.client-io-threads: off


    Cheers,

    Jiri




________



Community Meeting Calendar:

Schedule -
Every 2nd and 4th Tuesday at 14:30 IST / 09:00 UTC
Bridge: https://meet.google.com/cpu-eiue-hvk
Gluster-users mailing list
Gluster-users@xxxxxxxxxxx
https://lists.gluster.org/mailman/listinfo/gluster-users
