Re: random disconnects of peers

Did you try to tcpdump the connections to see who closes the connection, and how? A normal FIN-ACK, or a timeout? Maybe there is some network device in between? (The latter is less likely, since you said you can trigger the error with high load.)
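Something along these lines would show it; this is only a sketch and assumes the default glusterd management port 24007 (the brick ports would need their own filter):

    # record only FIN/RST segments on the glusterd management port (24007 by default)
    tcpdump -i any -nn -w glusterd-close.pcap \
        'tcp port 24007 and (tcp[tcpflags] & (tcp-fin|tcp-rst) != 0)'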

On Thu, 18 Aug 2022 at 12:38, <dpgluster@xxxxxxxxx> wrote:
I just niced all glusterfsd processes on all nodes to a value of -10.
The problem just occurred again, so it seems nicing the processes didn't help.
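Renicing the brick processes can be done with something along these lines (only a sketch; pgrep and renice are standard tools, and the -10 value matches what was used above):

    # give all running glusterfsd (brick) processes nice value -10 (needs root)
    renice -n -10 -p $(pgrep -d ' ' glusterfsd)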

On 18.08.2022 09:54, Péter Károly JUHÁSZ wrote:
> What if you renice the gluster processes to some negative value?
>
> On Thu, 18 Aug 2022 at 09:45, <dpgluster@xxxxxxxxx> wrote:
>
>> Hi folks,
>>
>> I am running multiple GlusterFS servers in multiple datacenters. Every
>> datacenter has basically the same setup: 3x storage nodes, 3x KVM
>> hypervisors (oVirt) and 2x HPE switches acting as one logical unit.
>> The NICs of all servers are attached to both switches in a two-NIC
>> bond, in case one of the switches has a major problem.
>> In one datacenter I have had strange problems with GlusterFS for
>> nearly half a year now, and I am not able to figure out the root cause.
>>
>> Environment
>> - glusterfs 9.5 running on CentOS 7.9.2009 (Core)
>> - three Gluster volumes, all configured with the same options
>>
>> root@storage-001# gluster volume info
>> Volume Name: g-volume-domain
>> Type: Replicate
>> Volume ID: ffd3baa5-6125-48da-a5a4-5ee3969cfbd0
>> Status: Started
>> Snapshot Count: 0
>> Number of Bricks: 1 x 3 = 3
>> Transport-type: tcp
>> Bricks:
>> Brick1: storage-003.my.domain:/mnt/bricks/g-volume-domain
>> Brick2: storage-002.my.domain:/mnt/bricks/g-volume-domain
>> Brick3: storage-001.my.domain:/mnt/bricks/g-volume-domain
>> Options Reconfigured:
>> client.event-threads: 4
>> performance.cache-size: 1GB
>> server.event-threads: 4
>> server.allow-insecure: On
>> network.ping-timeout: 42
>> performance.client-io-threads: off
>> nfs.disable: on
>> transport.address-family: inet
>> cluster.quorum-type: auto
>> network.remote-dio: enable
>> cluster.eager-lock: enable
>> performance.stat-prefetch: off
>> performance.io-cache: off
>> performance.quick-read: off
>> cluster.data-self-heal-algorithm: diff
>> storage.owner-uid: 36
>> storage.owner-gid: 36
>> performance.readdir-ahead: on
>> performance.read-ahead: off
>> client.ssl: off
>> server.ssl: off
>> auth.ssl-allow: storage-001.my.domain,storage-002.my.domain,storage-003.my.domain,hv-001.my.domain,hv-002.my.domain,hv-003.my.domain
>> ssl.cipher-list: HIGH:!SSLv2
>> cluster.shd-max-threads: 4
>> diagnostics.latency-measurement: on
>> diagnostics.count-fop-hits: on
>> performance.io-thread-count: 32
>>
>> Problem
>> The glusterd on one storage node seems to lose its connection to
>> another storage node. When the problem occurs, the first message in
>> /var/log/glusterfs/glusterd.log is always the following (variable
>> values are replaced with "x"):
>> [2022-08-16 05:01:28.615441 +0000] I [MSGID: 106004] [glusterd-handler.c:6427:__glusterd_peer_rpc_notify] 0-management: Peer <storage-00x.my.domain> (<xxxxxxxx-xxxx-xxxx-xxxx-xxxxxxxxxxxx>), in state <Peer in Cluster>, has disconnected from glusterd.
>>
>> I will post a filtered log for this specific error from each of my
>> storage nodes below.
>> storage-001:
>> root@storage-001# tail -n 100000 /var/log/glusterfs/glusterd.log | grep "has disconnected from" | grep "2022-08-16"
>> [2022-08-16 05:01:28.615441 +0000] I [MSGID: 106004] [glusterd-handler.c:6427:__glusterd_peer_rpc_notify] 0-management: Peer <storage-002.my.domain> (<8bb466f6-01d6-42f2-ba75-b7a1eebc5ac6>), in state <Peer in Cluster>, has disconnected from glusterd.
>> [2022-08-16 05:34:47.721060 +0000] I [MSGID: 106004] [glusterd-handler.c:6427:__glusterd_peer_rpc_notify] 0-management: Peer <storage-003.my.domain> (<a911feef-14c7-4740-a7ae-1d475a724c9f>), in state <Peer in Cluster>, has disconnected from glusterd.
>> [2022-08-16 06:01:22.472973 +0000] I [MSGID: 106004] [glusterd-handler.c:6427:__glusterd_peer_rpc_notify] 0-management: Peer <storage-002.my.domain> (<8bb466f6-01d6-42f2-ba75-b7a1eebc5ac6>), in state <Peer in Cluster>, has disconnected from glusterd.
>> root@storage-001#
>>
>> storage-002:
>> root@storage-002# tail -n 100000 /var/log/glusterfs/glusterd.log | grep "has disconnected from" | grep "2022-08-16"
>> [2022-08-16 05:01:34.502322 +0000] I [MSGID: 106004] [glusterd-handler.c:6427:__glusterd_peer_rpc_notify] 0-management: Peer <storage-003.my.domain> (<a911feef-14c7-4740-a7ae-1d475a724c9f>), in state <Peer in Cluster>, has disconnected from glusterd.
>> [2022-08-16 05:19:16.898406 +0000] I [MSGID: 106004] [glusterd-handler.c:6427:__glusterd_peer_rpc_notify] 0-management: Peer <storage-001.my.domain> (<c3e3941e-bb07-460e-8aea-03b17e2ddaff>), in state <Peer in Cluster>, has disconnected from glusterd.
>> [2022-08-16 06:01:22.462676 +0000] I [MSGID: 106004] [glusterd-handler.c:6427:__glusterd_peer_rpc_notify] 0-management: Peer <storage-001.my.domain> (<c3e3941e-bb07-460e-8aea-03b17e2ddaff>), in state <Peer in Cluster>, has disconnected from glusterd.
>> [2022-08-16 10:17:52.154501 +0000] I [MSGID: 106004] [glusterd-handler.c:6427:__glusterd_peer_rpc_notify] 0-management: Peer <storage-001.my.domain> (<c3e3941e-bb07-460e-8aea-03b17e2ddaff>), in state <Peer in Cluster>, has disconnected from glusterd.
>> root@storage-002#
>>
>> storage-003:
>> root@storage-003# tail -n 100000 /var/log/glusterfs/glusterd.log | grep "has disconnected from" | grep "2022-08-16"
>> [2022-08-16 05:24:18.225432 +0000] I [MSGID: 106004] [glusterd-handler.c:6427:__glusterd_peer_rpc_notify] 0-management: Peer <storage-002.my.domain> (<8bb466f6-01d6-42f2-ba75-b7a1eebc5ac6>), in state <Peer in Cluster>, has disconnected from glusterd.
>> [2022-08-16 05:27:22.683234 +0000] I [MSGID: 106004] [glusterd-handler.c:6427:__glusterd_peer_rpc_notify] 0-management: Peer <storage-002.my.domain> (<8bb466f6-01d6-42f2-ba75-b7a1eebc5ac6>), in state <Peer in Cluster>, has disconnected from glusterd.
>> [2022-08-16 10:17:50.624775 +0000] I [MSGID: 106004] [glusterd-handler.c:6427:__glusterd_peer_rpc_notify] 0-management: Peer <storage-002.my.domain> (<8bb466f6-01d6-42f2-ba75-b7a1eebc5ac6>), in state <Peer in Cluster>, has disconnected from glusterd.
>> root@storage-003#
>>
>> After this message it takes a couple of seconds (in the specific
>> example of 2022-08-16 it is one to four seconds) until the
>> disconnected node is reachable again:
>> [2022-08-16 05:01:32.110518 +0000] I [MSGID: 106493] [glusterd-rpc-ops.c:474:__glusterd_friend_add_cbk] 0-glusterd: Received ACC from uuid: 8bb466f6-01d6-42f2-ba75-b7a1eebc5ac6, host: storage-002.my.domain, port: 0
>>
>> This behavior is the same on all nodes: a Gluster node disconnects,
>> and a couple of seconds later the disconnected node is reachable
>> again. After the reconnect, glustershd is invoked and heals all the
>> data. How can I figure out the root cause of these random
>> disconnects?
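To verify that the heal really completes after each reconnect, the pending-heal queue can be checked with the standard heal-info command (volume name taken from the configuration above):

    # list entries still pending self-heal on the affected volume
    gluster volume heal g-volume-domain info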
>>
>> My debugging actions so far:
>> - checked dmesg -> zero messages around the time of the disconnects
>> - checked the switches -> no port down/up, no packet errors
>> - disabled SSL on the gluster volumes -> disconnects still occur
>> - checked for dropped/errored packets on the network interfaces of the storage nodes -> no drops, no errors
>> - ran a constant ping check between all nodes while a disconnect occurred -> zero packet loss, no high latencies
>> - temporarily deactivated one of the two interfaces that form the bond -> disconnects still occur (see the bond state check sketched below)
>> - updated Gluster from 6.x to 9.5 -> disconnects still occur
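The bonding driver also keeps its own per-slave failure counters, which would catch link flaps that neither dmesg nor the switch counters may show; a quick look, assuming the bond device is named bond0 (the name is an assumption):

    # per-slave MII status and link failure counts of the bond
    cat /proc/net/bonding/bond0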
>>
>> Important info: I can force this error to happen by putting high
>> I/O load on one of the gluster volumes.
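For reproducing it on demand, a sustained sequential write onto the FUSE mount is usually enough; a minimal sketch, assuming the volume is mounted under /mnt/g-volume-domain (the mount point is an assumption):

    # write ~10 GiB of sustained load onto the mounted volume, flushing at the end
    dd if=/dev/zero of=/mnt/g-volume-domain/loadtest.bin bs=1M count=10240 conv=fsync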
>>
>> I suspect there could be an issue with a network queue overflow or
>> something like that, but that theory does not match the results of
>> my ping check.
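One way to test that theory is to compare the kernel's TCP counters before and after a forced disconnect; retransmissions, listen-queue overflows and socket-buffer prunes show up there even when an ICMP ping stays clean:

    # TCP retransmissions, queue overflows and receive-buffer prunes since boot
    netstat -s | grep -Ei 'retrans|overflow|prune|collapse'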
>>
>> What would be your next step to debug this error?
>>
>> Thanks in advance!
________



Community Meeting Calendar:

Schedule -
Every 2nd and 4th Tuesday at 14:30 IST / 09:00 UTC
Bridge: https://meet.google.com/cpu-eiue-hvk
Gluster-users mailing list
Gluster-users@xxxxxxxxxxx
https://lists.gluster.org/mailman/listinfo/gluster-users
