Hello, I have a replica 3 volume that has lost quorum twice this week, causing us a lot of pain. What seems to happen is that one of the SANs decides one of the other two peers has disconnected, and a few seconds later the other peer disconnects as well, so quorum is lost.
This hurts because we have 7 oVirt hosts connected to this gluster volume, and they never seem to reattach on their own. I was able to unmount the volume manually on the oVirt hosts and then remount it, and that got things working again (roughly the steps sketched below).
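For reference, the manual recovery on each oVirt host was roughly the following. The mount point is inferred from the fuse log file name shown further down, and the backup-volfile-servers option is just how we would normally mount it, so treat both as approximate:

# umount -l /rhev/data-center/mnt/glusterSD/10.4.16.11:gv1
# mount -t glusterfs -o backup-volfile-servers=10.4.16.19:10.4.16.12 10.4.16.11:/gv1 /rhev/data-center/mnt/glusterSD/10.4.16.11:gv1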
We have three SANs running GlusterFS 3.12.14-1 and nothing else.
# gluster volume info gv1
Volume Name: gv1
Type: Replicate
Volume ID: ea12f72d-a228-43ba-a360-4477cada292a
Status: Started
Snapshot Count: 0
Number of Bricks: 1 x 3 = 3
Transport-type: tcp
Bricks:
Brick1: 10.4.16.19:/glusterfs/data1/gv1
Brick2: 10.4.16.11:/glusterfs/data1/gv1
Brick3: 10.4.16.12:/glusterfs/data1/gv1
Options Reconfigured:
nfs.register-with-portmap: on
diagnostics.count-fop-hits: on
diagnostics.latency-measurement: on
cluster.self-heal-daemon: enable
cluster.server-quorum-type: server
cluster.quorum-type: auto
network.remote-dio: enable
cluster.eager-lock: enable
performance.stat-prefetch: off
performance.io-cache: off
performance.read-ahead: off
performance.quick-read: off
auth.allow: 10.4.16.*
nfs.rpc-auth-allow: 10.4.16.*
nfs.disable: off
server.allow-insecure: on
storage.owner-gid: 36
storage.owner-uid: 36
nfs.addr-namelookup: off
nfs.export-volumes: on
network.ping-timeout: 50
cluster.server-quorum-ratio: 51%
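If I understand the quorum options above correctly, with cluster.server-quorum-type set to server and cluster.server-quorum-ratio at 51%, each glusterd needs to see more than half of the 3 nodes up (i.e. at least 2, including itself), so the moment a node sees both of its peers disconnect it brings down its local brick. The effective values can be double-checked with:

# gluster volume get gv1 all | grep -i quorum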
The SANs produced the following logs this morning; in each case the first line shown is the first entry for 2019-06-07.
san3 seems to have an issue first:
[2019-06-07 14:23:20.670561] I [MSGID: 106004] [glusterd-handler.c:6317:__glusterd_peer_rpc_notify] 0-management: Peer <10.4.16.12> (<dfe01058-5bea-4b67-8859-382a2c8854f4>), in state <Peer in Cluster>, has disconnected from glusterd.
[2019-06-07 14:23:20.774127] I [MSGID: 106004] [glusterd-handler.c:6317:__glusterd_peer_rpc_notify] 0-management: Peer <10.4.16.11> (<0f3090ee-080b-4a6b-9964-0ca86d801469>), in state <Peer in Cluster>, has disconnected from glusterd.
san1 follows:
[2019-06-07 14:23:22.137405] I [MSGID: 106004] [glusterd-handler.c:6317:__glusterd_peer_rpc_notify] 0-management: Peer <10.4.16.12> (<dfe01058-5bea-4b67-8859-382a2c8854f4>), in state <Peer in Cluster>, has disconnected from glusterd.
[2019-06-07 14:23:22.229343] I [MSGID: 106004] [glusterd-handler.c:6317:__glusterd_peer_rpc_notify] 0-management: Peer <10.4.16.19> (<238af98a-d2f1-491d-a1f1-64ace4eb6d3d>), in state <Peer in Cluster>, has disconnected from glusterd.
san2 seems to be the last one standing, but then quorum is lost:
[2019-06-07 14:23:26.611435] I [MSGID: 106004] [glusterd-handler.c:6317:__glusterd_peer_rpc_notify] 0-management: Peer <10.4.16.11> (<0f3090ee-080b-4a6b-9964-0ca86d801469>), in state <Peer in Cluster>, has disconnected from glusterd.
[2019-06-07 14:23:26.714137] I [MSGID: 106004] [glusterd-handler.c:6317:__glusterd_peer_rpc_notify] 0-management: Peer <10.4.16.19> (<238af98a-d2f1-491d-a1f1-64ace4eb6d3d>), in state <Peer in Cluster>, has disconnected from glusterd.
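Since all three glusterds report their peers dropping within a few seconds of one another, my current suspicion is something on the network side affecting TCP port 24007 (the glusterd management port). The next time this happens I plan to check peer state and connectivity between the SANs, along these lines:

# gluster peer status
# nc -zv 10.4.16.11 24007
# ss -tn '( sport = :24007 or dport = :24007 )'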
On the oVirt hosts I see the following kind of entries in the fuse mount log for the volume, /var/log/glusterfs/rhev-data-center-mnt-glusterSD-10.4.16.11:gv1.log. The entries are pretty much the same on all 7 hosts.
hv6 seems to be the first host to complain:
[2019-06-07 14:23:22.190493] I [glusterfsd-mgmt.c:2424:mgmt_rpc_notify] 0-glusterfsd-mgmt: disconnected from remote-host: 10.4.16.11
[2019-06-07 14:23:22.190540] I [glusterfsd-mgmt.c:2464:mgmt_rpc_notify] 0-glusterfsd-mgmt: connecting to next volfile server 10.4.16.19
[2019-06-07 14:23:32.618071] I [glusterfsd-mgmt.c:2005:mgmt_getspec_cbk] 0-glusterfs: No change in volfile,continuing
[2019-06-07 14:23:33.651755] W [socket.c:719:__socket_rwv] 0-gv1-client-4: readv on 10.4.16.12:49152 failed (No data available)
[2019-06-07 14:23:33.651806] I [MSGID: 114018] [client.c:2288:client_rpc_notify] 0-gv1-client-4: disconnected from gv1-client-4. Client process will keep trying to connect to glusterd until brick's port is available
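Note that the readv failure above is against a brick port (10.4.16.12:49152) rather than the management port, so when this recurs I can also confirm whether the brick processes themselves stayed up and which ports they are listening on:

# gluster volume status gv1
# nc -zv 10.4.16.12 49152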
One thing I should point out here that is probably important: the SANs are running GlusterFS 3.12.14-1, but the oVirt hosts have been upgraded to 5.6-1. We stopped updating the GlusterFS version on the SANs after a previous version had a memory leak that caused the SANs to go down randomly, and 3.12.14-1 seems to have stopped that from happening. What I haven't been able to find out is whether there is an incompatibility between these versions that could cause this.
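In case it matters for the version question: if I understand op-version negotiation correctly, the 5.6 clients should be talking to the 3.12 servers at the cluster's operating version, which can be read from any of the SANs with:

# gluster volume get all cluster.op-version
# gluster volume get all cluster.max-op-version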
Are there any other steps I can take or logs I can collect to better identify what's causing this to happen?
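In the meantime, here is what I can collect from each SAN if it would help: /var/log/glusterfs/glusterd.log, the brick logs under /var/log/glusterfs/bricks/, and a glusterd state dump:

# gluster get-state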
Edward Clay