Hi Amar,
Unfortunately I do not have the GlusterFS brick logs anymore; however, I do have a hint:
I have 2 GlusterFS (4.1.1) volumes where I saw the issue - each has about 10-12 VMs active.
I also have 2 additional GlusterFS (4.1.1) volumes, but with only 3-4 VMs each, where I did not see the
issue (and they had been running for 1-2 months).
Thanx,
Claus.
P.S. We are talking about using the Gluster "URI" with qemu - I hope - e.g. a libvirt disk definition like:
<disk type='network' device='disk'>
<driver name='qemu' type='raw' cache='none' io='native'/>
<source protocol='gluster' name='glu-vol03-lab/install3'>
<host name='install2.vlan13' port='24007'/>
</source>
<target dev='vda' bus='virtio'/>
</disk>
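For reference, the same image can also be addressed directly with qemu's gluster:// URI syntax (host, port and volume/path taken from the XML above) - e.g. to sanity-check that qemu can reach it, assuming your qemu-img build includes the gluster block driver:

  qemu-img info gluster://install2.vlan13:24007/glu-vol03-lab/install3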
On Mon, Aug 20, 2018 at 5:39 PM Amar Tumballi <atumball@xxxxxxxxxx> wrote:
On Mon, Aug 20, 2018 at 6:20 PM, Walter Deignan <WDeignan@xxxxxxxxx> wrote:
I upgraded late last week to 4.1.2. Since then I've seen several posix health checks fail and bricks drop offline, but I'm not sure if that's related or a different root issue.
I haven't seen the issue described below recur on 4.1.2 yet, but it was intermittent to begin with, so I'll probably need to run for a week or more to be confident.
-Walter Deignan
-Uline IT, Systems Architect

Thanks for the update! We will be trying to reproduce the issue, and also root-cause it based on code analysis, but if you can get us brick logs from around the time this happens, it may fast-track the fix.
Thanks again,
Amar
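(For anyone gathering those: by default the brick logs live under /var/log/glusterfs/bricks/ on each brick host, one file per brick, named after the brick path with '/' replaced by '-'. So for a brick like /gluster/bricks/brick1/data, something along these lines should pull the relevant errors - exact filename will depend on your brick paths:

  grep ' E \[' /var/log/glusterfs/bricks/gluster-bricks-brick1-data.log

around the timestamps seen in the client log.)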
From: "Claus Jeppesen" <cjeppesen@xxxxxxxxx>
To: WDeignan@xxxxxxxxx
Cc: gluster-users@xxxxxxxxxxx
Date: 08/20/2018 07:20 AM
Subject: Re: KVM lockups on Gluster 4.1.1
I think I have seen this also on our CentOS 7.5 systems using GlusterFS 4.1.1 (*) - has an upgrade to 4.1.2 helped out? I'm trying this now.
Thanx,
Claus.
(*) libvirt/qemu log:
[2018-08-19 16:45:54.275830] E [MSGID: 114031] [client-rpc-fops_v2.c:1352:client4_0_finodelk_cbk] 0-glu-vol01-lab-client-0: remote operation failed [Invalid argument]
[2018-08-19 16:45:54.276156] E [MSGID: 114031] [client-rpc-fops_v2.c:1352:client4_0_finodelk_cbk] 0-glu-vol01-lab-client-1: remote operation failed [Invalid argument]
[2018-08-19 16:45:54.276159] E [MSGID: 108010] [afr-lk-common.c:284:afr_unlock_inodelk_cbk] 0-glu-vol01-lab-replicate-0: path=(null) gfid=00000000-0000-0000-0000-000000000000: unlock failed on subvolume glu-vol01-lab-client-0 with lock owner 28ae497049560000 [Invalid argument]
[2018-08-19 16:45:54.276183] E [MSGID: 108010] [afr-lk-common.c:284:afr_unlock_inodelk_cbk] 0-glu-vol01-lab-replicate-0: path=(null) gfid=00000000-0000-0000-0000-000000000000: unlock failed on subvolume glu-vol01-lab-client-1 with lock owner 28ae497049560000 [Invalid argument]
[2018-08-19 17:16:03.690808] E [rpc-clnt.c:184:call_bail] 0-glu-vol01-lab-client-0: bailing out frame type(GlusterFS 4.x v1) op(FINODELK(30)) xid = 0x3071a5 sent = 2018-08-19 16:45:54.276560. timeout = 1800 for 192.168.13.131:49152
[2018-08-19 17:16:03.691113] E [MSGID: 114031] [client-rpc-fops_v2.c:1352:client4_0_finodelk_cbk] 0-glu-vol01-lab-client-0: remote operation failed [Transport endpoint is not connected]
[2018-08-19 17:46:03.855909] E [rpc-clnt.c:184:call_bail] 0-glu-vol01-lab-client-1: bailing out frame type(GlusterFS 4.x v1) op(FINODELK(30)) xid = 0x301d0f sent = 2018-08-19 17:16:03.691174. timeout = 1800 for 192.168.13.132:49152
[2018-08-19 17:46:03.856170] E [MSGID: 114031] [client-rpc-fops_v2.c:1352:client4_0_finodelk_cbk] 0-glu-vol01-lab-client-1: remote operation failed [Transport endpoint is not connected]
block I/O error in device 'drive-virtio-disk0': Operation not permitted (1)
... many repeats ...
block I/O error in device 'drive-virtio-disk0': Operation not permitted (1)
[2018-08-19 18:16:04.022526] E [rpc-clnt.c:184:call_bail] 0-glu-vol01-lab-client-0: bailing out frame type(GlusterFS 4.x v1) op(FINODELK(30)) xid = 0x307221 sent = 2018-08-19 17:46:03.861005. timeout = 1800 for 192.168.13.131:49152
[2018-08-19 18:16:04.022788] E [MSGID: 114031] [client-rpc-fops_v2.c:1352:client4_0_finodelk_cbk] 0-glu-vol01-lab-client-0: remote operation failed [Transport endpoint is not connected]
[2018-08-19 18:46:04.195590] E [rpc-clnt.c:184:call_bail] 0-glu-vol01-lab-client-1: bailing out frame type(GlusterFS 4.x v1) op(FINODELK(30)) xid = 0x301d8a sent = 2018-08-19 18:16:04.022838. timeout = 1800 for 192.168.13.132:49152
[2018-08-19 18:46:04.195881] E [MSGID: 114031] [client-rpc-fops_v2.c:1352:client4_0_finodelk_cbk] 0-glu-vol01-lab-client-1: remote operation failed [Transport endpoint is not connected]
block I/O error in device 'drive-virtio-disk0': Operation not permitted (1)
block I/O error in device 'drive-virtio-disk0': Operation not permitted (1)
block I/O error in device 'drive-virtio-disk0': Operation not permitted (1)
block I/O error in device 'drive-virtio-disk0': Operation not permitted (1)
block I/O error in device 'drive-virtio-disk0': Operation not permitted (1)
qemu: terminating on signal 15 from pid 507
2018-08-19 19:36:59.065+0000: shutting down, reason=destroyed
2018-08-19 19:37:08.059+0000: starting up libvirt version: 3.9.0, package: 14.el7_5.6 (CentOS BuildSystem <http://bugs.centos.org>, 2018-06-27-14:13:57, x86-01.bsys.centos.org), qemu version: 1.5.3 (qemu-kvm-1.5.3-156.el7_5.3)
At 19:37 the VM was restarted.
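(Side note: the "timeout = 1800" in the call_bail lines appears to be the default network.frame-timeout of 30 minutes, which would explain the half-hour spacing of the bail messages above. It can be checked with - volume name inferred from the log prefix:

  gluster volume get glu-vol01-lab network.frame-timeout
)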
On Wed, Aug 15, 2018 at 8:25 PM Walter Deignan <WDeignan@xxxxxxxxx> wrote:
I am using gluster to host KVM/QEMU images. I am seeing an intermittent issue where access to an image will hang. I have to do a lazy dismount of the gluster volume in order to break the lock and then reset the impacted virtual machine.
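(Spelled out, that recovery is roughly the following - the mount point and domain name below are placeholders, not the actual ones:

  umount -l /mnt/gv1                             # lazy dismount to break the stuck lock
  mount -t glusterfs dc-vihi44:/gv1 /mnt/gv1     # remount the FUSE mount
  virsh destroy myvm && virsh start myvm         # reset the impacted VM
)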
It happened again today and I caught the events below in the client-side logs. Any thoughts on what might cause this? It seemed to begin after I upgraded from 3.12.10 to 4.1.1 a few weeks ago.
[2018-08-14 14:22:15.549501] E [MSGID: 114031] [client-rpc-fops_v2.c:1352:client4_0_finodelk_cbk] 2-gv1-client-4: remote operation failed [Invalid argument]
[2018-08-14 14:22:15.549576] E [MSGID: 114031] [client-rpc-fops_v2.c:1352:client4_0_finodelk_cbk] 2-gv1-client-5: remote operation failed [Invalid argument]
[2018-08-14 14:22:15.549583] E [MSGID: 108010] [afr-lk-common.c:284:afr_unlock_inodelk_cbk] 2-gv1-replicate-2: path=(null) gfid=00000000-0000-0000-0000-000000000000: unlock failed on subvolume gv1-client-4 with lock owner d89caca92b7f0000 [Invalid argument]
[2018-08-14 14:22:15.549615] E [MSGID: 108010] [afr-lk-common.c:284:afr_unlock_inodelk_cbk] 2-gv1-replicate-2: path=(null) gfid=00000000-0000-0000-0000-000000000000: unlock failed on subvolume gv1-client-5 with lock owner d89caca92b7f0000 [Invalid argument]
[2018-08-14 14:52:18.726219] E [rpc-clnt.c:184:call_bail] 2-gv1-client-4: bailing out frame type(GlusterFS 4.x v1) op(FINODELK(30)) xid = 0xc5e00 sent = 2018-08-14 14:22:15.699082. timeout = 1800 for 10.35.20.106:49159
[2018-08-14 14:52:18.726254] E [MSGID: 114031] [client-rpc-fops_v2.c:1352:client4_0_finodelk_cbk] 2-gv1-client-4: remote operation failed [Transport endpoint is not connected]
[2018-08-14 15:22:25.962546] E [rpc-clnt.c:184:call_bail] 2-gv1-client-5: bailing out frame type(GlusterFS 4.x v1) op(FINODELK(30)) xid = 0xc4a6d sent = 2018-08-14 14:52:18.726329. timeout = 1800 for 10.35.20.107:49164
[2018-08-14 15:22:25.962587] E [MSGID: 114031] [client-rpc-fops_v2.c:1352:client4_0_finodelk_cbk] 2-gv1-client-5: remote operation failed [Transport endpoint is not connected]
[2018-08-14 15:22:25.962618] W [MSGID: 108019] [afr-lk-common.c:601:is_blocking_locks_count_sufficient] 2-gv1-replicate-2: Unable to obtain blocking inode lock on even one child for gfid:24a48cae-53fe-4634-8fb7-0254c85ad672.
[2018-08-14 15:22:25.962668] W [fuse-bridge.c:1441:fuse_err_cbk] 0-glusterfs-fuse: 3715808: FSYNC() ERR => -1 (Transport endpoint is not connected)
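(If it recurs, a server-side statedump might show which client still holds the stuck inode lock - the dump files land under /var/run/gluster/ on the brick hosts by default, and the blocked FINODELK should show up against the gfid from the log above, e.g.:

  gluster volume statedump gv1
  grep -B2 -A8 '24a48cae-53fe-4634-8fb7-0254c85ad672' /var/run/gluster/*dump*
)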
Volume configuration -
Volume Name: gv1
Type: Distributed-Replicate
Volume ID: 66ad703e-3bae-4e79-a0b7-29ea38e8fcfc
Status: Started
Snapshot Count: 0
Number of Bricks: 5 x 2 = 10
Transport-type: tcp
Bricks:
Brick1: dc-vihi44:/gluster/bricks/megabrick/data
Brick2: dc-vihi45:/gluster/bricks/megabrick/data
Brick3: dc-vihi44:/gluster/bricks/brick1/data
Brick4: dc-vihi45:/gluster/bricks/brick1/data
Brick5: dc-vihi44:/gluster/bricks/brick2_1/data
Brick6: dc-vihi45:/gluster/bricks/brick2/data
Brick7: dc-vihi44:/gluster/bricks/brick3/data
Brick8: dc-vihi45:/gluster/bricks/brick3/data
Brick9: dc-vihi44:/gluster/bricks/brick4/data
Brick10: dc-vihi45:/gluster/bricks/brick4/data
Options Reconfigured:
cluster.min-free-inodes: 6%
performance.client-io-threads: off
nfs.disable: on
transport.address-family: inet
performance.quick-read: off
performance.read-ahead: off
performance.io-cache: off
performance.low-prio-threads: 32
network.remote-dio: enable
cluster.eager-lock: enable
cluster.server-quorum-type: server
cluster.data-self-heal-algorithm: full
cluster.locking-scheme: granular
cluster.shd-max-threads: 8
cluster.shd-wait-qlength: 10000
user.cifs: off
cluster.choose-local: off
features.shard: on
cluster.server-quorum-ratio: 51%
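(Aside: most of the reconfigured options above look like the stock virt profile for VM workloads, i.e. roughly equivalent to running:

  gluster volume set gv1 group virt

which applies the settings shipped in /var/lib/glusterd/groups/virt.)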
-Walter Deignan
-Uline IT, Systems Architect
--
Claus Jeppesen
Manager, Network Services
Datto, Inc.
p +45 6170 5901 | Copenhagen Office
www.datto.com
--
Amar Tumballi (amarts)
Claus Jeppesen
Manager, Network Services
Datto, Inc.
p +45 6170 5901 | Copenhagen Office
www.datto.com
_______________________________________________
Gluster-users mailing list
Gluster-users@xxxxxxxxxxx
https://lists.gluster.org/mailman/listinfo/gluster-users