Re: KVM lockups on Gluster 4.1.1

On Mon, Aug 20, 2018 at 6:20 PM, Walter Deignan <WDeignan@xxxxxxxxx> wrote:
I upgraded late last week to 4.1.2. Since then I've seen several POSIX health checks fail and bricks drop offline, but I'm not sure whether that's related or a separate root issue.

I haven't seen the issue described below re-occur on 4.1.2 yet, but it was intermittent to begin with, so I'll probably need to run for a week or more to be confident.


Thanks for the update! We will be trying to reproduce the issue, and also to root-cause it from code analysis, but if you can get us the brick logs from around the time this happens, that may fast-track the fix.

Thanks again,
Amar
 
-Walter Deignan
-Uline IT, Systems Architect




From:        "Claus Jeppesen" <cjeppesen@xxxxxxxxx>
To:        WDeignan@xxxxxxxxx
Cc:        gluster-users@xxxxxxxxxxx
Date:        08/20/2018 07:20 AM
Subject:        Re: KVM lockups on Gluster 4.1.1




I think I have seen this on our CentOS 7.5 systems using GlusterFS 4.1.1 as well (*) - did the upgrade to 4.1.2 help? I'm trying it now.

Thanx,

Claus.

(*) libvirt/qemu log:
[2018-08-19 16:45:54.275830] E [MSGID: 114031] [client-rpc-fops_v2.c:1352:client4_0_finodelk_cbk] 0-glu-vol01-lab-client-0: remote operation failed [Invalid argument]
[2018-08-19 16:45:54.276156] E [MSGID: 114031] [client-rpc-fops_v2.c:1352:client4_0_finodelk_cbk] 0-glu-vol01-lab-client-1: remote operation failed [Invalid argument]
[2018-08-19 16:45:54.276159] E [MSGID: 108010] [afr-lk-common.c:284:afr_unlock_inodelk_cbk] 0-glu-vol01-lab-replicate-0: path=(null) gfid=00000000-0000-0000-0000-000000000000: unlock failed on subvolume glu-vol01-lab-client-0 with lock owner 28ae497049560000 [Invalid argument]
[2018-08-19 16:45:54.276183] E [MSGID: 108010] [afr-lk-common.c:284:afr_unlock_inodelk_cbk] 0-glu-vol01-lab-replicate-0: path=(null) gfid=00000000-0000-0000-0000-000000000000: unlock failed on subvolume glu-vol01-lab-client-1 with lock owner 28ae497049560000 [Invalid argument]
[2018-08-19 17:16:03.690808] E [rpc-clnt.c:184:call_bail] 0-glu-vol01-lab-client-0: bailing out frame type(GlusterFS 4.x v1) op(FINODELK(30)) xid = 0x3071a5 sent = 2018-08-19 16:45:54.276560. timeout = 1800 for 192.168.13.131:49152
[2018-08-19 17:16:03.691113] E [MSGID: 114031] [client-rpc-fops_v2.c:1352:client4_0_finodelk_cbk] 0-glu-vol01-lab-client-0: remote operation failed [Transport endpoint is not connected]
[2018-08-19 17:46:03.855909] E [rpc-clnt.c:184:call_bail] 0-glu-vol01-lab-client-1: bailing out frame type(GlusterFS 4.x v1) op(FINODELK(30)) xid = 0x301d0f sent = 2018-08-19 17:16:03.691174. timeout = 1800 for 192.168.13.132:49152
[2018-08-19 17:46:03.856170] E [MSGID: 114031] [client-rpc-fops_v2.c:1352:client4_0_finodelk_cbk] 0-glu-vol01-lab-client-1: remote operation failed [Transport endpoint is not connected]
block I/O error in device 'drive-virtio-disk0': Operation not permitted (1)
... many repeats ... 
block I/O error in device 'drive-virtio-disk0': Operation not permitted (1)
[2018-08-19 18:16:04.022526] E [rpc-clnt.c:184:call_bail] 0-glu-vol01-lab-client-0: bailing out frame type(GlusterFS 4.x v1) op(FINODELK(30)) xid = 0x307221 sent = 2018-08-19 17:46:03.861005. timeout = 1800 for 192.168.13.131:49152
[2018-08-19 18:16:04.022788] E [MSGID: 114031] [client-rpc-fops_v2.c:1352:client4_0_finodelk_cbk] 0-glu-vol01-lab-client-0: remote operation failed [Transport endpoint is not connected]
[2018-08-19 18:46:04.195590] E [rpc-clnt.c:184:call_bail] 0-glu-vol01-lab-client-1: bailing out frame type(GlusterFS 4.x v1) op(FINODELK(30)) xid = 0x301d8a sent = 2018-08-19 18:16:04.022838. timeout = 1800 for 192.168.13.132:49152
[2018-08-19 18:46:04.195881] E [MSGID: 114031] [client-rpc-fops_v2.c:1352:client4_0_finodelk_cbk] 0-glu-vol01-lab-client-1: remote operation failed [Transport endpoint is not connected]
block I/O error in device 'drive-virtio-disk0': Operation not permitted (1)
block I/O error in device 'drive-virtio-disk0': Operation not permitted (1)
block I/O error in device 'drive-virtio-disk0': Operation not permitted (1)
block I/O error in device 'drive-virtio-disk0': Operation not permitted (1)
block I/O error in device 'drive-virtio-disk0': Operation not permitted (1)
qemu: terminating on signal 15 from pid 507
2018-08-19 19:36:59.065+0000: shutting down, reason=destroyed
2018-08-19 19:37:08.059+0000: starting up libvirt version: 3.9.0, package: 14.el7_5.6 (CentOS BuildSystem <http://bugs.centos.org>, 2018-06-27-14:13:57, x86-01.bsys.centos.org), qemu version: 1.5.3 (qemu-kvm-1.5.3-156.el7_5.3)


At 19:37 the VM was restarted.



On Wed, Aug 15, 2018 at 8:25 PM Walter Deignan <WDeignan@xxxxxxxxx> wrote:
I am using gluster to host KVM/QEMU images. I am seeing an intermittent issue where access to an image will hang. I have to do a lazy dismount of the gluster volume in order to break the lock and then reset the impacted virtual machine.

It happened again today and I caught the events below in the client side logs. Any thoughts on what might cause this? It seemed to begin after I upgraded from 3.12.10 to 4.1.1 a few weeks ago.


[2018-08-14 14:22:15.549501] E [MSGID: 114031] [client-rpc-fops_v2.c:1352:client4_0_finodelk_cbk] 2-gv1-client-4: remote operation failed [Invalid argument]

[2018-08-14 14:22:15.549576] E [MSGID: 114031] [client-rpc-fops_v2.c:1352:client4_0_finodelk_cbk] 2-gv1-client-5: remote operation failed [Invalid argument]

[2018-08-14 14:22:15.549583] E [MSGID: 108010] [afr-lk-common.c:284:afr_unlock_inodelk_cbk] 2-gv1-replicate-2: path=(null) gfid=00000000-0000-0000-0000-000000000000: unlock failed on subvolume gv1-client-4 with lock owner d89caca92b7f0000 [Invalid argument]

[2018-08-14 14:22:15.549615] E [MSGID: 108010] [afr-lk-common.c:284:afr_unlock_inodelk_cbk] 2-gv1-replicate-2: path=(null) gfid=00000000-0000-0000-0000-000000000000: unlock failed on subvolume gv1-client-5 with lock owner d89caca92b7f0000 [Invalid argument]

[2018-08-14 14:52:18.726219] E [rpc-clnt.c:184:call_bail] 2-gv1-client-4: bailing out frame type(GlusterFS 4.x v1) op(FINODELK(30)) xid = 0xc5e00 sent = 2018-08-14 14:22:15.699082. timeout = 1800 for 10.35.20.106:49159
[2018-08-14 14:52:18.726254] E [MSGID: 114031] [client-rpc-fops_v2.c:1352:client4_0_finodelk_cbk] 2-gv1-client-4: remote operation failed [Transport endpoint is not connected]

[2018-08-14 15:22:25.962546] E [rpc-clnt.c:184:call_bail] 2-gv1-client-5: bailing out frame type(GlusterFS 4.x v1) op(FINODELK(30)) xid = 0xc4a6d sent = 2018-08-14 14:52:18.726329. timeout = 1800 for 10.35.20.107:49164
[2018-08-14 15:22:25.962587] E [MSGID: 114031] [client-rpc-fops_v2.c:1352:client4_0_finodelk_cbk] 2-gv1-client-5: remote operation failed [Transport endpoint is not connected]

[2018-08-14 15:22:25.962618] W [MSGID: 108019] [afr-lk-common.c:601:is_blocking_locks_count_sufficient] 2-gv1-replicate-2: Unable to obtain blocking inode lock on even one child for gfid:24a48cae-53fe-4634-8fb7-0254c85ad672.

[2018-08-14 15:22:25.962668] W [fuse-bridge.c:1441:fuse_err_cbk] 0-glusterfs-fuse: 3715808: FSYNC() ERR => -1 (Transport endpoint is not connected)
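A side note on the timing in the call_bail lines: the "timeout = 1800" is the rpc frame timeout (the network.frame-timeout default of 30 minutes), and the gap between each frame's "sent" timestamp and the bail timestamp can be checked directly from the log. A minimal sketch in plain Python, with the timestamps copied verbatim from the first call_bail line above:

```python
from datetime import datetime

# Timestamps copied verbatim from the call_bail log line: the FINODELK
# frame was sent at 14:22:15 and bailed out at 14:52:18, i.e. right
# after crossing the 1800 s network.frame-timeout.
fmt = "%Y-%m-%d %H:%M:%S.%f"
bailed = datetime.strptime("2018-08-14 14:52:18.726219", fmt)
sent = datetime.strptime("2018-08-14 14:22:15.699082", fmt)
gap = (bailed - sent).total_seconds()
print(round(gap))  # 1803 - just over the 1800 s frame timeout
```

So each bail fires almost exactly 30 minutes after the stuck lock request, which matches the half-hour spacing of the repeated call_bail entries in both logs.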


Volume configuration -


Volume Name: gv1

Type: Distributed-Replicate

Volume ID: 66ad703e-3bae-4e79-a0b7-29ea38e8fcfc

Status: Started

Snapshot Count: 0

Number of Bricks: 5 x 2 = 10

Transport-type: tcp

Bricks:

Brick1: dc-vihi44:/gluster/bricks/megabrick/data

Brick2: dc-vihi45:/gluster/bricks/megabrick/data

Brick3: dc-vihi44:/gluster/bricks/brick1/data

Brick4: dc-vihi45:/gluster/bricks/brick1/data

Brick5: dc-vihi44:/gluster/bricks/brick2_1/data

Brick6: dc-vihi45:/gluster/bricks/brick2/data

Brick7: dc-vihi44:/gluster/bricks/brick3/data

Brick8: dc-vihi45:/gluster/bricks/brick3/data

Brick9: dc-vihi44:/gluster/bricks/brick4/data

Brick10: dc-vihi45:/gluster/bricks/brick4/data

Options Reconfigured:

cluster.min-free-inodes: 6%

performance.client-io-threads: off

nfs.disable: on

transport.address-family: inet

performance.quick-read: off

performance.read-ahead: off

performance.io-cache: off

performance.low-prio-threads: 32

network.remote-dio: enable

cluster.eager-lock: enable

cluster.server-quorum-type: server

cluster.data-self-heal-algorithm: full

cluster.locking-scheme: granular

cluster.shd-max-threads: 8

cluster.shd-wait-qlength: 10000

user.cifs: off

cluster.choose-local: off

features.shard: on

cluster.server-quorum-ratio: 51%
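To map the log names onto this layout: with replica 2, consecutive bricks in the volume-info order above form the AFR replica sets gv1-replicate-0 through gv1-replicate-4 (clients gv1-client-0 through gv1-client-9, in the same order), so the failing gv1-replicate-2 with gv1-client-4/gv1-client-5 in the log is the brick2_1/brick2 pair. A quick sketch of that pairing (brick list copied from the volume info; the consecutive-grouping rule is the standard gluster convention):

```python
# Bricks in "gluster volume info" order, copied from the output above.
bricks = [
    "dc-vihi44:/gluster/bricks/megabrick/data",
    "dc-vihi45:/gluster/bricks/megabrick/data",
    "dc-vihi44:/gluster/bricks/brick1/data",
    "dc-vihi45:/gluster/bricks/brick1/data",
    "dc-vihi44:/gluster/bricks/brick2_1/data",
    "dc-vihi45:/gluster/bricks/brick2/data",
    "dc-vihi44:/gluster/bricks/brick3/data",
    "dc-vihi45:/gluster/bricks/brick3/data",
    "dc-vihi44:/gluster/bricks/brick4/data",
    "dc-vihi45:/gluster/bricks/brick4/data",
]
replica = 2
# Consecutive bricks group into replica sets: gv1-replicate-0 .. gv1-replicate-4.
subvols = [bricks[i:i + replica] for i in range(0, len(bricks), replica)]
print(len(subvols))      # 5 pairs -> "Number of Bricks: 5 x 2 = 10"
print(subvols[2])        # gv1-replicate-2 = the pair named in the log
```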


-Walter Deignan
-Uline IT, Systems Architect
_______________________________________________
Gluster-users mailing list

Gluster-users@xxxxxxxxxxx
https://lists.gluster.org/mailman/listinfo/gluster-users


--
Claus Jeppesen
Manager, Network Services
Datto, Inc.
p +45 6170 5901 | Copenhagen Office
www.datto.com





--
Amar Tumballi (amarts)
