Re: KVM lockups on Gluster 4.1.1

On Mon, Aug 20, 2018 at 6:20 PM, Walter Deignan <WDeignan@xxxxxxxxx> wrote:
I upgraded late last week to 4.1.2. Since then I've seen several POSIX health checks fail and bricks drop offline, but I'm not sure whether that's related or a separate root issue.

I haven't seen the issue described below re-occur on 4.1.2 yet, but it was intermittent to begin with, so I'll probably need to run for a week or more to be confident.


Thanks for the update! We will be trying to reproduce the issue, and also to root-cause it from code analysis, but if you can get us the brick logs from around the time this happens, that may fast-track the fix.

Thanks again,
Amar
 
-Walter Deignan
-Uline IT, Systems Architect




From:        "Claus Jeppesen" <cjeppesen@xxxxxxxxx>
To:        WDeignan@xxxxxxxxx
Cc:        gluster-users@xxxxxxxxxxx
Date:        08/20/2018 07:20 AM
Subject:        Re: KVM lockups on Gluster 4.1.1




I think I have seen this on our CentOS 7.5 systems using GlusterFS 4.1.1 as well (*) - did the upgrade to 4.1.2 help? I'm trying it now.

Thanx,

Claus.

(*) libvirt/qemu log:
[2018-08-19 16:45:54.275830] E [MSGID: 114031] [client-rpc-fops_v2.c:1352:client4_0_finodelk_cbk] 0-glu-vol01-lab-client-0: remote operation failed [Invalid argument]
[2018-08-19 16:45:54.276156] E [MSGID: 114031] [client-rpc-fops_v2.c:1352:client4_0_finodelk_cbk] 0-glu-vol01-lab-client-1: remote operation failed [Invalid argument]
[2018-08-19 16:45:54.276159] E [MSGID: 108010] [afr-lk-common.c:284:afr_unlock_inodelk_cbk] 0-glu-vol01-lab-replicate-0: path=(null) gfid=00000000-0000-0000-0000-000000000000: unlock failed on subvolume glu-vol01-lab-client-0 with lock owner 28ae497049560000 [Invalid argument]
[2018-08-19 16:45:54.276183] E [MSGID: 108010] [afr-lk-common.c:284:afr_unlock_inodelk_cbk] 0-glu-vol01-lab-replicate-0: path=(null) gfid=00000000-0000-0000-0000-000000000000: unlock failed on subvolume glu-vol01-lab-client-1 with lock owner 28ae497049560000 [Invalid argument]
[2018-08-19 17:16:03.690808] E [rpc-clnt.c:184:call_bail] 0-glu-vol01-lab-client-0: bailing out frame type(GlusterFS 4.x v1) op(FINODELK(30)) xid = 0x3071a5 sent = 2018-08-19 16:45:54.276560. timeout = 1800 for 192.168.13.131:49152
[2018-08-19 17:16:03.691113] E [MSGID: 114031] [client-rpc-fops_v2.c:1352:client4_0_finodelk_cbk] 0-glu-vol01-lab-client-0: remote operation failed [Transport endpoint is not connected]
[2018-08-19 17:46:03.855909] E [rpc-clnt.c:184:call_bail] 0-glu-vol01-lab-client-1: bailing out frame type(GlusterFS 4.x v1) op(FINODELK(30)) xid = 0x301d0f sent = 2018-08-19 17:16:03.691174. timeout = 1800 for 192.168.13.132:49152
[2018-08-19 17:46:03.856170] E [MSGID: 114031] [client-rpc-fops_v2.c:1352:client4_0_finodelk_cbk] 0-glu-vol01-lab-client-1: remote operation failed [Transport endpoint is not connected]
block I/O error in device 'drive-virtio-disk0': Operation not permitted (1)
... many repeats ... 
block I/O error in device 'drive-virtio-disk0': Operation not permitted (1)
[2018-08-19 18:16:04.022526] E [rpc-clnt.c:184:call_bail] 0-glu-vol01-lab-client-0: bailing out frame type(GlusterFS 4.x v1) op(FINODELK(30)) xid = 0x307221 sent = 2018-08-19 17:46:03.861005. timeout = 1800 for 192.168.13.131:49152
[2018-08-19 18:16:04.022788] E [MSGID: 114031] [client-rpc-fops_v2.c:1352:client4_0_finodelk_cbk] 0-glu-vol01-lab-client-0: remote operation failed [Transport endpoint is not connected]
[2018-08-19 18:46:04.195590] E [rpc-clnt.c:184:call_bail] 0-glu-vol01-lab-client-1: bailing out frame type(GlusterFS 4.x v1) op(FINODELK(30)) xid = 0x301d8a sent = 2018-08-19 18:16:04.022838. timeout = 1800 for 192.168.13.132:49152
[2018-08-19 18:46:04.195881] E [MSGID: 114031] [client-rpc-fops_v2.c:1352:client4_0_finodelk_cbk] 0-glu-vol01-lab-client-1: remote operation failed [Transport endpoint is not connected]
block I/O error in device 'drive-virtio-disk0': Operation not permitted (1)
block I/O error in device 'drive-virtio-disk0': Operation not permitted (1)
block I/O error in device 'drive-virtio-disk0': Operation not permitted (1)
block I/O error in device 'drive-virtio-disk0': Operation not permitted (1)
block I/O error in device 'drive-virtio-disk0': Operation not permitted (1)
qemu: terminating on signal 15 from pid 507
2018-08-19 19:36:59.065+0000: shutting down, reason=destroyed
2018-08-19 19:37:08.059+0000: starting up libvirt version: 3.9.0, package: 14.el7_5.6 (CentOS BuildSystem <http://bugs.centos.org>, 2018-06-27-14:13:57, x86-01.bsys.centos.org), qemu version: 1.5.3 (qemu-kvm-1.5.3-156.el7_5.3)


At 19:37 the VM was restarted.



On Wed, Aug 15, 2018 at 8:25 PM Walter Deignan <WDeignan@xxxxxxxxx> wrote:
I am using gluster to host KVM/QEMU images. I am seeing an intermittent issue where access to an image will hang. I have to do a lazy dismount of the gluster volume in order to break the lock and then reset the impacted virtual machine.

It happened again today and I caught the events below in the client side logs. Any thoughts on what might cause this? It seemed to begin after I upgraded from 3.12.10 to 4.1.1 a few weeks ago.


[2018-08-14 14:22:15.549501] E [MSGID: 114031] [client-rpc-fops_v2.c:1352:client4_0_finodelk_cbk] 2-gv1-client-4: remote operation failed [Invalid argument]

[2018-08-14 14:22:15.549576] E [MSGID: 114031] [client-rpc-fops_v2.c:1352:client4_0_finodelk_cbk] 2-gv1-client-5: remote operation failed [Invalid argument]

[2018-08-14 14:22:15.549583] E [MSGID: 108010] [afr-lk-common.c:284:afr_unlock_inodelk_cbk] 2-gv1-replicate-2: path=(null) gfid=00000000-0000-0000-0000-000000000000: unlock failed on subvolume gv1-client-4 with lock owner d89caca92b7f0000 [Invalid argument]

[2018-08-14 14:22:15.549615] E [MSGID: 108010] [afr-lk-common.c:284:afr_unlock_inodelk_cbk] 2-gv1-replicate-2: path=(null) gfid=00000000-0000-0000-0000-000000000000: unlock failed on subvolume gv1-client-5 with lock owner d89caca92b7f0000 [Invalid argument]

[2018-08-14 14:52:18.726219] E [rpc-clnt.c:184:call_bail] 2-gv1-client-4: bailing out frame type(GlusterFS 4.x v1) op(FINODELK(30)) xid = 0xc5e00 sent = 2018-08-14 14:22:15.699082. timeout = 1800 for 10.35.20.106:49159
[2018-08-14 14:52:18.726254] E [MSGID: 114031] [client-rpc-fops_v2.c:1352:client4_0_finodelk_cbk] 2-gv1-client-4: remote operation failed [Transport endpoint is not connected]

[2018-08-14 15:22:25.962546] E [rpc-clnt.c:184:call_bail] 2-gv1-client-5: bailing out frame type(GlusterFS 4.x v1) op(FINODELK(30)) xid = 0xc4a6d sent = 2018-08-14 14:52:18.726329. timeout = 1800 for 10.35.20.107:49164
[2018-08-14 15:22:25.962587] E [MSGID: 114031] [client-rpc-fops_v2.c:1352:client4_0_finodelk_cbk] 2-gv1-client-5: remote operation failed [Transport endpoint is not connected]

[2018-08-14 15:22:25.962618] W [MSGID: 108019] [afr-lk-common.c:601:is_blocking_locks_count_sufficient] 2-gv1-replicate-2: Unable to obtain blocking inode lock on even one child for gfid:24a48cae-53fe-4634-8fb7-0254c85ad672.

[2018-08-14 15:22:25.962668] W [fuse-bridge.c:1441:fuse_err_cbk] 0-glusterfs-fuse: 3715808: FSYNC() ERR => -1 (Transport endpoint is not connected)
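A side note on the timing in the call_bail lines: the "timeout = 1800" is the rpc frame timeout (the network.frame-timeout default of 30 minutes), and the gap between each frame's "sent" timestamp and the bail timestamp can be checked directly from the log. A minimal sketch in plain Python, with the timestamps copied verbatim from the first call_bail line above:

```python
from datetime import datetime

# Timestamps copied verbatim from the call_bail log line: the FINODELK
# frame was sent at 14:22:15 and bailed out at 14:52:18, i.e. right
# after crossing the 1800 s network.frame-timeout.
fmt = "%Y-%m-%d %H:%M:%S.%f"
bailed = datetime.strptime("2018-08-14 14:52:18.726219", fmt)
sent = datetime.strptime("2018-08-14 14:22:15.699082", fmt)
gap = (bailed - sent).total_seconds()
print(round(gap))  # 1803 - just over the 1800 s frame timeout
```

So each bail fires almost exactly 30 minutes after the stuck lock request, which matches the half-hour spacing of the repeated call_bail entries in both logs.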


Volume configuration -


Volume Name: gv1

Type: Distributed-Replicate

Volume ID: 66ad703e-3bae-4e79-a0b7-29ea38e8fcfc

Status: Started

Snapshot Count: 0

Number of Bricks: 5 x 2 = 10

Transport-type: tcp

Bricks:

Brick1: dc-vihi44:/gluster/bricks/megabrick/data

Brick2: dc-vihi45:/gluster/bricks/megabrick/data

Brick3: dc-vihi44:/gluster/bricks/brick1/data

Brick4: dc-vihi45:/gluster/bricks/brick1/data

Brick5: dc-vihi44:/gluster/bricks/brick2_1/data

Brick6: dc-vihi45:/gluster/bricks/brick2/data

Brick7: dc-vihi44:/gluster/bricks/brick3/data

Brick8: dc-vihi45:/gluster/bricks/brick3/data

Brick9: dc-vihi44:/gluster/bricks/brick4/data

Brick10: dc-vihi45:/gluster/bricks/brick4/data

Options Reconfigured:

cluster.min-free-inodes: 6%

performance.client-io-threads: off

nfs.disable: on

transport.address-family: inet

performance.quick-read: off

performance.read-ahead: off

performance.io-cache: off

performance.low-prio-threads: 32

network.remote-dio: enable

cluster.eager-lock: enable

cluster.server-quorum-type: server

cluster.data-self-heal-algorithm: full

cluster.locking-scheme: granular

cluster.shd-max-threads: 8

cluster.shd-wait-qlength: 10000

user.cifs: off

cluster.choose-local: off

features.shard: on

cluster.server-quorum-ratio: 51%
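To map the log names onto this layout: with replica 2, consecutive bricks in the volume-info order above form the AFR replica sets gv1-replicate-0 through gv1-replicate-4 (clients gv1-client-0 through gv1-client-9, in the same order), so the failing gv1-replicate-2 with gv1-client-4/gv1-client-5 in the log is the brick2_1/brick2 pair. A quick sketch of that pairing (brick list copied from the volume info; the consecutive-grouping rule is the standard gluster convention):

```python
# Bricks in "gluster volume info" order, copied from the output above.
bricks = [
    "dc-vihi44:/gluster/bricks/megabrick/data",
    "dc-vihi45:/gluster/bricks/megabrick/data",
    "dc-vihi44:/gluster/bricks/brick1/data",
    "dc-vihi45:/gluster/bricks/brick1/data",
    "dc-vihi44:/gluster/bricks/brick2_1/data",
    "dc-vihi45:/gluster/bricks/brick2/data",
    "dc-vihi44:/gluster/bricks/brick3/data",
    "dc-vihi45:/gluster/bricks/brick3/data",
    "dc-vihi44:/gluster/bricks/brick4/data",
    "dc-vihi45:/gluster/bricks/brick4/data",
]
replica = 2
# Consecutive bricks group into replica sets: gv1-replicate-0 .. gv1-replicate-4.
subvols = [bricks[i:i + replica] for i in range(0, len(bricks), replica)]
print(len(subvols))      # 5 pairs -> "Number of Bricks: 5 x 2 = 10"
print(subvols[2])        # gv1-replicate-2 = the pair named in the log
```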


-Walter Deignan
-Uline IT, Systems Architect
_______________________________________________
Gluster-users mailing list

Gluster-users@xxxxxxxxxxx
https://lists.gluster.org/mailman/listinfo/gluster-users


--
Claus Jeppesen
Manager, Network Services
Datto, Inc.
p +45 6170 5901 | Copenhagen Office
www.datto.com





--
Amar Tumballi (amarts)
