Dear Ceph Users,

First of all, big thanks to all the devs and people who made all this possible - Ceph is amazing!

OK, so let me get to the point where I need your help: I have a cluster of 6 hosts with a mix of SSDs and HDDs. On 4 of the 6 hosts, 21 VMs are running in total, with little to no workload (web, mail, Elasticsearch) for a couple of users. 4 nodes run Ubuntu Server and 2 run Proxmox (we are in the process of migrating to Proxmox). I am running Ceph Luminous (upgraded two weeks ago).

Ceph communication is carried out on a separate 1 Gbit network, which we plan to upgrade to bonded 2x10 Gbit during the next couple of weeks.

I have two pools defined, which I only use for disk images via libvirt/rbd. The HDD pool has two replicas and holds large (~4 TB) backup images; the SSD pool has three replicas (two on SSD OSDs and one on HDD OSDs) for improved fail safety and faster access to "live" data and OS images.

In the CRUSH map I have two different rules for the two pools, so that replicas are always stored on different hosts - I have verified this and it works. It is encoded via the "host" attribute (host node1-hdd and host node1 are actually the same physical host).

So, now comes the interesting part: when I turn off one of the hosts that only run Ceph (let's say node7), after some time the VMs stall and hang until the host comes up again. When I don't turn the host back on, the cluster starts rebalancing after a while ... Yesterday I saw that after a couple of hours of rebalancing the VMs started working again - I think that was when the cluster had finished rebalancing? I haven't really dug into this.

Well, today we turned off the same host (node7) again and I got stuck PGs again. This time I did some investigation, and to my surprise I found the following in the output of "ceph health detail":

    REQUEST_SLOW 17 slow requests are blocked > 32 sec
        3 ops are blocked > 2097.15 sec
        14 ops are blocked > 1048.58 sec
        osds 9,10 have blocked requests > 1048.58 sec
        osd.5 has blocked requests > 2097.15 sec

I think the blocked requests are my problem, aren't they? But none of OSDs 9, 10 or 5 are located on node7 - so can any of you tell me why the requests to these OSDs got stuck?

I also have one PG in state "stuck unclean" which has its replicas on OSDs 2, 3 and 15. OSD 3 is on node7, but the first OSD in the acting set is 2 - I thought the write op should have gone there ... so why unclean? The manual states "For stuck unclean placement groups, there is usually something preventing recovery from completing, like unfound objects", but there aren't any ...

Do I have a configuration issue here (number of replicas?), or is this behavior simply because my cluster network is too slow?

You can find detailed outputs here: https://owncloud.priesch.co.at/index.php/s/toYdGekchqpbydY

I hope some of you can help me shed some light on this ... After all, the point of it all is that a single host should be allowed to fail while the VMs keep running ... ;)

Regards and thanks in advance,
Marcus.

--
Marcus Priesch
open source consultant - solution provider
www.priesch.co.at / office@xxxxxxxxxxxxx
A-2122 Riedenthal, In Prandnern 31 / +43 650 62 72 870
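
PS: For reference, the rule for the SSD pool is built roughly along these lines (simplified and from memory - the bucket names here are just placeholders, the real CRUSH map is in the linked dump):

    rule ssd_then_hdd {
        id 1
        type replicated
        min_size 1
        max_size 10
        # first two replicas on (separate) SSD hosts
        step take ssd
        step chooseleaf firstn 2 type host
        step emit
        # remaining replica on an HDD host
        step take hdd
        step chooseleaf firstn -2 type host
        step emit
    }

The HDD pool uses a plain rule with "chooseleaf firstn 0 type host" over the HDD hosts, so its two replicas also land on different hosts.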
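
PPS: In case it helps, the commands I used to dig into this were roughly the following (<pgid> and <pool> are placeholders, full outputs are in the linked dump):

    ceph health detail                 # shows the REQUEST_SLOW / blocked ops above
    ceph pg dump_stuck unclean         # lists the stuck PG and its acting set
    ceph pg <pgid> query               # up/acting set and recovery state of one PG
    ceph osd tree                      # which host each OSD lives on
    ceph osd pool get <pool> size
    ceph osd pool get <pool> min_size

I am also wondering whether min_size on the two-replica HDD pool plays a role here - my guess is that if min_size were equal to size, writes would block while one copy is offline - but I may be wrong about that.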