Re: Hangs with qemu/libvirt/rbd when one host disappears

Hello Marcus,
On Tue, Dec 05, 2017 at 07:09:35PM +0100, Marcus Priesch wrote:
> Dear Ceph Users,
>
> First of all, big thanks to all the devs and people who made all this
> possible, Ceph is amazing!!!
>
> OK, so let me get to the point where I need your help:
>
> I have a cluster of 6 hosts with a mix of SSDs and HDDs.
>
> On 4 of the 6 hosts, 21 VMs are running in total with little to no
> workload (web, mail, Elasticsearch) for a couple of users.
>
> 4 nodes are running Ubuntu Server and 2 of them are running Proxmox
> (because we are now in the process of migrating to Proxmox).
>
> I am running Ceph Luminous (upgraded two weeks ago).
I guess you are running Ceph 12.2.1 (12.2.2 is out)? What does 'ceph versions' say?
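A quick way to check (since Luminous the daemon versions are reported
cluster-wide):

    ceph versions    # versions of all running mons/mgrs/osds
    ceph -v          # version of the locally installed ceph binaries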

>
> Ceph communication is carried out on a separate 1Gbit network, which we
> plan to upgrade to bonded 2x10Gbit during the next couple of weeks.
With 6 hosts you will need 10GbE, if only for the lower latency. A Ceph
recovery/rebalance might also max out the bandwidth of your link.

>
> I have two pools defined, which I only use for disk images via libvirt/rbd.
>
> The HDD pool has two replicas and is for large (~4TB) backup images, and
> the SSD pool has three replicas (two on SSD OSDs and one on an HDD OSD)
> for improved fail safety and faster access for "live data" and OS
> images.
Mixing spinners with SSDs is not recommended, as the spinners will slow
down the pools residing on that root.
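Since Luminous you can separate the two cleanly with device classes
instead of duplicate "-hdd" host entries; roughly like this (rule and
pool names are only placeholders for yours):

    # tag the OSDs, if not already auto-detected
    ceph osd crush set-device-class ssd osd.0
    ceph osd crush set-device-class hdd osd.1

    # one rule per device class, replicating across hosts
    ceph osd crush rule create-replicated ssd-rule default host ssd
    ceph osd crush rule create-replicated hdd-rule default host hdd

    # point each pool at the matching rule
    ceph osd pool set ssd-pool crush_rule ssd-rule
    ceph osd pool set hdd-pool crush_rule hdd-rule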

>
> In the crush map I have two different rules for the two pools so that
> replicas are always stored on different hosts - I have verified this and
> it works. It is coded via the "host" attribute (host node1-hdd and host
> node1 are actually both on the same host).
>
> So, now comes the interesting part:
>
> When I turn off one of the hosts that only run Ceph (let's say node7),
> after some time the VMs stall and hang until the host comes up again.
An I/O stall shouldn't happen. What is the min_size of your pools? What
does your 'ceph osd tree' look like?
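Both are quick to check (replace the pool names with yours):

    ceph osd pool get <pool> size       # number of replicas
    ceph osd pool get <pool> min_size   # replicas required to keep serving I/O
    ceph osd tree                       # host/OSD layout and up/down state

If, for example, the HDD pool runs with size=2 and min_size=2, losing a
single host already blocks I/O on the PGs that had a replica there.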
>
> When I don't turn the host on again, after some time the cluster starts
> rebalancing ...
Expected.

>
> Yesterday I noticed that after a couple of hours of rebalancing the
> VMs continued working again - I think that's when the cluster had
> finished rebalancing? Haven't really dug into this.
See above.

>
> Well, today we turned off the same host (node7) again and I got stuck
> PGs again.
>
> This time I did some investigation, and to my surprise I found the
> following in the output of ceph health detail:
>
> REQUEST_SLOW 17 slow requests are blocked > 32 sec
>     3 ops are blocked > 2097.15 sec
>     14 ops are blocked > 1048.58 sec
>     osds 9,10 have blocked requests > 1048.58 sec
>     osd.5 has blocked requests > 2097.15 sec
>
> I think the blocked requests are my problem, aren't they?
That is a symptom of the problem; see above.

>
> But none of OSDs 9, 10 or 5 are located on host7 - so can any of you
> tell me why the requests to these OSDs got stuck?
Those OSDs are waiting on other OSDs on host7; you can see that in the
Ceph logs, and 'ceph pg dump' shows you which PGs are located on which
OSDs.
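For a single PG you can also ask directly (the PG id is just an example):

    ceph pg map 2.1f           # up and acting OSD set of that PG
    ceph pg dump pgs_brief     # state plus up/acting sets of all PGs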

>
> I have one PG in state "stuck unclean" which has its replicas on OSDs
> 2, 3 and 15. 3 is on node7, but the first in the active set is 2 - I
> thought the "write op" should have gone there ... so why unclean? The
> manual states "For stuck unclean placement groups, there is usually
> something preventing recovery from completing, like unfound objects" but
> there aren't any ...
unclean - The placement group has not been clean for too long (i.e., it
hasn’t been able to completely recover from a previous failure).
http://docs.ceph.com/docs/master/rados/troubleshooting/troubleshooting-pg/#stuck-placement-groups
How utilized is your 1GbE link? I guess that with 6 nodes (3-4 OSDs)
your link might be maxed out, but you should see something in the Ceph
logs.
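To narrow it down, you could list the stuck PGs and watch the cluster
link while the host is down (sar/iftop are just the usual tools, assuming
they are installed; replace the interface name with yours):

    ceph pg dump_stuck unclean     # stuck PGs and the OSDs they map to
    ceph health detail             # OSDs with blocked/slow requests
    sar -n DEV 1                   # per-interface throughput, 1s interval
    iftop -i eth1                  # live traffic on the cluster interface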

>
> Do I have a configuration issue here (number of replicas?) or is this
> behavior simply because my cluster network is too slow?
>
> You can find detailed outputs here:
>
> 	https://owncloud.priesch.co.at/index.php/s/toYdGekchqpbydY
>
> I hope one of you can help me shed some light on this ...
>
> At least the point of it all is that a single host should be allowed to
> fail with the VMs continuing to run ... ;)
To get a better look at your setup, the crush map, 'ceph osd dump',
'ceph -s' and some log output would be nice.
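Something along these lines would cover it (output paths are arbitrary):

    ceph -s                                         # overall cluster state
    ceph osd dump                                   # pools, size/min_size, crush rules
    ceph osd tree                                   # host/OSD layout
    ceph osd getcrushmap -o /tmp/crush.bin          # grab the binary crush map
    crushtool -d /tmp/crush.bin -o /tmp/crush.txt   # decompile it to readable text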

Also, since you are moving to Proxmox, you might want to have a look at
the docs & the forum.

Docs: https://pve.proxmox.com/pve-docs/
Forum: https://forum.proxmox.com
Some more useful information on PVE + Ceph: https://forum.proxmox.com/threads/ceph-raw-usage-grows-by-itself.38395/#post-189842

>
> regards and thanks in advance,
> marcus.
>
> --
> Marcus Priesch
> open source consultant - solution provider
> www.priesch.co.at / office@xxxxxxxxxxxxx
> A-2122 Riedenthal, In Prandnern 31 / +43 650 62 72 870
> _______________________________________________
> ceph-users mailing list
> ceph-users@xxxxxxxxxxxxxx
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>

_______________________________________________
ceph-users mailing list
ceph-users@xxxxxxxxxxxxxx
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com



