Re: Hangs with qemu/libvirt/rbd when one host disappears

Hello Marcus,
On Thu, Dec 07, 2017 at 10:24:13AM +0100, Marcus Priesch wrote:
> Hello Alwin, Dear All,
>
> yesterday we finished cluster migration to proxmox and i had the same
> problem again:
>
> A couple of osd's down and out and a stuck request on a completely
> different osd which blocked the vm's.
>
> i tried to put this specific osd out (ceph osd out xx) and voila, the
> problem was gone. later on i put the osd back in and anything works as
> expected.
>
> in the meantime i read the post here:
>
> 	http://ceph.com/community/new-luminous-rados-improvements/
>
> where network problems with switches are also mentioned ...
>
> as the 1Gb network is completely busy in such a scenario i would assume
> maybe the problem is that some network communication got stuck somewhere
Yes, this will introduce delays for PGs being written to the OSDs,
especially as you have an uneven distribution of drives per server. The
servers with fewer OSDs will be hit more often, as your failure domain
is on the host level.
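You can check how the rule's failure domain is set and how the OSDs
are spread over the hosts, for example with (the rule name is just a
guess, use whatever your pools reference):

	ceph osd crush rule dump replicated_ssd
	ceph osd df tree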

> ...
>
> however all in all the transition from ubuntu / jewel to ubuntu
> /luminous to proxmox / luminous went rather flawlessly - despite the
> problem stated above - but i am aware that i am using ceph outside its
> requirements - so definitely *thumbs up* for ceph in general !!!!
>
> to your comments :
>
> >> i am running ceph luminous (have upgraded two weeks ago)
> > I guess, you are running on ceph 12.2.1 (12.2.2 is out)? What does ceph versions say?
>
> 12.2.1
12.2.2 has many improvements and bug fixes, it is worth upgrading.
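Once all daemons are upgraded and restarted, you can verify that
everything runs the same release with:

	ceph versions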

>
> >> ceph communication is carried out on a separate 1Gbit network where we
> >> plan to upgrade to bonded 2x10Gbit during the next couple of weeks.
> > With 6 hosts you will need 10GbE, alone for lower latency. Also a ceph
> > recovery/rebalance might max out the bandwidth of your link.
>
> yes, i think this is the problem ...
>
> > Mixing of spinners with SSDs is not recommended, as spinners will slow
> > down the pools residing on that root.
>
> why should this happen ? i would assume that osd's are separate parts
> running on hosts - not influencing each other ?
>
> otherwise i would need a different set of hosts for the ssd's and the
> hdd's ?
It is the replication rule that the pool uses; AFAICT, the ssd ruleset
has a mix of spinners and SSDs to choose from for PG placement.

In Luminous, device classes were introduced, which make it easy to
base rulesets upon them.

http://ceph.com/community/new-luminous-crush-device-classes/
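As a rough sketch (rule and pool names are only examples, double check
your crush map before changing anything), with device classes you
could create a rule that only selects SSDs and point the pool at it:

	ceph osd crush rule create-replicated ssd-only default host ssd
	ceph osd pool set <your-ssd-pool> crush_rule ssd-only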

>
> >> when i turn off one of the hosts (lets say node7) that do only ceph,
> >> after some time the vm's stall and hang until the host comes up again.
> > A stall of I/O shouldn't happen, what is your min_size of the pools? How
> > is your 'ceph osd tree' looking?
>
> you find it on the owncloud link ... at least ceph osd df tree
>
> >> but neither osd's 9, 10 or 5 are located on host7 - so can anyone of you
> >> tell me why the requests to this nodes got stuck ?
> > Those OSDs are waiting on other OSDs on host7, you can see that in the
> > ceph logs and you see with 'ceph pg dump' which pgs are located on which
> > OSDs.
>
> ok, you mean that they are waiting for operations to finish with the
> osd's that just went offline ?
>
> this should be a normal scenario when hardware fails - so this shouldn't
> lead to a stuck vm ... i assume ?
In the ceph-osd*.log it should be visible what the OSD was waiting
on.

Could you also upload a couple of OSD logs, so we can see what these
OSDs are waiting on?
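Something along these lines should show it (OSD id and log path are
the defaults, adapt them to your setup; the ceph daemon commands have
to be run on the node hosting that OSD):

	grep 'slow request' /var/log/ceph/ceph-osd.*.log
	ceph daemon osd.<id> dump_ops_in_flight
	ceph daemon osd.<id> dump_historic_ops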

>
> >> i have one pg in state "stuck unclean" which has its replicas on osd's
> >> 2, 3 and 15. 3 is on node7, but the first in the active set is 2 - i
> >> thought the "write op" should have gone there ... so why unclean ? the
> >> manual states "For stuck unclean placement groups, there is usually
> >> something preventing recovery from completing, like unfound objects" but
> >> there arent ...
> > unclean - The placement group has not been clean for too long (i.e., it
> > hasn’t been able to completely recover from a previous failure).
> > http://docs.ceph.com/docs/master/rados/troubleshooting/troubleshooting-pg/#stuck-placement-groups
>
> i know this ... there was no previous failure ... when i turn off some
> osd's i always get this after some time ...
Your crush ruleset for replicated_ssd has two steps and will replicate
onto spinners too. The PG 1.12b from your query is sitting on two SSDs
and a spinner. The primary is osd.2 (SSD), but it has to wait until the
PG is also stored on osd.3 & osd.15 (spinner).
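You can double check that yourself, the pg id is the one from your
query and the decompiled map shows the steps of the replicated_ssd
rule:

	ceph pg map 1.12b
	ceph osd getcrushmap -o crushmap.bin
	crushtool -d crushmap.bin -o crushmap.txt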

>
> > How is your 1GbE utilized? I guess, with 6 nodes (3-4 OSDs) your link
> > might be maxed out. But you should get something in the ceph
> > logs.
>
> yes, it is maxed out ... i suspect that maybe its a problem of the
> network hardware that some packets get lost/stuck somewhere ...
The high number of seconds those requests have been blocked strongly
indicates this.
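If you want to verify it before the 10GbE upgrade, a quick look at the
interface counters or a test between two nodes is usually enough (the
node address is a placeholder):

	sar -n DEV 1
	iperf3 -s                # on one node
	iperf3 -c <other-node>   # on another node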

>
> >> do i have a configuration issue here (amount of replicas?) or is this
> >> behavior simply just because my cluster network is too slow ?
> >>
> >> you can find detailed outputs here :
> >>
> >> 	https://owncloud.priesch.co.at/index.php/s/toYdGekchqpbydY
> >>
> >> i hope any of you can help me shed any light on this ...
> >>
> >> at least the point of all is that a single host should be allowed to
> >> fail and the vm's continue running ... ;)
> > To get a better look at your setup, a crush map, ceph osd dump, ceph -s
> > and some log output would be nice.
>
> you should find all in ceph_report.txt in the link above ...
>
> > Also you are moving to Proxmox, you might want to have look at the docs
> > & the forum.
> >
> > Docs: https://pve.proxmox.com/pve-docs/
> > Forum: https://forum.proxmox.com
>
> thanks, been there ...
>
> > Some more useful information on PVE + Ceph: https://forum.proxmox.com/threads/ceph-raw-usage-grows-by-itself.38395/#post-189842
>
> havent read this ...
>
> thanks a lot !
> marcus.
--
Cheers,
Alwin

_______________________________________________
ceph-users mailing list
ceph-users@xxxxxxxxxxxxxx
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com



