slow requests and unresponsive kvm guests after upgrading ceph cluster and os, please help debugging

Jelle de Jong <jelledejong@xxxxxxxxxxxxx> · Mon, 6 Jan 2020 20:44:16 +0100

Hello everybody,

I have issues with very slow requests on a simple three-node cluster
here: four WDC enterprise disks and an Intel Optane NVMe journal on
identical high-memory nodes, with 10 Gb networking.

It was all working well with Ceph Hammer on Debian Wheezy, but I wanted
to upgrade to a supported version and test out bluestore as well. So I
upgraded to Luminous on Debian Stretch and used ceph-volume to create
bluestore OSDs. Everything went downhill from there.

I went back to filestore on all nodes, but I still have slow requests
and I cannot pinpoint the cause. I tried to debug and gathered
information to look at:

https://paste.debian.net/hidden/acc5d204/

First I thought it was the rebalancing that was making things slow.
Then I thought it might be the LVM layer, so I recreated the nodes
without LVM by switching from ceph-volume to ceph-disk: no difference,
still slow requests. Then I changed back from bluestore to filestore,
but the cluster is still very slow. Then I thought it was a CPU
scheduling issue and downgraded from the 5.x kernel, and CPU
performance is at full speed again. I thought maybe there was something
wrong with a single OSD and tried taking them out one by one, but slow
requests still show up and client performance from the VMs is really
poor.

It feels as if a burst of small requests keeps blocking for a while and
then recovers again.
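
Instead of taking OSDs out one by one, the per-OSD latencies could be
ranked to see whether a single disk stands out. Below is a minimal
sketch, not a tested tool: it assumes the JSON from
`ceph osd perf --format json` has an "osd_perf_infos" list (at the top
level on Luminous, nested under "osdstats" on some later releases) and
that each entry carries "commit_latency_ms" and "apply_latency_ms". The
function names are made up for illustration.

```python
import json
import subprocess


def rank_osds_by_latency(perf):
    """Return (osd_id, commit_ms, apply_ms) tuples, slowest commit first."""
    # Assumption: the list sits at the top level on Luminous and under
    # "osdstats" on some later releases; handle both layouts.
    infos = perf.get("osd_perf_infos")
    if infos is None:
        infos = perf.get("osdstats", {}).get("osd_perf_infos", [])
    rows = [
        (
            osd["id"],
            osd["perf_stats"]["commit_latency_ms"],
            osd["perf_stats"]["apply_latency_ms"],
        )
        for osd in infos
    ]
    # Sort on commit latency, highest first, so outliers come out on top.
    return sorted(rows, key=lambda row: row[1], reverse=True)


def fetch_osd_perf():
    """Run `ceph osd perf` on a node with an admin keyring and parse it."""
    out = subprocess.check_output(["ceph", "osd", "perf", "--format", "json"])
    return json.loads(out.decode("utf-8"))


# Usage on a cluster node (not run here):
#   for osd_id, commit_ms, apply_ms in rank_osds_by_latency(fetch_osd_perf()):
#       print("osd.{}: commit {} ms, apply {} ms".format(osd_id, commit_ms, apply_ms))
```

Sampling this a few times during one of the blocking bursts would show
whether the same OSD is always at the top or the latency is spread
across all of them.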

Many thanks for helping out and looking at the URL.

If there are options I should tune for an HDD with NVMe journal setup,
please share.

Jelle
_______________________________________________
ceph-users mailing list
ceph-users@xxxxxxxxxxxxxx
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com