KVM problems when rebalance occurs

nick <nick@xxxxxxx> · Wed, 06 Jan 2016 15:13:13 +0100

Heya,
we are running a ceph cluster (6 nodes, each with 10x 4TB HDDs + 2x SSDs for 
journals) in combination with KVM virtualization. All our virtual machine hard 
disks are stored on the ceph cluster. The ceph cluster was recently updated to 
the 'infernalis' release.

We are experiencing problems during cluster maintenance. A normal workflow for 
us looks like this:

- set the noout flag for the cluster
- stop all OSDs on one node
- update the node
- reboot the node
- start all OSDs
- wait for the backfilling to finish
- unset the noout flag
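
The last two steps above (wait for backfilling, then unset noout) could be 
scripted roughly as follows. This is only a minimal sketch, not our actual 
tooling: the helper names (is_settled, wait_for_backfill, finish_maintenance) 
and the list of "busy" state keywords are assumptions made for illustration. 
The only ceph commands used are 'ceph pg stat' and 'ceph osd unset noout'. 
Polling 'ceph health' for HEALTH_OK would not work here, because the noout 
flag itself keeps the cluster in HEALTH_WARN.

```python
import subprocess
import time

# Assumption: substrings in `ceph pg stat` output that mean the cluster
# is still recovering or backfilling. Adjust to taste.
BUSY_STATES = ("backfill", "recover", "degraded", "peering")


def is_settled(pg_stat_output):
    """Return True if the `ceph pg stat` text shows no busy PG state."""
    text = pg_stat_output.lower()
    return not any(state in text for state in BUSY_STATES)


def run(cmd):
    """Run a command, fail loudly on a non-zero exit, return its stdout."""
    result = subprocess.run(cmd, check=True, capture_output=True, text=True)
    return result.stdout


def wait_for_backfill(poll_seconds=30, fetch=lambda: run(["ceph", "pg", "stat"])):
    """Poll until no PG reports a busy state. `fetch` is injectable for testing."""
    while not is_settled(fetch()):
        time.sleep(poll_seconds)


def finish_maintenance():
    """Steps after the node has rebooted and its OSDs have been started."""
    wait_for_backfill()
    run(["ceph", "osd", "unset", "noout"])
```

finish_maintenance() would be called from an admin host once the OSDs on the 
updated node are running again.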

After we start all OSDs on the node again, the cluster backfills and tries to 
get all the OSDs in sync. At the beginning of this process we experience 
'stalls' in our running virtual machines. On some, the load rises to a very 
high value. On others, a running webserver responds only with 5xx HTTP codes. 
It takes around 5-6 minutes until everything is ok again. After those 5-6 
minutes the cluster is still backfilling, but the virtual machines behave 
normally again.

I already set the following parameters in ceph.conf on the nodes to favor 
user traffic over rebalance traffic:

"""
[osd]
osd max backfills = 1
osd backfill scan max = 8
osd backfill scan min = 4
osd recovery max active = 1
osd recovery op priority = 1
osd op threads = 8
"""
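
For reference, the throttling values can also be pushed to running OSDs with 
'ceph tell osd.* injectargs' instead of waiting for a restart to pick up 
ceph.conf. The following is a minimal sketch under some assumptions: the 
helper names (injectargs_command, apply_settings) are made up for 
illustration, and only three of the options above are included, since I am 
not certain that every option takes effect at runtime.

```python
import subprocess

# Subset of the [osd] section above; keys are written as in ceph.conf.
SETTINGS = {
    "osd max backfills": 1,
    "osd recovery max active": 1,
    "osd recovery op priority": 1,
}


def injectargs_command(settings, target="osd.*"):
    """Build the `ceph tell <target> injectargs '...'` command as an argv list."""
    # "osd max backfills" becomes "--osd-max-backfills 1"
    args = " ".join(
        "--{} {}".format(name.replace(" ", "-"), value)
        for name, value in settings.items()
    )
    return ["ceph", "tell", target, "injectargs", args]


def apply_settings(settings):
    """Send the settings to all running OSDs (requires a working ceph CLI)."""
    subprocess.run(injectargs_command(settings), check=True)
```

Values injected this way are lost when an OSD restarts, so they would still 
need to stay in ceph.conf as well.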

It helped a bit, but we are still experiencing the problems described above. 
It feels as if some virtual hard disks are locked for a short time. Our ceph 
nodes are using bonded 10G network interfaces for the 'OSD network', so I do 
not think that the network is a bottleneck.

After reading this blog post:
http://dachary.org/?p=2182
I wonder if there is really a 'read lock' during the object push.

Does anyone know more about this, or has anyone else had the same problems 
and been able to fix them?

Best Regards
Nick

-- 
Sebastian Nickel
Nine Internet Solutions AG, Albisriederstr. 243a, CH-8047 Zuerich
Tel +41 44 637 40 00 | Support +41 44 637 40 40 | www.nine.ch