Re: KVM problems when rebalance occurs

Hi,

Also make sure you tune the debug logging settings. There is a lot on the mailing list about setting them all to low values (0/0).

Not sure how much it matters in Infernalis, but it made a big difference in previous versions.
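
For example (a partial, untested sketch from memory; the ML threads list the full set of subsystems), something like this under [global] in ceph.conf:

"""
[global]
# silence the most chatty debug subsystems (memory/log level = 0/0)
debug lockdep = 0/0
debug context = 0/0
debug crush = 0/0
debug buffer = 0/0
debug timer = 0/0
debug objecter = 0/0
debug rados = 0/0
debug rbd = 0/0
debug journaler = 0/0
debug objectcacher = 0/0
debug client = 0/0
debug osd = 0/0
debug filestore = 0/0
debug journal = 0/0
debug ms = 0/0
debug monc = 0/0
debug tp = 0/0
debug auth = 0/0
debug finisher = 0/0
debug heartbeatmap = 0/0
debug perfcounter = 0/0
debug asok = 0/0
debug throttle = 0/0
"""

You can also push a few of them into the running OSDs without a restart, e.g.:

"""
ceph tell osd.* injectargs '--debug-osd 0/0 --debug-ms 0/0 --debug-filestore 0/0'
"""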

Regards,
Josef

On 6 Jan 2016 18:16, "Robert LeBlanc" <robert@xxxxxxxxxxxxx> wrote:

There has been a lot of "discussion" about osd_backfill_scan_[min,max]
lately. My experience with Hammer has been the opposite of what people
have reported before: increasing those values has reduced the load of
recovery for us and prevented a lot of the disruption that backfilling
caused in our cluster. It does increase the time recovery takes (adding
a new node to the cluster used to take about 3-4 hours; now it takes
about 24 hours).

We are currently using these values, and they seem to work well for us:
osd_max_backfills = 1
osd_backfill_scan_min = 16
osd_recovery_max_active = 1
osd_backfill_scan_max = 32

I would be interested in your results if you try these values.
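
If you want to try them on a live cluster first, injecting them into the running OSDs should work (a sketch; adjust the values to taste and make the change permanent in ceph.conf afterwards):

"""
ceph tell osd.* injectargs '--osd-max-backfills 1 --osd-backfill-scan-min 16 --osd-recovery-max-active 1 --osd-backfill-scan-max 32'
"""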
----------------
Robert LeBlanc
PGP Fingerprint 79A2 9CA4 6CC4 45DD A904  C70E E654 3BB2 FA62 B9F1


On Wed, Jan 6, 2016 at 7:13 AM, nick <nick@xxxxxxx> wrote:
> Heya,
> we are using a Ceph cluster (6 nodes, each with 10x 4 TB HDDs + 2x SSDs for the
> journals) in combination with KVM virtualization. All our virtual machine hard
> disks are stored on the Ceph cluster. The cluster was recently updated to the
> 'infernalis' release.
>
> We are experiencing problems during cluster maintenance. A normal workflow for
> us looks like this (a rough shell sketch follows the list):
>
> - set the noout flag for the cluster
> - stop all OSDs on one node
> - update the node
> - reboot the node
> - start all OSDs
> - wait for the backfilling to finish
> - unset the noout flag
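>
> Roughly, as shell commands (a sketch; the exact service names depend on the
> distro and on whether the node runs systemd or sysvinit):
>
> """
> ceph osd set noout                # keep CRUSH from marking the node's OSDs out
> systemctl stop ceph-osd.target    # stop all OSDs on the node (systemd example)
> # ... update the node, reboot it, then:
> systemctl start ceph-osd.target   # start all OSDs again
> # watch 'ceph -s' until the backfilling has finished, then:
> ceph osd unset noout
> """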
>
> After we start all OSDs on the node again, the cluster backfills to bring the
> OSDs back in sync. At the beginning of this process we see 'stalls' in our
> running virtual machines. On some, the load rises to a very high value; on
> others, a running webserver responds only with 5xx HTTP codes. It takes around
> 5-6 minutes until everything is OK again. After those 5-6 minutes the cluster
> is still backfilling, but the virtual machines behave normally again.
>
> I have already set the following parameters in ceph.conf on the nodes to get a
> better ratio of rebalance traffic to user traffic:
>
> """
> [osd]
> osd max backfills = 1
> osd backfill scan max = 8
> osd backfill scan min = 4
> osd recovery max active = 1
> osd recovery op priority = 1
> osd op threads = 8
> """
>
> It helped a bit, but we are still experiencing the problems described above. It
> feels as if some virtual hard disks are locked for a short time. Our Ceph nodes
> use bonded 10G network interfaces for the 'OSD network', so I do not think the
> network is a bottleneck.
>
> After reading this blog post:
> http://dachary.org/?p=2182
> I wonder if there really is a 'read lock' during the object push.
>
> Does anyone know more about this, or have others seen the same problems and
> been able to fix them?
>
> Best Regards
> Nick
>
> --
> Sebastian Nickel
> Nine Internet Solutions AG, Albisriederstr. 243a, CH-8047 Zuerich
> Tel +41 44 637 40 00 | Support +41 44 637 40 40 | www.nine.ch
_______________________________________________
ceph-users mailing list
ceph-users@xxxxxxxxxxxxxx
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
