Re: KVM problems when rebalance occurs

Hi,
benchmarking is done via fio with different block sizes. I compared the results
with benchmarks I did before the ceph.conf change and saw very similar numbers.
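
To give a bit more detail, the fio jobs run inside a test VM and look roughly
like this (just a sketch; /dev/vdb and the runtime are placeholders, and I
repeat the run with block sizes from 4k up to 4M):

"""
[global]
ioengine=libaio
direct=1
time_based
runtime=60
group_reporting

[randwrite]
rw=randwrite
bs=4k
iodepth=32
filename=/dev/vdb
"""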

Thanks for the hint with mysql benchmarking. I will try it out.
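
If I get around to it, I would probably generate the InnoDB load with sysbench
rather than one of the tools from the linked page, simply because it is easier
to script. A rough sketch (sysbench 1.0 syntax; host, database and credentials
are placeholders):

"""
# prepare the test tables once
sysbench oltp_read_write --mysql-host=db-test --mysql-db=sbtest \
  --mysql-user=sbtest --mysql-password=secret \
  --tables=16 --table-size=1000000 prepare

# run the workload while a rebalance is going on and watch the latencies
sysbench oltp_read_write --mysql-host=db-test --mysql-db=sbtest \
  --mysql-user=sbtest --mysql-password=secret \
  --tables=16 --table-size=1000000 \
  --threads=16 --time=600 --report-interval=10 run
"""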

Cheers
Nick

On Friday, January 08, 2016 06:59:13 AM Josef Johansson wrote:
> Hi,
> 
> How did you benchmark?
> 
> I would recommend running a lot of MySQL instances with a lot of heavily
> utilised InnoDB tables. During a recovery you should at least see the latency
> rise. Maybe use one of the tools here:
> https://dev.mysql.com/downloads/benchmarks.html
> 
> Regards,
> Josef
> 
> On 7 Jan 2016 16:36, "Robert LeBlanc" <robert@xxxxxxxxxxxxx> wrote:
> > With these min/max settings, we didn't have any problems going to more
> > backfills.
> > 
> > Robert LeBlanc
> > 
> > Sent from a mobile device please excuse any typos.
> > 
> > On Jan 7, 2016 8:30 AM, "nick" <nick@xxxxxxx> wrote:
> >> Heya,
> >> thank you for your answers. We will try to set 16/32 as values for
> >> osd_backfill_scan_[min|max]. I also set the debug logging config. Here is an
> >> excerpt of our new ceph.conf:
> >> 
> >> """
> >> [osd]
> >> osd max backfills = 1
> >> osd backfill scan max = 32
> >> osd backfill scan min = 16
> >> osd recovery max active = 1
> >> osd recovery op priority = 1
> >> osd op threads = 8
> >> 
> >> [global]
> >> debug optracker = 0/0
> >> debug asok = 0/0
> >> debug hadoop = 0/0
> >> debug mds migrator = 0/0
> >> debug objclass = 0/0
> >> debug paxos = 0/0
> >> debug context = 0/0
> >> debug objecter = 0/0
> >> debug mds balancer = 0/0
> >> debug finisher = 0/0
> >> debug auth = 0/0
> >> debug buffer = 0/0
> >> debug lockdep = 0/0
> >> debug mds log = 0/0
> >> debug heartbeatmap = 0/0
> >> debug journaler = 0/0
> >> debug mon = 0/0
> >> debug client = 0/0
> >> debug mds = 0/0
> >> debug throttle = 0/0
> >> debug journal = 0/0
> >> debug crush = 0/0
> >> debug objectcacher = 0/0
> >> debug filer = 0/0
> >> debug perfcounter = 0/0
> >> debug filestore = 0/0
> >> debug rgw = 0/0
> >> debug monc = 0/0
> >> debug rbd = 0/0
> >> debug tp = 0/0
> >> debug osd = 0/0
> >> debug ms = 0/0
> >> debug mds locker = 0/0
> >> debug timer = 0/0
> >> debug mds log expire = 0/0
> >> debug rados = 0/0
> >> debug striper = 0/0
> >> debug rbd replay = 0/0
> >> debug none = 0/0
> >> debug keyvaluestore = 0/0
> >> debug compressor = 0/0
> >> debug crypto = 0/0
> >> debug xio = 0/0
> >> debug civetweb = 0/0
> >> debug newstore = 0/0
> >> """
> >> 
> >> I already ran a benchmark on our staging setup with the new config and fio,
> >> but did not really get results that differ from before.
> >> 
> >> For us it is hardly possible to reproduce the 'stalling' problems on the
> >> staging cluster, so I will have to wait and test this in production.
> >> 
> >> Does anyone know if 'osd max backfills' > 1 could have an impact as well?
> >> The default seems to be 10...
> >> 
> >> Cheers
> >> Nick
> >> 
> >> On Wednesday, January 06, 2016 09:17:43 PM Josef Johansson wrote:
> >> > Hi,
> >> > 
> >> > Also make sure that you optimize the debug log config. There's a lot on
> >> > the ML on how to set them all to low values (0/0).
> >> > 
> >> > Not sure how it is in infernalis, but it made a big difference in previous
> >> > versions.
> >> > 
> >> > Regards,
> >> > Josef
> >> > 
> >> > On 6 Jan 2016 18:16, "Robert LeBlanc" <robert@xxxxxxxxxxxxx> wrote:
> >> > > -----BEGIN PGP SIGNED MESSAGE-----
> >> > > Hash: SHA256
> >> > > 
> >> > > There has been a lot of "discussion" about osd_backfill_scan_[min,max]
> >> > > lately. My experience with hammer has been the opposite of what people
> >> > > have said before. Increasing those values has reduced the load of
> >> > > recovery for us and has prevented a lot of the disruption in our cluster
> >> > > caused by backfilling. It does increase the amount of time the recovery
> >> > > takes (adding a new node to the cluster took about 3-4 hours before, now
> >> > > it takes about 24 hours).
> >> > > 
> >> > > We are currently using these values and they seem to work well for us:
> >> > > osd_max_backfills = 1
> >> > > osd_backfill_scan_min = 16
> >> > > osd_recovery_max_active = 1
> >> > > osd_backfill_scan_max = 32
> >> > > 
> >> > > I would be interested in your results if you try these values.
> >> > > -----BEGIN PGP SIGNATURE-----
> >> > > Version: Mailvelope v1.3.2
> >> > > Comment: https://www.mailvelope.com
> >> > > 
> >> > > wsFcBAEBCAAQBQJWjUu/CRDmVDuy+mK58QAArdMQAI+0Er/sdN7TF7knGey2
> >> > > 5wJ6Ie81KJlrt/X9fIMpFdwkU2g5ET+sdU9R2hK4XcBpkonfGvwS8Ctha5Aq
> >> > > XOJPrN4bMMeDK9Z4angK86ioLJevTH7tzp3FZL0U4Kbt1s9ZpwF6t+wlvkKl
> >> > > mt6Tkj4VKr0917TuXqk58AYiZTYcEjGAb0QUe/gC24yFwZYrPO0vUVb4gmTQ
> >> > > klNKAdTinGSn4Ynj+lBsEstWGVlTJiL3FA6xRBTz1BSjb4vtb2SoIFwHlAp+
> >> > > GO+bKSh19YIasXCZfRqC/J2XcNauOIVfb4l4viV23JN2fYavEnLCnJSglYjF
> >> > > Rjxr0wK+6NhRl7naJ1yGNtdMkw+h+nu/xsbYhNqT0EVq1d0nhgzh6ZjAhW1w
> >> > > oRiHYA4KNn2uWiUgigpISFi4hJSP4CEPToO8jbhXhARs0H6v33oWrR8RYKxO
> >> > > dFz+Lxx969rpDkk+1nRks9hTeIF+oFnW7eezSiR6TILYxvCZQ0ThHXQsL4ph
> >> > > bvUr0FQmdV3ukC+Xwa/cePIlVY6JsIQfOlqmrtG7caTZWLvLUDwrwcleb272
> >> > > 243GXlbWCxoI7+StJDHPnY2k7NHLvbN2yG3f5PZvZaBgqqyAP8Fnq6CDtTIE
> >> > > vZ/p+ZcuRw8lqoDgjjdiFyMmhQnFcCtDo3vtIy/UXDw23AVsI5edUyyP/sHt
> >> > > ruPt
> >> > > =X7SH
> >> > > -----END PGP SIGNATURE-----
> >> > > ----------------
> >> > > Robert LeBlanc
> >> > > PGP Fingerprint 79A2 9CA4 6CC4 45DD A904  C70E E654 3BB2 FA62 B9F1
> >> > > 
> >> > > On Wed, Jan 6, 2016 at 7:13 AM, nick <nick@xxxxxxx> wrote:
> >> > > > Heya,
> >> > > > we are using a ceph cluster (6 nodes, each having 10x 4TB HDD + 2x SSD
> >> > > > for the journals) in combination with KVM virtualization. All our
> >> > > > virtual machine hard disks are stored on the ceph cluster. The ceph
> >> > > > cluster was updated to the 'infernalis' release recently.
> >> > > > 
> >> > > > We are experiencing problems during cluster maintenance. A normal
> >> > > > workflow for us looks like this:
> >> > > > 
> >> > > > - set the noout flag for the cluster
> >> > > > - stop all OSDs on one node
> >> > > > - update the node
> >> > > > - reboot the node
> >> > > > - start all OSDs
> >> > > > - wait for the backfilling to finish
> >> > > > - unset the noout flag
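> >> > > > 
> >> > > > In command form that is roughly the following (a sketch; we use the
> >> > > > systemd units, adjust to whatever your distribution provides):
> >> > > > 
> >> > > > """
> >> > > > ceph osd set noout
> >> > > > systemctl stop ceph-osd.target    # on the node being maintained
> >> > > > # ... update packages, reboot ...
> >> > > > systemctl start ceph-osd.target
> >> > > > ceph -w                           # watch until backfilling has finished
> >> > > > ceph osd unset noout
> >> > > > """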
> >> > > > 
> >> > > > After we start all OSDs on the node again, the cluster backfills and
> >> > > > tries to get all the OSDs in sync. During the beginning of this process
> >> > > > we experience 'stalls' in our running virtual machines. On some the
> >> > > > load rises to a very high value; on others a running webserver responds
> >> > > > only with 5xx HTTP codes. It takes around 5-6 minutes until all is ok
> >> > > > again. After those 5-6 minutes the cluster is still backfilling, but
> >> > > > the virtual machines behave normally again.
> >> > > 
> >> > > > I already set the following parameters in ceph.conf on the nodes to
> >> > > > have a better rebalance traffic/user traffic ratio:
> >> > > > 
> >> > > > """
> >> > > > [osd]
> >> > > > osd max backfills = 1
> >> > > > osd backfill scan max = 8
> >> > > > osd backfill scan min = 4
> >> > > > osd recovery max active = 1
> >> > > > osd recovery op priority = 1
> >> > > > osd op threads = 8
> >> > > > """
> >> > > > 
> >> > > > It helped a bit, but we are still experiencing the problems described
> >> > > > above. It feels as if some virtual hard disks are locked for a short
> >> > > > time. Our ceph nodes are using bonded 10G network interfaces for the
> >> > > > 'OSD network', so I do not think that the network is a bottleneck.
> >> > > > 
> >> > > > After reading this blog post:
> >> > > > http://dachary.org/?p=2182
> >> > > > I wonder if there is really a 'read lock' during the object push.
> >> > > > 
> >> > > > Does anyone know more about this, or do others have the same problems
> >> > > > and were able to fix them?
> >> > > > 
> >> > > > Best Regards
> >> > > > Nick
> >> > > > 
> >> > > > --
> >> > > > Sebastian Nickel
> >> > > > Nine Internet Solutions AG, Albisriederstr. 243a, CH-8047 Zuerich
> >> > > > Tel +41 44 637 40 00 | Support +41 44 637 40 40 | www.nine.ch
> >> 
> >> --
> >> Sebastian Nickel
> >> Nine Internet Solutions AG, Albisriederstr. 243a, CH-8047 Zuerich
> >> Tel +41 44 637 40 00 | Support +41 44 637 40 40 | www.nine.ch
> > 
 
-- 
Sebastian Nickel
Nine Internet Solutions AG, Albisriederstr. 243a, CH-8047 Zuerich
Tel +41 44 637 40 00 | Support +41 44 637 40 40 | www.nine.ch


_______________________________________________
ceph-users mailing list
ceph-users@xxxxxxxxxxxxxx
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
