> On 28 October 2016 at 13:18, Kees Meijs <kees@xxxxxxxx> wrote:
>
> Hi,
>
> On 28-10-16 12:06, wido@xxxxxxxx wrote:
> > I don't like this personally. Your cluster should be capable of doing
> > a deep scrub at any moment. If not, it will also not be able to handle
> > a node failure during peak times.
>
> Valid point and I totally agree. Unfortunately, the current load doesn't
> give me much of a choice, I'm afraid. Tweaking and extending the cluster
> hardware (e.g. more and faster spinners) makes more sense, but we're not
> there yet.
>

Ok, just wanted to mention it.

> Maybe the new parameters help us towards being "always capable".
> Let's hope for the best and see what'll happen. ;-) If it works out, I
> could (and will) remove the time constraints.
>
> > * osd_scrub_sleep .1
> >
> > You can try to bump that even more.
>
> Thank you for pointing that out. I'm unsure about the osd_scrub_sleep
> parameter behaviour (documentation is scarce). Could you please shed a
> little light on this?

It is how long the OSD sleeps between scrub operations. It gives the
underlying disk time to do other I/O.

https://github.com/ceph/ceph/blob/master/src/osd/PG.cc#L3949

The scrub will take more time, but it will also be less intensive.

While doing that you might also want to set the priorities:

osd disk thread ioprio class = idle
osd disk thread ioprio priority = 3

osd recovery op priority = 5
osd client op priority = 63

Make sure you use the CFQ disk scheduler for your disks, though.

Wido

>
> Cheers,
> Kees
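
For reference, a minimal sketch of how the values discussed above could be
applied, assuming they live in the [osd] section of ceph.conf (the numbers
are simply the ones from this thread, not a general recommendation):

[osd]
osd scrub sleep = 0.1
osd disk thread ioprio class = idle
osd disk thread ioprio priority = 3
osd recovery op priority = 5
osd client op priority = 63

They can also be injected into running OSDs without a restart, for example:

ceph tell osd.* injectargs '--osd_scrub_sleep 0.1 --osd_disk_thread_ioprio_class idle --osd_disk_thread_ioprio_priority 3'

The injectargs syntax differs slightly between Ceph releases, so treat this
as an illustration rather than the canonical invocation.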
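
Since the ioprio settings only have an effect under CFQ, the active
scheduler can be checked and changed per disk through sysfs; sdb below is
just a placeholder for an OSD data disk:

cat /sys/block/sdb/queue/scheduler
echo cfq > /sys/block/sdb/queue/scheduler

The first command lists the available schedulers with the active one in
brackets; the echo does not survive a reboot, so a udev rule or the
elevator= kernel parameter is the usual way to make it permanent.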