Hello,

On Thu, 28 Sep 2017 22:36:22 +0000 Gregory Farnum wrote:

> Also, realize the deep scrub interval is a per-PG thing and (unfortunately)
> the OSD doesn't use a global view of its PG deep scrub ages to try and
> schedule them intelligently across that time. If you really want to try and
> force this out, I believe a few sites have written scripts to do it by
> turning off deep scrubs, forcing individual PGs to deep scrub at intervals,
> and then enabling deep scrubs again.
> -Greg
>

This approach works best and w/o surprises down the road if
osd_scrub_interval_randomize_ratio is disabled and osd_scrub_begin_hour and
osd_scrub_end_hour are set to your needs.

I basically kick the deep scrubs off on a per-OSD basis (one at a time and
staggered, of course), and if your cluster is small/fast enough that pattern
will be retained indefinitely, with only one PG doing a deep scrub at any
given time (with the default max scrubs of 1, of course).

Christian

> On Wed, Sep 27, 2017 at 6:34 AM David Turner <drakonstein@xxxxxxxxx> wrote:
>
> > This isn't an answer, but a suggestion to try and help track it down, as
> > I'm not sure what the problem is. Try querying the admin socket for your
> > OSDs and look through all of their config options and settings for
> > something that might explain why you have multiple deep scrubs happening
> > on a single OSD at the same time.
> >
> > However, if you misspoke and only have 1 deep scrub per OSD but multiple
> > per node, then what you are seeing is expected behavior. I believe that
> > Luminous added a sleep setting for scrub IO that also might help. Looking
> > through the admin socket dump of settings for anything scrub-related
> > should give you some ideas of things to try.
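
A minimal example of that admin-socket query, assuming an OSD id of 0 and the
default socket path (adjust both for your deployment):

  ceph daemon osd.0 config show | grep scrub

or, talking to the socket directly:

  ceph --admin-daemon /var/run/ceph/ceph-osd.0.asok config show | grep scrub
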
> > On Tue, Sep 26, 2017, 2:04 PM J David <j.david.lists@xxxxxxxxx> wrote:
> >
> >> With "osd max scrubs" set to 1 in ceph.conf, which I believe is also
> >> the default, at almost all times there are 2-3 deep scrubs running.
> >>
> >> Three simultaneous deep scrubs are enough to cause a constant stream of:
> >>
> >> mon.ceph1 [WRN] Health check update: 69 slow requests are blocked > 32
> >> sec (REQUEST_SLOW)
> >>
> >> This seems to correspond with all three deep scrubs hitting the same
> >> OSD at the same time, starving out all other I/O requests for that
> >> OSD. But it can happen less frequently and less severely with two or
> >> even one deep scrub running. Nonetheless, consumers of the cluster
> >> are not thrilled with regular instances of 30-60 second disk I/Os.
> >>
> >> The cluster is five nodes, 15 OSDs, and there is one pool with 512
> >> placement groups. The cluster is running:
> >>
> >> ceph version 12.2.0 (32ce2a3ae5239ee33d6150705cdb24d43bab910c)
> >> luminous (rc)
> >>
> >> All of the OSDs are bluestore, with HDD storage and SSD block.db.
> >>
> >> Even setting "osd deep scrub interval = 1843200" hasn't resolved this
> >> issue, though it seems to get the number down from 3 to 2, which at
> >> least cuts down on the frequency of requests stalling out. With 512
> >> pgs and a 1843200-second (512-hour) interval, one pg should get
> >> deep-scrubbed per hour, and a deep scrub seems to take about 20
> >> minutes. So what should be happening is that 1/3rd of the time there
> >> should be one deep scrub, and 2/3rds of the time there shouldn't be
> >> any. Yet instead we have 2-3 deep scrubs running at all times.
> >>
> >> Looking at "ceph pg dump" shows that about 7 deep scrubs get launched
> >> per hour:
> >>
> >> $ sudo ceph pg dump | fgrep active | awk '{print $23" "$24" "$1}' |
> >> fgrep 2017-09-26 | sort -rn | head -22
> >> dumped all
> >> 2017-09-26 16:42:46.781761 0.181
> >> 2017-09-26 16:41:40.056816 0.59
> >> 2017-09-26 16:39:26.216566 0.9e
> >> 2017-09-26 16:26:43.379806 0.19f
> >> 2017-09-26 16:24:16.321075 0.60
> >> 2017-09-26 16:08:36.095040 0.134
> >> 2017-09-26 16:03:33.478330 0.b5
> >> 2017-09-26 15:55:14.205885 0.1e2
> >> 2017-09-26 15:54:31.413481 0.98
> >> 2017-09-26 15:45:58.329782 0.71
> >> 2017-09-26 15:34:51.777681 0.1e5
> >> 2017-09-26 15:32:49.669298 0.c7
> >> 2017-09-26 15:01:48.590645 0.1f
> >> 2017-09-26 15:01:00.082014 0.199
> >> 2017-09-26 14:45:52.893951 0.d9
> >> 2017-09-26 14:43:39.870689 0.140
> >> 2017-09-26 14:28:56.217892 0.fc
> >> 2017-09-26 14:28:49.665678 0.e3
> >> 2017-09-26 14:11:04.718698 0.1d6
> >> 2017-09-26 14:09:44.975028 0.72
> >> 2017-09-26 14:06:17.945012 0.8a
> >> 2017-09-26 13:54:44.199792 0.ec
> >>
> >> What's going on here?
> >>
> >> Why isn't the limit on scrubs being honored?
> >>
> >> It would also be great if scrub I/O were surfaced in "ceph status" the
> >> way recovery I/O is, especially since it can have such a significant
> >> impact on client operations.
> >>
> >> Thanks!

--
Christian Balzer        Network/Systems Engineer
chibi@xxxxxxx           Rakuten Communications

_______________________________________________
ceph-users mailing list
ceph-users@xxxxxxxxxxxxxx
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
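
For anyone who wants to try the approach Greg describes at the top of the
thread (turn off deep scrubs, force individual PGs to deep scrub at intervals,
then turn deep scrubs back on), an untested sketch might look like the
following; the PG selection and the sleep interval are assumptions to tune for
your own cluster:

#!/bin/bash
# Sketch only: stop the OSDs from scheduling deep scrubs themselves,
# then walk the active PGs one at a time.
ceph osd set nodeep-scrub

for pg in $(ceph pg dump pgs_brief 2>/dev/null | awk '/active/ {print $1}'); do
    ceph pg deep-scrub "$pg"
    sleep 1200   # assumed ~20 minutes per deep scrub; adjust to your hardware
done

# Let the OSDs schedule deep scrubs on their own again.
ceph osd unset nodeep-scrub

Christian's per-OSD variant would instead issue "ceph osd deep-scrub <osd-id>"
for one OSD at a time, staggered, with osd_scrub_interval_randomize_ratio set
to 0 and osd_scrub_begin_hour/osd_scrub_end_hour restricting the scrub window.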