I think most of the noise so far has been about the fairly narrow problem
that all of the osds tend to like to scrub at the same time. I think we
could get a lot of the way to fixing that by simply randomizing the per-pg
scrub schedule time when the pg registers itself for the next scrub.
-Sam

On Mon, Feb 9, 2015 at 7:41 AM, Gregory Farnum <greg@xxxxxxxxxxx> wrote:
> On Sun, Feb 8, 2015 at 1:38 PM, Sage Weil <sweil@xxxxxxxxxx> wrote:
>> Simon Leinen at Switch did a great post recently about the impact of
>> scrub on their cluster(s):
>>
>> http://blog.simon.leinen.ch/2015/02/ceph-deep-scrubbing-impact.html
>>
>> Basically the 2-week deep scrub interval kicks in on exactly a 2-week
>> cycle and the cluster goes crazy for a few hours and then does nothing
>> (but client IO) for the next two weeks.
>>
>> The options governing this are:
>>
>>  OPTION(osd_scrub_min_interval, OPT_FLOAT, 60*60*24)    // if load is low
>>  OPTION(osd_scrub_max_interval, OPT_FLOAT, 7*60*60*24)  // regardless of load
>>  OPTION(osd_deep_scrub_interval, OPT_FLOAT, 60*60*24*7) // once a week
>>  OPTION(osd_scrub_load_threshold, OPT_FLOAT, 0.5)
>>
>> That is, if the load is < .5 (probably almost never on a real cluster) it
>> will scrub every day; otherwise (regardless of load) it will scrub each PG
>> at least once a week.
>>
>> Several things we can do here:
>>
>> 1- Maybe the shallow scrub interval should be less than the deep scrub
>> interval?
>>
>> 2- There is a new feature for hammer that limits scrub to certain times of
>> day, contributed by Xinze Chi:
>>
>>  OPTION(osd_scrub_begin_hour, OPT_INT, 0)
>>  OPTION(osd_scrub_end_hour, OPT_INT, 24)
>>
>> That is, by default, scrubs can happen at any time. You can use this to
>> limit scrubs to certain hours of the night, or whatever is appropriate
>> for your cluster. That only sort of helps, though; Simon's scrub frenzy
>> will still happen one day a week, all at once (or maybe spread over 2
>> nights).
>>
>> 3- We can spread them out during the allowed window. But how to do that?
>> We could make the scrub interval randomly +/- a value of up to 50% of the
>> total interval. Or we could somehow look at the current rate of scrubbing
>> (average time to completion for the current pool, maybe), or at the total
>> number of items in the scrub queue?
>>
>> 4- Ric pointed out to me that even if we spread these out, scrubbing at
>> full speed has an impact. Even if we do all the prioritization magic we
>> can, there will still be a bunch of large IOs in the queue. What if we
>> have a hard throttle on the scrub rate, objects per second and/or bytes
>> per second? In the end the same number of IOs traverse the queue and
>> potentially interfere with client IO, but they would be spread out over a
>> longer period of time and be less noticeable (i.e., slow down client IOs
>> from different workloads and not all the same workload). I'm not totally
>> convinced this is an improvement over a strategy where we have only 1
>> scrub IO in flight at all times, but that isn't quite how scrub schedules
>> itself so it's hard to compare it that way, and in the end the user
>> experience and perceived impact should be lower...
>>
>> 5- Auto-adjust the above scrub rate based on the total amount of data,
>> scrub interval, and scrub hours so that we are scrubbing at the slowest
>> rate possible that meets the schedule. We'd have to be slightly clever to
>> have the right feedback in place here...
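For concreteness, here is a minimal sketch of the randomized-interval idea
from Sam's note and Sage's point (3) above. It is not actual Ceph code; the
helper name next_scrub_time and its parameters are made up for illustration.
The idea is just to jitter the configured interval by up to +/-50% each time
a PG registers for its next scrub:

// A minimal sketch, not actual Ceph code: jitter the scrub interval by up
// to +/-50% each time a PG registers for its next scrub, so that PGs that
// were first scrubbed together drift apart over a few cycles.
#include <chrono>
#include <random>

using Clock = std::chrono::system_clock;

// scrub_interval_sec would come from osd_scrub_min_interval (or the
// deep-scrub interval for deep scrubs); jitter_ratio is the +/- fraction.
Clock::time_point next_scrub_time(Clock::time_point last_scrub,
                                  double scrub_interval_sec,
                                  double jitter_ratio = 0.5)
{
  static thread_local std::mt19937_64 rng{std::random_device{}()};
  std::uniform_real_distribution<double> jitter(-jitter_ratio, jitter_ratio);
  double interval = scrub_interval_sec * (1.0 + jitter(rng));
  return last_scrub +
         std::chrono::duration_cast<Clock::duration>(
             std::chrono::duration<double>(interval));
}

Because each PG draws a fresh jitter every cycle, PGs that happened to start
out synchronized spread themselves across the interval after a few cycles
rather than all coming due in the same hour.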
> Right. Fundamentally what we're trying to do here is schedule N IOs
> against a cluster within time period T, ideally without impacting any
> of the client IOs issued against the cluster. Depending on how many
> client IOs there are, and when they come in, that might be easy or
> might be impossible (because they're using up all the IO capacity in
> the cluster themselves).
>
> So scheduling options can help by directing the scrubbing to occur at
> times when we don't expect client IO (at night or whatever), but in
> the general case I don't think we can actually solve it. What we can
> do is:
> 1) Try and figure out how much IO is required to do scrubbing, and
> alert users if their configuration won't succeed,
> 2) Prioritize scrubbing traffic against client IO more effectively
> than we do right now.
>
> I think (2) has been on Sam's list of things to do for a while now, by
> making scrub ops into normal operations that go through a shared work
> queue with priority attached: http://tracker.ceph.com/issues/8635
> There's not a lot of detail there unfortunately, but if scrubbing were
> a regular operation it would solve many of the conflicts:
> * low priority would mean that instead of being 1 IO at a time, it
> would get time roughly proportional to its priority
> * scheduling would make it less likely to hit conflicts, and mean that
> in future we could even get clever about avoiding or backing off scrub
> on something in use by clients
> * it more naturally spreads out the scrub workload and makes it
> quickly apparent if the requested scrub rate is unsustainable (just
> track the rate of scrub completions against what we'd need it to be
> for success)
>
> I think that's probably a useful first step, and I'm pretty sure the
> general case doesn't have a closed-form solution, so I'm leery of
> trying to build up a big system before it's in place.
> -Greg
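As a back-of-the-envelope companion to Greg's point (1) and Sage's point (5),
here is a minimal sketch of the feasibility check: derive the slowest
sustainable scrub rate from the deep-scrub interval and the allowed scrub
hours, and compare it with the rate of scrub completions actually observed.
The names (ScrubBudget, required_scrub_rate) and the numbers in main() are
made up for illustration, not taken from Ceph:

// A minimal sketch, not actual Ceph code: compute the slowest scrub rate
// that still meets the schedule and compare it with the observed rate, so
// an unsustainable configuration can be flagged.
#include <cstdint>
#include <cstdio>

struct ScrubBudget {
  uint64_t total_objects;        // objects that must be scrubbed per cycle
  double   deep_interval_sec;    // e.g. osd_deep_scrub_interval (1 week)
  int      begin_hour, end_hour; // e.g. osd_scrub_begin_hour/end_hour
};

// Minimum objects/sec we must sustain, averaged over the allowed window.
double required_scrub_rate(const ScrubBudget& b)
{
  double window_fraction = (b.end_hour - b.begin_hour) / 24.0;
  double usable_sec = b.deep_interval_sec * window_fraction;
  return usable_sec > 0 ? b.total_objects / usable_sec : -1.0;
}

int main()
{
  // 10M objects, 1-week deep scrub interval, scrubs allowed 01:00-05:00.
  ScrubBudget b{10000000, 7 * 24 * 3600.0, 1, 5};
  double need = required_scrub_rate(b);
  double observed = 12.0;  // objects/sec actually completed, from scrub stats
  std::printf("need %.1f obj/s, observed %.1f obj/s -> %s\n",
              need, observed, observed >= need ? "on schedule" : "falling behind");
  return 0;
}

If the observed completion rate stays below the required rate, the schedule
cannot be met with the current configuration, which is exactly the case where
a warning to the user (or a higher scrub rate) would be warranted.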