Thanks Sage!

----------------------------------------
> Date: Mon, 9 Feb 2015 02:24:33 -0800
> From: sweil@xxxxxxxxxx
> To: yguang11@xxxxxxxxxxx
> CC: ceph-devel@xxxxxxxxxxxxxxx; simon.leinen@xxxxxxxxx
> Subject: RE: scrub scheduling
>
> On Mon, 9 Feb 2015, GuangYang wrote:
>> Hi Sage,
>> Another potential problem with scrub scheduling, as observed in our
>> deployment (2 PB cluster, 70% full), was that some PGs hadn't been
>> scrubbed for 1.5 months, even though we have the configuration to do
>> deep scrubbing weekly.
>>
>> Given our deployment and how full the cluster is, as well as the
>> conservative setting for scrubbing (osd_max_scrubs = 1), one round of
>> scrubbing would not finish within one week, so we probably should
>> schedule it monthly (with weekly shallow scrubbing).
>
> This is simply a function of the amount of data and speed of the backend,
> right? Nothing we can fix in Ceph?
[yguang] Right, just wanted to put more context there. Sorry about the confusion.
>
>> Another problem is that the scheduling of scrub is currently optimized
>> locally at each OSD: among the PGs for which this OSD acts as primary,
>> it selects the one that has gone without scrubbing the longest, makes
>> it the candidate, and requests scrub reservations from all replicas.
>> Since each OSD can only have one active scrub, that active slot could
>> be permanently occupied by a replica; as a result, a PG whose primary
>> is this OSD may never get scheduled and is left behind.
>>
>> Is this issue worth an enhancement?
>
> Good point. Yeah, I think it's definitely worth fixing!
[yguang] I just opened a ticket to track it - http://tracker.ceph.com/issues/10796
>
> sage
>
>>
>> Thanks,
>> Guang
>>
>>
>> ----------------------------------------
>>> Date: Sun, 8 Feb 2015 13:38:28 -0800
>>> From: sweil@xxxxxxxxxx
>>> To: ceph-devel@xxxxxxxxxxxxxxx
>>> CC: simon.leinen@xxxxxxxxx
>>> Subject: scrub scheduling
>>>
>>> Simon Leinen at Switch did a great post recently about the impact of
>>> scrub on their cluster(s):
>>>
>>> http://blog.simon.leinen.ch/2015/02/ceph-deep-scrubbing-impact.html
>>>
>>> Basically the 2-week deep scrub interval kicks in on exactly a 2-week
>>> cycle: the cluster goes crazy for a few hours and then does nothing
>>> (but client IO) for the next two weeks.
>>>
>>> The options governing this are:
>>>
>>> OPTION(osd_scrub_min_interval, OPT_FLOAT, 60*60*24) // if load is low
>>> OPTION(osd_scrub_max_interval, OPT_FLOAT, 7*60*60*24) // regardless of load
>>> OPTION(osd_deep_scrub_interval, OPT_FLOAT, 60*60*24*7) // once a week
>>> OPTION(osd_scrub_load_threshold, OPT_FLOAT, 0.5)
>>>
>>> That is, if the load is < .5 (probably almost never on a real cluster) it
>>> will scrub every day; otherwise (regardless of load) it will scrub each PG
>>> at least once a week.
>>>
>>> Several things we can do here:
>>>
>>> 1- Maybe the shallow scrub interval should be less than the deep scrub
>>> interval?
>>>
>>> 2- There is a new feature for hammer that limits scrub to certain times of
>>> day, contributed by Xinze Chi:
>>>
>>> OPTION(osd_scrub_begin_hour, OPT_INT, 0)
>>> OPTION(osd_scrub_end_hour, OPT_INT, 24)
>>>
>>> That is, by default, scrubs can happen at any time. You can use this to
>>> limit scrubbing to certain hours of the night, or whatever is appropriate
>>> for your cluster. That only sort of helps, though; Simon's scrub frenzy
>>> will still happen one day a week, all at once (or maybe spread over 2
>>> nights).
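As a rough illustration, a time-of-day gate driven by those two options
might look like the following (a minimal sketch only, not the actual OSD
code; the function name and the wrap-around handling are assumptions):

#include <ctime>

// Returns true if the local wall-clock hour falls inside the configured
// scrub window. Handles windows that wrap past midnight, e.g. begin=22,
// end=6. The defaults (begin=0, end=24) always permit scrubbing.
bool scrub_time_permitted(int begin_hour, int end_hour) {
  time_t now = time(nullptr);
  struct tm bdt;
  localtime_r(&now, &bdt);
  int h = bdt.tm_hour;                       // 0..23
  if (begin_hour < end_hour)
    return h >= begin_hour && h < end_hour;  // plain daytime window
  return h >= begin_hour || h < end_hour;    // window wraps midnight
}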
>>> 3- We can spread them out during the allowed window. But how to do that?
>>> We could make the scrub interval randomly +/- a value of up to 50% of the
>>> total interval. Or we could somehow look at the current rate of scrubbing
>>> (average time to completion for the current pool, maybe), or at the total
>>> number of items in the scrub queue?
>>>
>>> 4- Ric pointed out to me that even if we spread these out, scrubbing at
>>> full speed has an impact. Even if we do all the prioritization magic we
>>> can, there will still be a bunch of large IOs in the queue. What if we had
>>> a hard throttle on the scrub rate, in objects per second and/or bytes per
>>> second? In the end the same number of IOs traverse the queue and
>>> potentially interfere with client IO, but they would be spread out over a
>>> longer period of time and be less noticeable (i.e., they would slow down
>>> client IOs from different workloads and not all from the same workload).
>>> I'm not totally convinced this is an improvement over a strategy where we
>>> have only 1 scrub IO in flight at all times, but that isn't quite how
>>> scrub schedules itself so it's hard to compare it that way, and in the end
>>> the user experience and perceived impact should be lower...
>>>
>>> 5- Auto-adjust the above scrub rate based on the total amount of data,
>>> scrub interval, and scrub hours so that we are scrubbing at the slowest
>>> rate possible that still meets the schedule. We'd have to be slightly
>>> clever to have the right feedback in place here...
>>>
>>> Thoughts?
>>> sage
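To make option 3 concrete, the +/-50% randomization could be as simple as
the sketch below (illustrative only; the function name and the choice of a
uniform distribution are assumptions, not an agreed design):

#include <random>

// Spread the next scrub deadline by a random offset of up to half the
// configured interval, so that all PGs don't come due on the same tick.
double jittered_interval(double base_interval,      // e.g. osd_deep_scrub_interval
                         double jitter_ratio = 0.5) // +/- 50%
{
  static thread_local std::mt19937 rng{std::random_device{}()};
  std::uniform_real_distribution<double> d(-jitter_ratio, jitter_ratio);
  return base_interval * (1.0 + d(rng));
}

// e.g. next_deep_scrub = last_deep_scrub + jittered_interval(60*60*24*7);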
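Option 4's hard throttle could be a plain token bucket in front of the
scrub reads, along these lines (a sketch under assumed names, not Ceph's
existing Throttle machinery):

#include <algorithm>
#include <chrono>

class ScrubThrottle {
  double rate_;    // allowed scrub bytes per second
  double burst_;   // bucket capacity in bytes
  double tokens_;  // currently available bytes
  std::chrono::steady_clock::time_point last_;
public:
  ScrubThrottle(double bytes_per_sec, double burst_bytes)
    : rate_(bytes_per_sec), burst_(burst_bytes), tokens_(burst_bytes),
      last_(std::chrono::steady_clock::now()) {}

  // True if `bytes` of scrub IO may be issued now; otherwise the caller
  // requeues the scrub chunk and retries later. Tokens refill with time.
  bool try_consume(double bytes) {
    auto now = std::chrono::steady_clock::now();
    double dt = std::chrono::duration<double>(now - last_).count();
    last_ = now;
    tokens_ = std::min(burst_, tokens_ + dt * rate_);
    if (tokens_ < bytes)
      return false;
    tokens_ -= bytes;
    return true;
  }
};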
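For option 5, the target rate itself is just arithmetic over the inputs the
mail lists (total data, scrub interval, allowed hours); a sketch with
illustrative numbers:

// Slowest scrub rate (bytes/sec) that still meets the deadline, given how
// much data the OSD must scrub, the configured interval, and how many
// hours per day the osd_scrub_begin_hour/end_hour window allows.
double required_scrub_rate(double total_bytes, double interval_secs,
                           double window_hours)
{
  double usable_secs = interval_secs * (window_hours / 24.0);
  return total_bytes / usable_secs;  // feed this into the throttle above
}

// e.g. 4 TB per OSD, weekly deep scrub, 8-hour nightly window:
// required_scrub_rate(4e12, 60*60*24*7, 8) ~= 19.8 MB/s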