Hi Sage,

Another potential problem with scrub scheduling, as observed in our deployment (2PB cluster, 70% full), is that some PGs had not been scrubbed for 1.5 months, even though we have deep scrubbing configured to run weekly. Given the size of our deployment, how full the cluster is, and our conservative scrub setting (osd_max_scrubs = 1), one round of scrubbing cannot finish in one week, so we probably should schedule deep scrubbing monthly (with weekly shallow scrubbing).

Another problem is that scrub scheduling is currently optimized locally at each OSD. That is, among the PGs for which this OSD acts as the primary, it selects the one that has gone longest without being scrubbed, makes it the candidate, and requests a scrub reservation from all replicas. Since each OSD can have only 1 active scrub, that slot could potentially always be occupied by a scrub in which this OSD is a replica. As a result, the PGs whose primary is this OSD fail to get scheduled and are left behind.

Is this issue worth an enhancement?

Thanks,
Guang

----------------------------------------
> Date: Sun, 8 Feb 2015 13:38:28 -0800
> From: sweil@xxxxxxxxxx
> To: ceph-devel@xxxxxxxxxxxxxxx
> CC: simon.leinen@xxxxxxxxx
> Subject: scrub scheduling
>
> Simon Leinen at Switch did a great post recently about the impact of
> scrub on their cluster(s):
>
> http://blog.simon.leinen.ch/2015/02/ceph-deep-scrubbing-impact.html
>
> Basically the 2 week deep scrub interval kicks in on exactly a 2 week
> cycle and the cluster goes crazy for a few hours and then does nothing
> (but client IO) for the next two weeks.
>
> The options governing this are:
>
> OPTION(osd_scrub_min_interval, OPT_FLOAT, 60*60*24) // if load is low
> OPTION(osd_scrub_max_interval, OPT_FLOAT, 7*60*60*24) // regardless of load
> OPTION(osd_deep_scrub_interval, OPT_FLOAT, 60*60*24*7) // once a week
> OPTION(osd_scrub_load_threshold, OPT_FLOAT, 0.5)
>
> That is, if the load is < .5 (probably almost never on a real cluster) it
> will scrub every day, otherwise (regardless of load) it will scrub each PG
> at least once a week.
>
> Several things we can do here:
>
> 1- Maybe the shallow scrub interval should be less than the deep scrub
> interval?
>
> 2- There is a new feature for hammer that limits scrub to certain times of
> day, contributed by Xinze Chi:
>
> OPTION(osd_scrub_begin_hour, OPT_INT, 0)
> OPTION(osd_scrub_end_hour, OPT_INT, 24)
>
> That is, by default, scrubs can happen at any time. You can use this to
> limit to certain hours of the night, or whatever is appropriate for your
> cluster. That only sort of helps, though; Simon's scrub frenzy will still
> happen one day a week, all at once (or maybe spread over 2 nights).
>
> 3- We can spread them out during the allowed window. But how to do that?
> We could make the scrub interval randomly +/- a value of up to 50% of the
> total interval. Or we could somehow look at the current rate of scrubbing
> (average time to completion for the current pool, maybe).. any look at
> the total number of items in the scrub queue?
>
> 4- Ric pointed out to me that even if we spread these out, scrubbing at
> full speed has an impact. Even if we do all the prioritization magic we
> can there will still be a bunch of large IOs in the queue. What if we have
> a hard throttle on the scrub rate, objects per second and/or bytes per
> second?
> In the end the same number of IOs traverse the queue and
> potentially interfere with client IO, but they would be spread out over a
> longer period of time and be less noticeable (i.e., slow down client IOs
> from different workloads and not all the same workload). I'm not totally
> convinced this is an improvement over a strategy where we have only 1
> scrub IO in flight at all times, but that isn't quite how scrub schedules
> itself so it's hard to compare it that way, and in the end the user
> experience and perceived impact should be lower...
>
> 5- Auto-adjust the above scrub rate based on the total amount of data,
> scrub interval, and scrub hours so that we are scrubbing at the slowest
> rate possible that meets the schedule. We'd have to be slightly clever to
> have the right feedback in place here...
>
> Thoughts?
> sage
> --
> To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
> the body of a message to majordomo@xxxxxxxxxxxxxxx
> More majordomo info at http://vger.kernel.org/majordomo-info.html
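
For illustration only, here is a rough sketch of the arithmetic behind items 3 and 5 of the quoted message: a random offset of up to 50% on the scrub interval, and the slowest cluster-wide scrub rate that still meets a given interval within the allowed hours. This is not Ceph code. The function names (allowed_hours, min_scrub_rate, jittered_interval) are hypothetical, the treatment of a window that wraps past midnight is an assumption, and the example figure treats "2PB, 70% full" as 1.4e15 bytes to be read, ignoring replication.

```python
import random

SECONDS_PER_DAY = 24 * 60 * 60


def allowed_hours(begin_hour=0, end_hour=24):
    """Hours per day in which scrubbing is permitted.

    Assumption: a window with begin_hour > end_hour wraps past midnight,
    and a zero-length window (e.g. 0..24 or 5..5) means "any time".
    """
    hours = (end_hour - begin_hour) % 24
    return hours if hours else 24


def min_scrub_rate(total_bytes, interval_seconds, begin_hour=0, end_hour=24):
    """Slowest rate (bytes/s) that reads total_bytes once per interval,
    counting only the seconds that fall inside the allowed window."""
    usable_seconds = interval_seconds * allowed_hours(begin_hour, end_hour) / 24
    return total_bytes / usable_seconds


def jittered_interval(base_seconds, max_fraction=0.5, rng=random):
    """Base interval shifted by a uniform random offset of up to
    +/- max_fraction of the interval (item 3 suggests up to 50%)."""
    return base_seconds * (1 + rng.uniform(-max_fraction, max_fraction))


if __name__ == "__main__":
    data_bytes = 0.7 * 2e15  # assumed: decimal PB, replication ignored
    for days in (7, 30):
        rate = min_scrub_rate(data_bytes, days * SECONDS_PER_DAY)
        print(f"{days:2d}-day interval: {rate / 1e9:.2f} GB/s cluster-wide")
    print(f"jittered weekly interval: {jittered_interval(7 * SECONDS_PER_DAY):.0f} s")
```

Comparing the 7-day and 30-day figures against what the cluster can actually sustain with osd_max_scrubs = 1 would be one way to judge whether a weekly deep scrub is realistic for a given deployment.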