On Mon, Nov 16, 2015 at 6:13 PM, Sage Weil <sage@xxxxxxxxxxxx> wrote:
> On Mon, 16 Nov 2015, Dan van der Ster wrote:
>> On Mon, Nov 16, 2015 at 4:58 PM, Dan van der Ster <dan@xxxxxxxxxxxxxx> wrote:
>> > On Mon, Nov 16, 2015 at 4:32 PM, Dan van der Ster <dan@xxxxxxxxxxxxxx> wrote:
>> >> On Mon, Nov 16, 2015 at 4:20 PM, Sage Weil <sage@xxxxxxxxxxxx> wrote:
>> >>> On Mon, 16 Nov 2015, Dan van der Ster wrote:
>> >>>> Instead of keeping a 24hr loadavg, how about we allow scrubs whenever
>> >>>> the loadavg is decreasing (or below the threshold)? As long as the
>> >>>> 1min loadavg is less than the 15min loadavg, we should be ok to allow
>> >>>> new scrubs. If you agree I'll add the patch below to my PR.
>> >>>
>> >>> I like the simplicity of that, but I'm afraid it's going to just trigger a
>> >>> feedback loop and oscillations on the host. I.e., as soon as we see *any*
>> >>> decrease, all OSDs on the host will start to scrub, which will push the
>> >>> load up. Once that round of PGs finishes, the load will start to drop
>> >>> again, triggering another round. This'll happen regardless of whether
>> >>> we're in the peak hours or not, and the high-level goal (IMO at least) is
>> >>> to do scrubbing in non-peak hours.
>> >>
>> >> We checked our OSDs' 24hr loadavg plots today and found that the
>> >> original idea of 0.8 * 24hr loadavg wouldn't leave many chances for
>> >> scrubs to run. So maybe if we used 0.9 or 1.0 it would be doable.
>> >>
>> >> BTW, I realized there was a silly error in that earlier patch, and we
>> >> anyway need an upper bound, say # cpus. So until your response came I
>> >> was working with this idea:
>> >> https://stikked.web.cern.ch/stikked/view/raw/5586a912
>> >
>> > Sorry for SSO. Here:
>> >
>> > https://gist.github.com/dvanders/f3b08373af0f5957f589
>>
>> Hi again.
>> Here's a first shot at a daily loadavg heuristic:
>> https://github.com/ceph/ceph/commit/15474124a183c7e92f457f836f7008a2813aa672
>> I had to guess where it would be best to store the daily_loadavg
>> member and where to initialize it... please advise.
>>
>> I took the conservative approach of triggering scrubs when either:
>>    1m loadavg < osd_scrub_load_threshold, or
>>    1m loadavg < 24hr loadavg && 1m loadavg < 15m loadavg
>>
>> The whole PR would become this:
>> https://github.com/ceph/ceph/compare/master...cernceph:wip-deepscrub-daily
>
> Looks reasonable to me!
>
> I'm still a bit worried that the 1m < 15m thing will mean that on the
> completion of every scrub we have to wait ~1m before the next scrub
> starts. Maybe that's okay, though... I'd say let's try this and adjust
> that later if it seems problematic (conservative == better).
>
> sage

Great. I've updated the PR: https://github.com/ceph/ceph/pull/6550

Cheers, Dan
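
For readers skimming the thread, the scrub condition quoted above can be sketched in a few lines of Python. This is only an illustration of the two-part test as described in the message, not the code in the PR: the function names, the parameter names, and in particular the running-average update for the 24hr value are assumptions made for this sketch (the thread does not say how daily_loadavg is maintained).

```python
import os


def scrub_load_ok(load_1m, load_15m, daily_loadavg, load_threshold):
    """Return True if a new scrub may start under the heuristic from the thread.

    load_threshold stands in for osd_scrub_load_threshold; the other names
    are hypothetical and chosen for this sketch only.
    """
    # Condition 1: the host is quiet enough in absolute terms.
    if load_1m < load_threshold:
        return True
    # Condition 2: load is below the 24hr average AND currently decreasing
    # (1m loadavg below 15m loadavg).
    return load_1m < daily_loadavg and load_1m < load_15m


def update_daily_loadavg(daily_loadavg, sample, samples_per_day=24 * 60):
    """Fold one new sample into a running 24hr average.

    ASSUMPTION: a simple running mean sampled once per minute; the real
    patch may track the daily value differently.
    """
    return (daily_loadavg * (samples_per_day - 1) + sample) / samples_per_day


if __name__ == "__main__":
    # os.getloadavg() returns the 1, 5 and 15 minute load averages (Unix only).
    one, _five, fifteen = os.getloadavg()
    daily = fifteen  # no history yet, so seed with the 15m value (assumption)
    # 0.5 is just an example threshold for this demo.
    print(scrub_load_ok(one, fifteen, daily, load_threshold=0.5))
```

Note that with the second condition a scrub that has just finished will usually keep the 1m loadavg above the 15m loadavg for a short while, which is the ~1m wait Sage mentions.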