Re: RFC: (deep-)scrub manager module

"Anthony D'Atri" <anthony.datri@xxxxxxxxx> · Sun, 19 Jun 2022 03:40:07 -0700

I requested a “scrubd” back before mgr was a thing, just sayin’ ;) Those of you who didn’t run, say, Dumpling or Firefly don’t know what you missed. Part of the problem has always been that OSDs — and not mons — schedule scrubs, so they are by nature solipsists and cannot orchestrate among each other.

We got the randomize_ratio instead, which did debulk the clumping issue I struggled with in Dumpling / Firefly, when at times I had to run while loops that would try to catch up by keeping at most N scrubs going at once.

I was partly inspired by the “planner” of the AMANDA backup system, which adaptively spreads full and incremental backups around the configured backup window / interval.

My idea was to have the daemon operate in the familiar tick fashion, highly configurable:

* Separate queues for shallow and deep scrubs, if desired, though in theory shallow scrubs are inexpensive enough to leave be
* Sort the list of PGs at startup (or on each tick), by last scrub timestamp
* On each tick, issue the next or first N scrubs (configurable) constrained by optional criteria:
** Max per OSD
** Max per drive (new, since back then the idea of >1 OSD per drive was heresy)
** Max per HBA
** Max per host
** Max absolute number of OSDs (or PGs) across the whole cluster (@)
** Max percentage of OSDs (or PGs) across the whole cluster
** Thresholds of host load average, drive/OSD latency or %util — this was before I knew that %util is of limited meaning on SAS/SATA SSDs and pretty much meaningless on NVMe
** Threshold for number of slow requests per OSD, per host, or across the cluster
** Threshold for client read/write throughput
** Threshold for client IOPs
** Any OSDs or hosts down

The idea being to allow operators to pick the criteria that are most important to their specific deployment.
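
A minimal sketch of one such tick, in Python. Every name here (PG, pick_scrubs, max_per_osd, max_per_host, max_total) is hypothetical and made up for illustration; none of it is an existing Ceph or mgr API. It only covers a few of the criteria above (max per OSD, max per host, max absolute number of PGs) and treats "max per host" as the number of scrubbing OSDs on that host, which is an assumption:

```python
from collections import Counter
from dataclasses import dataclass


@dataclass(frozen=True)
class PG:
    pgid: str
    last_scrub: float   # epoch seconds of the last completed scrub
    osds: tuple         # acting set, as OSD ids


def pick_scrubs(pgs, osd_host, running=(), n=4,
                max_per_osd=1, max_per_host=None, max_total=None):
    """Return the PGs to scrub on this tick, least recently scrubbed first.

    All limits are hypothetical knobs; None disables a limit.
    `running` holds PGs already scrubbing, `osd_host` maps OSD id -> host.
    """
    per_osd = Counter(o for pg in running for o in pg.osds)
    per_host = Counter(osd_host[o] for pg in running for o in pg.osds)
    busy = {pg.pgid for pg in running}
    total = len(running)
    chosen = []
    for pg in sorted(pgs, key=lambda p: p.last_scrub):
        if len(chosen) >= n:
            break  # issued the configured N for this tick
        if max_total is not None and total >= max_total:
            break  # cluster-wide cap reached
        if pg.pgid in busy:
            continue
        if max_per_osd is not None and any(
                per_osd[o] >= max_per_osd for o in pg.osds):
            continue  # some OSD in the acting set is already at its cap
        hosts = Counter(osd_host[o] for o in pg.osds)
        if max_per_host is not None and any(
                per_host[h] + c > max_per_host for h, c in hosts.items()):
            continue  # would push a host over its cap
        chosen.append(pg)
        total += 1
        per_osd.update(pg.osds)
        per_host.update(hosts)
    return chosen
```

The load, latency, slow-request and throughput thresholds would slot in as further checks before the loop (skip the whole tick) or inside it (skip one PG), depending on whether the metric is cluster-wide or per OSD/host.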

>> As long as we have a single PG scrub executing we can run scrubs on all the non-scrubbed OSD for free (since we already pay for the performance degradation)

I might make that an option, but I wouldn’t be comfortable with it as a default. There are, for example, multi-device OSD scenarios where this might bottleneck, and AIUI that would have a shotgun effect on client impact.

>> For example, assume we have 100 OSDs and replica 3, we would like that when scrub runs we will have 33 PGs scrubbed simultaneously as long as no OSD appears in more than 1 PG so from OSD perspective 99 OSDs will execute scrub simultaneously (we can't get to 100 with 1 scrub only with 3 simultaneous scrubs per OSD).
> 
> Yes, for sure this is an interesting scrub strategy I hadn't thought about. With the reasoning behind it: make sure you scrub as many PGs as you can when allowed to, not wasting time (as long as it's evenly distributed among OSDs). Do I get that right?

Don’t the existing max_*scrubs_per_osd options already do part of that, so that this is tantamount to my @ above?
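
For what it's worth, the "no OSD appears in more than 1 PG" plan from the quoted example can be sketched as a greedy pass over PGs sorted by last scrub time. This is an illustration only: disjoint_scrub_set and the tuple layout are made-up names, and a greedy pass is not guaranteed to reach the theoretical maximum (33 PGs for 100 OSDs at replica 3), since that depends on how the acting sets overlap:

```python
def disjoint_scrub_set(pgs):
    """Greedily pick PGs, oldest scrub first, so that no OSD is used twice.

    `pgs` is an iterable of (pgid, last_scrub, osds) tuples (hypothetical layout).
    """
    busy = set()
    chosen = []
    for pgid, _last_scrub, osds in sorted(pgs, key=lambda p: p[1]):
        if busy.isdisjoint(osds):
            chosen.append(pgid)
            busy.update(osds)
    return chosen


# Upper bound from the quoted example: 100 OSDs, replica 3.
num_osds, replicas = 100, 3
max_concurrent_pgs = num_osds // replicas        # 33 PGs
busy_osds = max_concurrent_pgs * replicas        # 99 OSDs scrubbing at once
```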

> 
>> Such a plan, with the other policies described (starting with the oldest scrubbed OSDs) should create an optimal plan when all the OSDs are symmetrical (same capacity and technology). Improving it for different capacities and technologies is an interesting exercise for future phases.
> 
> Indeed. First start with most simple case. But good to know beforehand so the actual implementation can anticipate on future improvements.
> 
>> One last point - we may want different priorities per pool (one pool requires weekly scrubs and another monthly scrubs), this should also be part of the scheduling algorithm.
> 
> As you mention priorities: should it have some sort of fairness algorithm that avoids situations where a pool might not be scrubbed at all because of the constraints imposed? I can imagine a heavily loaded cluster, with spinning disks

Spinners are SOOO twenty-oughts ;)

> where a high priority pool might get scrubbed, but a lower priority pool might not. In that case you might want to scrub PGs from the lower priority pool every x period to avoid the pool never being scrubbed at all. This might make things (overly) complicated though.

Would a policy of always issuing the least-recently-scrubbed PG next avoid this starvation?
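
A toy simulation of that policy, assuming one scrub is issued and completes per tick and no constraint ever skips a PG (both are simplifying assumptions, and simulate_lrs is a made-up name). Under those assumptions every PG comes up once per cycle regardless of pool; once per-pool priorities or load thresholds can skip PGs, that bound no longer holds automatically:

```python
def simulate_lrs(last_scrub, ticks):
    """Scrub the least-recently-scrubbed PG on every tick.

    `last_scrub` maps pgid -> timestamp; returns pgids in the order scrubbed.
    """
    stamps = dict(last_scrub)
    now = max(stamps.values()) + 1
    order = []
    for _ in range(ticks):
        # Oldest timestamp wins; pgid breaks ties deterministically.
        pgid = min(stamps, key=lambda k: (stamps[k], k))
        order.append(pgid)
        stamps[pgid] = now   # mark as freshly scrubbed
        now += 1
    return order


# Two PGs from a "high priority" pool and one from a "low priority" pool.
order = simulate_lrs({"hi.0": 5, "hi.1": 1, "lo.0": 3}, ticks=6)
```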

> 
> Gr. Stefan
> _______________________________________________
> Dev mailing list -- dev@xxxxxxx
> To unsubscribe send an email to dev-leave@xxxxxxx
