Re: RFC: (deep-)scrub manager module

Hi all,

I guess moving scrubbing to the MGRs is quite an exercise if it's currently scheduled by the OSDs. With the current functionality, simply having a manager module issue "ceph osd deep-scrub" commands will only ask the OSD to schedule a scrub, not execute it in a specific order at specific times.

From my personal experience, this effort might be a bit too much and a few simple changes might be enough. I don't see any performance issues with executing scrubs on our cluster and I wonder under what circumstances such issues occur. I would assume that scrub IO is part of the performance planning. My guess is that pools where deep scrub impacts performance tend to have too few PGs (too many objects per PG). On small PGs, a deep scrub usually finishes quickly, even if it needs to be repeated due to client IO causing a pre-emption (osd_scrub_max_preemptions).
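To put a rough number on the "too few PGs" point: the data volume per PG, and hence the time a single deep scrub occupies its OSDs, scales inversely with pg_num. A small back-of-the-envelope calculation, with entirely hypothetical pool size and scrub read rate:

# Back-of-the-envelope estimate (hypothetical numbers, not from this thread):
# time to deep-scrub a single PG as a function of pg_num.
def hours_per_pg_deep_scrub(pool_bytes, pg_num, scrub_mib_per_s=50):
    """Assumes the given sequential read rate for the scrub."""
    pg_bytes = pool_bytes / pg_num
    return pg_bytes / (scrub_mib_per_s * 1024**2) / 3600

# A 100 TiB pool with 512 vs 2048 PGs:
for pg_num in (512, 2048):
    print(pg_num, "PGs:", round(hours_per_pg_deep_scrub(100 * 1024**4, pg_num), 2), "h per PG")

With four times the PGs, each individual deep scrub takes a quarter of the time, so a pre-empted scrub also loses much less work when it is restarted.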

What I'm missing, though, is a bit more systematic approach to scrubbing along the lines of what Anthony is suggesting. Right now it looks like PGs are chosen randomly instead of "least recently scrubbed first". This matters because it makes scrub completion unpredictable. Random number generators do not produce a perfect equidistribution, with the effect that some PGs will have to wait several almost-complete cluster-scrub cycles before being deep-scrubbed eventually. This is probably a main reason why "not deep scrubbed in time" messages pop up: it's a few PGs that have to wait forever.
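A toy simulation (plain Python, not Ceph code) of this effect: with N PGs and one scrub issued per tick, uniform random selection needs roughly N*ln(N) ticks on average before the last PG has been scrubbed once, while "least recently scrubbed first" finishes every PG in exactly N ticks.

import random

def ticks_until_all_scrubbed(num_pgs, pick):
    # last_scrub[pg] is None until the PG has been scrubbed at least once
    last_scrub = {pg: None for pg in range(num_pgs)}
    ticks = 0
    while any(ts is None for ts in last_scrub.values()):
        ticks += 1
        last_scrub[pick(last_scrub)] = ticks
    return ticks

random_pick  = lambda last: random.randrange(len(last))
oldest_first = lambda last: min(last, key=lambda pg: (last[pg] is not None, last[pg] or 0))

n = 1000
print("random selection:    ", ticks_until_all_scrubbed(n, random_pick), "ticks")
print("least recently first:", ticks_until_all_scrubbed(n, oldest_first), "ticks")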

This unpredictability is particularly annoying if you need to wait for all PGs to be deep-scrubbed when upgrading over several versions of Ceph. It is basically not possible to say that a cluster with N PGs and certain scrub settings will complete a full deep scrub of all PGs after so many days.

From my perspective, adding this level of predictability would be an interesting improvement and possibly simple to implement. Instead of doing (deep-)scrubs randomly, one could consider a more systematic strategy of always queueing the least-recently deep-scrubbed (active) PGs first. This alone should make it possible to reduce scrub frequency, because completing a full cycle no longer implies that a majority of PGs gets scrubbed several times before the last one is picked. Plus, the time to complete a full scrub of a healthy cluster becomes something one can more or less compute as a number of days. One could just look at the top of the queue and query the oldest scrub timestamp to get a reliable estimate. Per-device-type scrub frequency should then be sufficient to adapt to device performance or other parameters, like failure probability.
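A minimal sketch of such a queue (assumed data structures, not existing Ceph code), including the completion estimate one can read off its head:

import heapq

class DeepScrubQueue:
    """Global queue of PGs ordered by last deep-scrub timestamp (oldest first)."""

    def __init__(self, pg_last_scrub):
        # pg_last_scrub: {pgid: last deep-scrub timestamp, e.g. unix time}
        self.heap = [(ts, pgid) for pgid, ts in pg_last_scrub.items()]
        heapq.heapify(self.heap)

    def next_pgs(self, n):
        # always the n least recently deep-scrubbed PGs
        return [pgid for ts, pgid in heapq.nsmallest(n, self.heap)]

    def oldest_timestamp(self):
        # head of the queue: a reliable handle on how far behind the cluster is
        return self.heap[0][0]

    def days_to_complete_cycle(self, scrubs_per_day):
        # predictable completion time for a healthy cluster at a steady scrub rate
        return len(self.heap) / scrubs_per_day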

It's basically the first two single-starred points of Anthony's list, and I would be OK if it's still the OSDs doing this scheduling, as long as the scrub timestamp is treated as a global priority. That is, every primary OSD holds a few of its least-recently scrubbed PGs in a global queue ordered by scrub timestamp; the top PGs are picked, or kept at the top of the queue only if starting them would violate the osd_max_scrubs setting. As an extension, one could introduce scrub queues per device class. Using config option masks (https://docs.ceph.com/en/quincy/rados/configuration/ceph-conf/#configuration-sections) it is already possible to configure scrub options by device class, which I personally consider sufficient.
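A hedged sketch of the picking rule described above (the inputs and names are assumptions, not Ceph internals): walk the global queue oldest-first and start a PG only if none of its OSDs would exceed osd_max_scrubs; otherwise it simply stays at the top until a slot frees up. Per-device-class queues would just be several instances of this, each with its own interval.

from collections import Counter

def pick_scrubs(queue, acting_sets, running, osd_max_scrubs=1):
    # queue: [(last_deep_scrub_ts, pgid)] sorted ascending (oldest first)
    # acting_sets: {pgid: [osd ids]}; running: Counter of active scrubs per OSD
    started = []
    for ts, pgid in queue:
        osds = acting_sets[pgid]
        if all(running[osd] < osd_max_scrubs for osd in osds):
            for osd in osds:
                running[osd] += 1
            started.append(pgid)
        # else: the PG keeps its place at the top of the queue and is
        # retried on the next pass, so it cannot be overtaken forever
    return started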

Best regards,
=================
Frank Schilder
AIT Risø Campus
Bygning 109, rum S14

________________________________________
From: Anthony D'Atri <anthony.datri@xxxxxxxxx>
Sent: 19 June 2022 12:40:07
To: Ceph Developers
Cc: ceph-users
Subject:  Re: RFC: (deep-)scrub manager module

I requested a “scrubd” back before mgr was a thing, just sayin’ ;) Those of you who didn’t run, say, Dumpling or Firefly don’t know what you missed. Part of the problem has always been that OSDs — and not mons — schedule scrubs, so they are by nature solipsists and cannot orchestrate among each other.

We got the randomize_ratio instead, which did debulk the clumping issue I struggled with in Dumpling / Firefly, when at times I had to run while loops that would try to catch up by keeping at most N scrubs going at once.

I was partly inspired by the “planner” of the AMANDA backup system, which adaptively spreads full and incremental backups around the configured backup window / interval.

My idea was to have the daemon operate in the familiar tick fashion, highly configurable:

* Separate queues for shallow and deep scrubs, if desired, though in theory shallow scrubs are inexpensive enough to leave be
* Sort the list of PGs at startup (or on each tick), by last scrub timestamp
* On each tick, issue the next or first N scrubs (configurable) constrained by optional criteria:
** Max per OSD
** Max per drive (new, since back then the idea of >1 OSD per drive was heresy)
** Max per HBA
** Max per host
** Max absolute number of OSDs (or PGs) across the whole cluster (@)
** Max percentage of OSDs (or PGs) across the whole cluster
** Thresholds of host load average, drive/OSD latency or %util — this was before I knew that %util is of limited meaning on SAS/SATA SSDs and pretty much meaningless on NVMe
** Threshold for number of slow requests per OSD, per host, or across the cluster
** Threshold for client read/write throughput
** Threshold for client IOPs
** Any OSDs or hosts down

The idea being to allow operators to pick the criteria that are most important to their specific deployment.
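For illustration, a rough sketch (Python, entirely assumed inputs) of what one tick of such a daemon could look like, applying the sort-by-last-scrub order and a subset of the caps listed above:

from collections import Counter

def tick(pgs, max_per_tick, max_per_osd, max_per_host, load_ok):
    # pgs: [(last_scrub_ts, pgid, [osd ids], [host names])] -- assumed inputs
    if not load_ok():          # e.g. load average / slow-request / client-IO thresholds
        return []
    per_osd, per_host = Counter(), Counter()
    issued = []
    for _, pgid, osds, hosts in sorted(pgs):           # least recently scrubbed first
        if len(issued) >= max_per_tick:
            break
        if any(per_osd[o] >= max_per_osd for o in osds):
            continue
        if any(per_host[h] >= max_per_host for h in hosts):
            continue
        per_osd.update(osds)
        per_host.update(hosts)
        issued.append(pgid)    # here the daemon would actually request the scrub
    return issued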


>> As long as we have a single PG scrub executing we can run scrubs on all the non-scrubbed OSD for free (since we already pay for the performance degradation)

I might make that an option, but I wouldn’t be comfortable with it as a default.  There are for example multi-device OSD scenarios where this might bottleneck, and AIUI that would have a shotgun effect on client impact.


>> For example, assume we have 100 OSDs and replica 3, we would like that when scrub runs we will have 33 PGs scrubbed simultaneously as long as no OSD appears in more than 1 PG so from OSD perspective 99 OSDs will execute scrub simultaneously (we can't get to 100 with 1 scrub only with 3 simultaneous scrubs per OSD).
>
> Yes, for sure this is an interesting scrub strategy I hadn't thought about. With the reasoning behind it: make sure you scrub as many PGs as you can when allowed to, not wasting time (as long as it's evenly distributed among OSDs). Do I get that right?

Don’t the existing max_*scrubs_per_osd options already do part of that, so that this is tantamount to my @ above?
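For what it's worth, the quoted strategy can be illustrated with a greedy pass over the PG list (toy code, assumed inputs): keep adding PGs whose acting sets are disjoint from everything already selected; with 100 OSDs and replica 3 that caps out at 33 concurrent scrubs, as stated above.

def disjoint_pg_batch(pg_acting_sets):
    # pg_acting_sets: {pgid: set of OSD ids}
    busy_osds, batch = set(), []
    for pgid, osds in pg_acting_sets.items():
        if not (osds & busy_osds):   # no OSD of this PG is already scrubbing
            busy_osds |= osds
            batch.append(pgid)
    return batch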

>
>> Such a plan, with the other policies described (starting with the oldest scrubbed OSDs) should create an optimal plan when all the OSDs are symmetrical (same capacity and technology). Improving it for different capacities and technologies is an interesting exercise for future phases.
>
> Indeed. First start with the most simple case. But good to know beforehand so the actual implementation can anticipate future improvements.
>
>> One last point - we may want different priorities per pool (one pool requires weekly scrubs and another monthly scrubs), this should also be part of the scheduling algorithm.
>
> As you mention priorities: should it have some sort of fairness algorithm that avoids situations where a pool might not be scrubbed at all because of the constraints imposed? I can imagine a heavily loaded cluster, with spinning disks

Spinners are SOOO twenty-oughts ;)

> where a high priority pool might get scrubbed, but a lower priority pool might not. In that case you might want to scrub PGs from the lower priority pool every x period to avoid the pool never being scrubbed at all. This might make things (overly) complicated though.

Would policy of always issuing the least-recently-scrubbed PG next avoid this starvation?
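One possible way to do that, sketched with assumed inputs (this is not what Ceph currently does): instead of sorting by raw age, sort by how overdue each PG is relative to its pool's own interval. High-priority pools get scrubbed more often, but every pool's urgency keeps growing, so none can starve indefinitely.

def overdue_order(pgs, pool_interval_days, now):
    # pgs: [(pgid, pool, last_deep_scrub_ts in days)]
    # pool_interval_days: {pool: desired deep-scrub interval in days}
    def urgency(entry):
        pgid, pool, last = entry
        return (now - last) / pool_interval_days[pool]   # > 1.0 means past its deadline
    return sorted(pgs, key=urgency, reverse=True)        # most overdue first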

>
> Gr. Stefan
> _______________________________________________
> Dev mailing list -- dev@xxxxxxx
> To unsubscribe send an email to dev-leave@xxxxxxx

_______________________________________________
ceph-users mailing list -- ceph-users@xxxxxxx
To unsubscribe send an email to ceph-users-leave@xxxxxxx



