Re: [ceph-users] Re: RFC: (deep-)scrub manager module

Replying to an older email in this thread.

Yes, for sure this is an interesting scrub strategy I hadn't thought
about. With the reasoning behind it: make sure you scrub as many PGs as
you can when allowed to, not wasting time (as long as it's evenly
distributed among the OSDs). Do I get that right?

Yes - this is what I meant

As you mention priorities: should it have some sort of fairness
algorithm that avoids situations where a pool might not be scrubbed at
all because of the constraints imposed? I can imagine a heavily loaded
cluster, with spinning disks, where a high priority pool might get
scrubbed, but a lower priority pool might not. In that case you might
want to scrub PGs from the lower priority pool every x period to avoid
the pool never being scrubbed at all. This might make things (overly)
complicated though.

I think this was answered, but "priority" may have been misleading. I meant different frequencies, so one pool is scrubbed every week and another every month. When you select the PGs for scrubbing you use the date of the latest scrub, so this will increase the priority of the pools with a lower frequency, and I believe it will prevent starvation in all cases. I still need to understand what happens when the scrub frequency is, for example, a week, and the process (with all the constraints, such as running only at limited times) takes more than a week. Even then, I believe that if at any time we look only at the PGs which were not scrubbed for the longest time, we will prevent starvation.
If we allow only one scrub per OSD, the way I see the scheduling algorithm is something like:
initial state, no scrub executing:
OSD_group = all OSDs
selected = true
while (selected):
  selected = false
  try to select the PG which is marked for scrubbing, was not scrubbed
  for the longest period, and has all of its PG_OSDs in OSD_group
  if a PG was selected:
    mark it for scrubbing
    remove its PG_OSDs from OSD_group
    selected = true
  end if
end while
wait until a scrub_complete event and restart the while loop

// the scrub_complete event handler should add the PG's OSDs back to OSD_group
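The loop above can be sketched in Python roughly as follows. This is only an illustration of the selection logic, not Ceph's actual scheduler; the function name and data shapes are made up for the example.

```python
# Illustrative sketch of the greedy selection loop above (not Ceph code):
# pick the longest-unscrubbed eligible PG whose OSDs are all free, reserve
# those OSDs, and repeat until no further PG fits.

def select_scrubs(pgs, all_osds):
    """pgs: iterable of (pg_id, last_scrub_stamp, set_of_pg_osds).
    Returns the PG ids to start scrubbing now; each OSD runs at most one scrub."""
    osd_group = set(all_osds)          # OSDs with no scrub running
    scheduled = []
    # oldest last-scrub timestamp first, which also prevents starvation
    for pg_id, last_scrub, pg_osds in sorted(pgs, key=lambda p: p[1]):
        if pg_osds <= osd_group:       # all of this PG's OSDs are still free
            scheduled.append(pg_id)
            osd_group -= pg_osds       # reserve them
    return scheduled

# On a scrub_complete event, the finished PG's OSDs go back into the free
# set and the selection runs again.
```

Note that the oldest-first ordering is what carries the starvation argument: a PG can be skipped while its OSDs are busy, but it only moves closer to the front of the sort order while it waits.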

Regards,

Josh


On Mon, Jun 20, 2022 at 12:13 PM Stefan Kooman <stefan@xxxxxx> wrote:
On 6/20/22 10:34, Frank Schilder wrote:
> Hi all,
>
> I guess moving scrubbing to the MGRs is quite an exercise if it's currently scheduled by OSDs. With the current functionality, simply having a manager module issue "ceph osd deep-scrub" commands will only ask the OSD to schedule a scrub, not execute it in a specific order at specific times.

We currently have a daemon that does just that: ask the OSD for a
(deep-)scrub. It has config options to tune the amount of concurrent
(deep-)scrubs. It's not a hard limit, as it might take some time before
a scrub is actually performed. So for short periods of time it might
overshoot this target. But it issues (deep-)scrubs based on oldest
timestamp, which is already a big win.
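Such a daemon's core decision could look roughly like the sketch below. The function name and the shape of the per-PG records are invented for illustration; a real daemon would build them by parsing `ceph pg dump` output and would then issue the actual (deep-)scrub commands.

```python
# Illustrative sketch of an external deep-scrub daemon's selection step:
# issue deep-scrubs oldest-timestamp-first, capped at a configured number
# of concurrent scrubs. The cap is soft, exactly as described above: scrubs
# already requested but not yet started are not visible as "scrubbing".

def pick_deep_scrub_candidates(pg_stats, max_concurrent):
    """pg_stats: list of dicts with 'pgid', 'last_deep_scrub_stamp',
    'scrubbing'. Returns pgids to request a deep-scrub for, oldest first."""
    running = sum(1 for pg in pg_stats if pg["scrubbing"])
    budget = max(0, max_concurrent - running)     # soft cap, may overshoot briefly
    idle = [pg for pg in pg_stats if not pg["scrubbing"]]
    idle.sort(key=lambda pg: pg["last_deep_scrub_stamp"])  # oldest stamp first
    return [pg["pgid"] for pg in idle[:budget]]
```

The oldest-timestamp sort is the "big win" mentioned above: it replaces the random pick with a deterministic least-recently-scrubbed order.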

>
>  From my personal experience, this effort might be a bit too much and a few simple changes might be enough. I don't see any performance issues with executing scrub on our cluster and I wonder under what circumstances such occur.

For one: when RocksDB performance is degraded (particularly on PGs that
hold a lot of OMAP). Then *scrubs* can be even more of a problem than
deep-scrubs. Something we learned about very recently.

I would assume that scrub IO is part of the performance planning. My
guess is that pools where deep scrub impacts performance tend to have
too few PGs (too many objects per PG). On small PGs, a deep scrub
usually finishes quickly, even if it needs to be repeated due to client
IO causing a pre-emption (osd_scrub_max_preemptions).

> What I'm missing though is a bit more systematic approach to scrubbing, in the same way that Anthony is suggesting. Right now it looks like PGs are chosen randomly instead of "least recently scrubbed first". This is important, because it makes scrub completion unpredictable. Random number generators do not produce a perfect equi-distribution, with the effect that some PGs will have to wait several almost-complete cluster-scrub cycles before being deep-scrubbed eventually. This is probably a main reason why "not deep scrubbed in time" messages pop up. It's a few PGs that have to wait forever.

That confirms our observation. This one is the easiest to solve, and
highest on the list.

> This unpredictability is particularly annoying if you need to wait for all PGs to be deep-scrubbed when upgrading across several versions of Ceph. It is basically not possible to say that a cluster with N PGs and certain scrub settings will complete a full deep-scrub of all PGs after so many days.

For such scenarios it would help a lot to have as many parallel scrubs,
evenly distributed, as possible to speed up this process (like Josh
suggested).
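A back-of-the-envelope calculation shows what such a completion estimate could look like once scheduling is oldest-first. All numbers here are invented for illustration:

```python
# Rough estimate of full deep-scrub completion time under oldest-first
# scheduling: every PG is scrubbed exactly once per cycle, so the cycle
# length is simply (num_pgs / parallel_scrubs) scrub slots.

def full_scrub_days(num_pgs, parallel_scrubs, minutes_per_pg):
    """Returns the estimated days to deep-scrub every PG once."""
    total_minutes = num_pgs / parallel_scrubs * minutes_per_pg
    return total_minutes / (60 * 24)

# e.g. 4096 PGs, 8 concurrent deep-scrubs, ~30 minutes per PG
# gives a cycle of roughly 10.7 days.
```

With random selection no such closed-form estimate exists, because a few unlucky PGs can lag several cycles behind, which is exactly the unpredictability described above.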

>
>  From my perspective, adding this level of predictability would be an interesting improvement and possibly simple to implement. Instead of doing (deep-)scrubs randomly, one could consider a more systematic strategy of always queueing the least-recently deep-scrubbed (active) PGs first. This should already allow one to reduce scrub frequency, because a complete scrub cycle no longer implies that a majority of PGs is scrubbed several times before the last one is picked. Plus, the completion of a full scrub of a healthy cluster becomes something one can more or less compute as a number of days. One could just look at the top of the queue and query the oldest scrub time stamp to have a reliable estimate. A per-device-type scrub frequency should then be sufficient to adapt to device performance or other parameters, like failure probability.

Scrub frequency with respect to how likely a device might fail is
interesting. Besides having the option to configure this per pool type
(importance of data).

>
> It's basically the first two single-starred points of Anthony, where I would be OK if it's still the OSDs doing this scheduling, as long as the scrub time stamp is treated as a global priority. That is, every primary OSD holds a few of its least-recently scrubbed PGs in a/the global queue ordered by scrub time stamp, and the top PGs are picked, or kept at the top of the queue if picking them would violate the osd_max_scrubs setting. As an extension, one could introduce scrub queues per device class. Using config option masks (https://docs.ceph.com/en/quincy/rados/configuration/ceph-conf/#configuration-sections) it is already possible to configure scrub options by device class, which I personally consider sufficient.

Good point. The pain point of the oldest-scrubbed PG not being the top
priority in the next scrub cycle can be fixed at the OSD level as well.
For us it was easiest to fix with a daemon. Not sure how easy it is to
fix in the OSD scrub code. It might not hurt to have it fixed in both
places though.

Thanks for your input!

Gr. Stefan
_______________________________________________
Dev mailing list -- dev@xxxxxxx
To unsubscribe send an email to dev-leave@xxxxxxx

