Re: [ceph-users] Re: RFC: (deep-)scrub manager module

On 6/20/22 10:34, Frank Schilder wrote:
> Hi all,
>
> I guess moving scrubbing to the MGRs is quite an exercise if it's currently scheduled by the OSDs. With the current functionality, simply having a manager module issue "ceph osd deep-scrub" commands will only ask the OSD to schedule a scrub, not execute it in a specific order at specific times.

We currently have a daemon that does just that: ask the OSDs for a (deep-)scrub. It has config options to tune the number of concurrent (deep-)scrubs. It's not a hard limit, as it might take some time before a scrub is actually performed, so for short periods it might overshoot this target. But it issues (deep-)scrubs based on the oldest timestamp, which is already a big win.
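
A minimal sketch of what such a daemon can look like (illustrative only, not our actual implementation; MAX_PARALLEL and POLL_SECONDS are made-up knobs, and the "pg dump" JSON layout varies between Ceph releases):

import json
import subprocess
import time

MAX_PARALLEL = 4    # soft cap on concurrent deep scrubs (made-up knob)
POLL_SECONDS = 60   # how often to re-evaluate the queue (made-up knob)

def ceph(*args):
    """Run a ceph CLI command and parse its JSON output."""
    out = subprocess.check_output(("ceph", *args, "--format", "json"))
    return json.loads(out)

def pg_stats():
    """Return the per-PG stats list; the JSON layout differs per release."""
    dump = ceph("pg", "dump", "pgs")
    if isinstance(dump, list):  # some releases return a bare array
        return dump
    return dump.get("pg_stats") or dump.get("pg_map", {}).get("pg_stats", [])

def main():
    while True:
        pgs = pg_stats()
        busy = sum(1 for p in pgs if "scrubbing" in p["state"])
        # Oldest deep-scrub timestamp first instead of random order.
        idle = sorted((p for p in pgs if p["state"] == "active+clean"),
                      key=lambda p: p["last_deep_scrub_stamp"])
        for p in idle[:max(0, MAX_PARALLEL - busy)]:
            # This only *requests* a deep scrub; the OSD decides when to
            # run it, which is why the cap can overshoot for a while.
            subprocess.check_call(["ceph", "pg", "deep-scrub", p["pgid"]])
        time.sleep(POLL_SECONDS)

if __name__ == "__main__":
    main()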


> From my personal experience, this effort might be a bit much, and a few simple changes might be enough. I don't see any performance issues from scrubbing on our cluster, and I wonder under what circumstances they occur.

For one: when RocksDB performance is degraded (particularly on PGs that hold a lot of OMAP). Then *scrubs* can be even more of a problem than deep-scrubs. Something we learned about very recently.

> I would assume that scrub IO is part of the performance planning. My guess is that pools where deep scrub impacts performance tend to have too few PGs (too many objects per PG). On small PGs, a deep scrub usually finishes quickly, even if it needs to be repeated because client IO caused a preemption (osd_scrub_max_preemptions).

> What I'm missing, though, is a somewhat more systematic approach to scrubbing, in the same way that Anthony is suggesting. Right now it looks like PGs are chosen randomly instead of "least recently scrubbed first". This is important, because it makes scrub completion unpredictable. Random selection does not produce a perfect equidistribution, with the effect that some PGs have to wait several almost-complete cluster-scrub cycles before eventually being deep-scrubbed. This is probably a main reason why "not deep-scrubbed in time" messages pop up: it's a few PGs that have to wait forever.

That confirms our observation. This one is the easiest to solve, and highest on the list.
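
To see how far behind the tail actually is on a live cluster, one can sort PGs by their deep-scrub timestamp, e.g. reusing the pg_stats() helper from the sketch above:

# Print the ten least-recently deep-scrubbed PGs, oldest first.
for p in sorted(pg_stats(), key=lambda p: p["last_deep_scrub_stamp"])[:10]:
    print(p["pgid"], p["last_deep_scrub_stamp"])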

> This unpredictability is particularly annoying if you need to wait for all PGs to be deep-scrubbed when upgrading across several versions of Ceph. It is basically impossible to say that a cluster with N PGs and certain scrub settings will complete a full deep-scrub of all PGs within so many days.

For such scenarios it would help a lot to have as many parallel scrubs as possible, evenly distributed, to speed up the process (like Josh suggested).


> From my perspective, adding this level of predictability would be an interesting improvement and possibly simple to implement. Instead of picking PGs for (deep-)scrub randomly, one could consider a more systematic strategy of always queueing the least-recently deep-scrubbed (active) PGs first. This alone should allow reducing the scrub frequency, because completing a full scrub cycle would no longer imply that a majority of PGs get scrubbed several times before the last one is picked. Plus, the completion of a full scrub of a healthy cluster becomes something one can more or less compute as a number of days. One could just look at the top of the queue and query the oldest scrub timestamp to get a reliable estimate. Per-device-type scrub frequency should then be sufficient to adapt to device performance or other parameters, like failure probability.
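
As a quick back-of-the-envelope that illustrates the predictability argument (all numbers made up):

# With oldest-first scheduling every PG is deep-scrubbed exactly once per
# cycle, so the cycle length becomes simple arithmetic.
n_pgs = 4096            # PGs in the cluster (example value)
parallel = 8            # concurrent deep scrubs the cluster sustains
hours_per_scrub = 0.5   # average deep-scrub duration (example value)
cycle_days = n_pgs * hours_per_scrub / parallel / 24
print(f"full deep-scrub cycle ~ {cycle_days:.1f} days")  # ~10.7 days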

Scrub frequency with respect to how likely a device might fail is interesting, as is having the option to configure this per pool type (importance of the data).


> It's basically Anthony's first two single-starred points, where I would be OK if it's still the OSDs doing the scheduling, as long as the scrub timestamp is treated as a global priority: every primary OSD holds a few of its least-recently scrubbed PGs in a global queue ordered by scrub timestamp; the top PGs are picked for scrubbing, and a PG is only kept waiting at the top of the queue if scrubbing it would violate the osd_max_scrubs setting. As an extension, one could introduce scrub queues per device class. Using config option masks (https://docs.ceph.com/en/quincy/rados/configuration/ceph-conf/#configuration-sections) it is already possible to configure scrub options by device class, which I personally consider sufficient.

Good point. The pain point of not having the oldest-scrubbed PG be the top priority in the next scrub cycle can be fixed at the OSD level as well. For us it was easiest to fix with a daemon; I am not sure how easy it is to fix in the OSD scrub code. It might not hurt to have it fixed in both places, though.
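
And agreed that config masks already cover per-device-class tuning. For example (illustrative values: deep-scrub SSD OSDs weekly, HDD OSDs every four weeks):

ceph config set osd/class:ssd osd_deep_scrub_interval 604800
ceph config set osd/class:hdd osd_deep_scrub_interval 2419200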

Thanks for your input!

Gr. Stefan
_______________________________________________
Dev mailing list -- dev@xxxxxxx
To unsubscribe send an email to dev-leave@xxxxxxx


