Re: Deep Scrub distribution

I'm pretty sure I put up one of those scripts in the past.  Basically what we did was set our scrub cycle to something like 40 days, then sort all PGs by the last time they were deep scrubbed.  We grab the oldest 1/30 of those PGs and tell them to deep-scrub manually, and the next day we do it again.  After a month or so, your PGs should be fairly evenly spaced out over 30 days.  With those numbers you could disable the cron that runs the deep scrubs for a maintenance window of up to 10 days every 40 days and still scrub all of your PGs during that time.
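
Something along these lines, run nightly from cron, is the gist of it -- a rough sketch, not the original script.  It assumes the pgid is column 1 and the LAST_DEEP_SCRUB_STAMP date/time are columns 20/21 of `ceph pg dump` on your release, so check those against your output first:

#!/bin/bash
# Rough sketch of the smearing approach: deep-scrub the oldest ~1/30 of PGs.
# Column numbers ($1 = pgid, $20/$21 = LAST_DEEP_SCRUB_STAMP date/time) are an
# assumption -- verify them against `ceph pg dump` on your version.
DIVISOR=30

# How many active PGs are there, and how big is tonight's batch?
total=$(ceph pg dump 2>/dev/null | awk '/active/{n++} END{print n}')
batch=$(( (total + DIVISOR - 1) / DIVISOR ))

# Sort PGs oldest-deep-scrub-first and kick off a deep scrub on the oldest batch.
ceph pg dump 2>/dev/null \
    | awk '/active/{print $20"T"$21, $1}' \
    | sort \
    | head -n "$batch" \
    | awk '{print $2}' \
    | while read -r pgid; do
          ceph pg deep-scrub "$pgid"
      done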

On Mon, Mar 5, 2018 at 2:00 PM Gregory Farnum <gfarnum@xxxxxxxxxx> wrote:
On Mon, Mar 5, 2018 at 9:56 AM Jonathan D. Proulx <jon@xxxxxxxxxxxxx> wrote:
Hi All,

I've recently noticed my deep scrubs are EXTREMELY poorly
distributed.  They are starting within the 18->06 local time start/stop
window, but they are not distributed over enough days, nor well
distributed over the range of days they do cover.

root@ceph-mon0:~# for date in `ceph pg dump | awk '/active/{print $20}'`; do date +%D -d $date; done | sort | uniq -c
dumped all
      1 03/01/18
      6 03/03/18
   8358 03/04/18
   1875 03/05/18

So very nearly all 10240 PGs scrubbed last night / this morning.  I've
been kicking this around for a while, since I noticed poor distribution
over a 7 day range back when I was really pretty sure I'd changed that
from the 7d default to 28d.

Tried kicking it out to 42 days about a week ago with:

ceph tell osd.* injectargs '--osd_deep_scrub_interval 3628800'


There were many errors suggesting it could not reread the change and I'd
need to restart the OSDs, but 'ceph daemon osd.0 config show |grep
osd_deep_scrub_interval' showed the right value, so I let it roll for a
week.  The scrubs did not spread out, though.

So Friday I set that value in ceph.conf and did rolling restarts of
all OSDs, then double-checked the running value on all daemons.
Checking Sunday, the nightly deep scrubs (based on the LAST_DEEP_SCRUB
voodoo above) showed near enough 1/42nd of the PGs scrubbed Saturday
night that I thought this was working.
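
For reference, the persistent change amounts to nothing more than the interval in seconds in ceph.conf (shown under [osd] here, though [global] works too):

[osd]
# 42 days, i.e. 42 * 86400 seconds
osd_deep_scrub_interval = 3628800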

This morning I checked again and got the results above.

I would expect that after changing to a 42d scrub cycle I'd see approximately
1/42 of the PGs deep scrub each night until there was a roughly even
distribution over the past 42 days.

So which thing is broken my config or my expectations?

Sadly, changing the interval settings does not directly change the scheduling of deep scrubs. Instead, it merely influences whether a PG will get queued for scrub when it is examined as a candidate, based on how out-of-date its scrub is. (That is, nothing holistically goes "I need to scrub 1/n of these PGs every night"; there's a simple task that says "is this PG's last scrub more than n days old?")
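
As a rough shell rendering of that per-PG check (the real logic lives inside the OSD; the jq path into `ceph pg query` output here is an assumption, so treat this as illustrative only):

#!/bin/bash
# Illustrative only: the question the scrub scheduler effectively asks per PG.
# Assumes jq is installed and that .info.stats.last_deep_scrub_stamp is where
# your release reports the stamp in `ceph pg <pgid> query` output.
pgid="1.2a"        # hypothetical PG id
interval=3628800   # osd_deep_scrub_interval, 42 days in seconds

stamp=$(ceph pg "$pgid" query | jq -r '.info.stats.last_deep_scrub_stamp')
age=$(( $(date +%s) - $(date +%s -d "${stamp%.*}") ))

if [ "$age" -gt "$interval" ]; then
    echo "$pgid is overdue: last deep scrub was ${age}s ago"
else
    echo "$pgid is not due yet"
fi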

Users have shared various scripts on the list for setting up a more even scrub distribution by fiddling with the settings and poking at specific PGs to try and smear them out over the whole time period; I'd check archives or google for those. :)
-Greg
_______________________________________________
ceph-users mailing list
ceph-users@xxxxxxxxxxxxxx
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
