Re: Deep Scrub distribution

On Tue, Mar 06, 2018 at 03:48:30PM +0000, David Turner wrote:
:I'm pretty sure I put up one of those scripts in the past.  Basically what
:we did was set our scrub cycle to something like 40 days, then sort all PGs
:by the last time they were deep scrubbed.  We grab the oldest 1/30 of those
:PGs and tell them to deep-scrub manually, and the next day we do it again.
:After a month or so, your PGs should be fairly evenly spaced out over 30
:days.  With those numbers you could disable the cron that runs the
:deep-scrubs for maintenance for up to 10 days out of every 40 and still
:scrub all of your PGs during that time.

I think I had that script :)

But in Jewel (I believe it was Jewel) Ceph got smarter about spacing things
out and we ditched the cron job (though we probably still have a copy of
the script).

Now that we're on Luminous, things have bunched up again.  The main problem
is that they are bunched into 4 days or so, so there wouldn't be room for
the cron solution to work.

I have a theory on my potential mistake.  I had briefly dropped a zero from
the config, so things were scheduled for 4.2 days rather than 42.  I
"corrected" that and restarted all OSDs, but the 'mgr' processes still
showed the 4.2d config.
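
For reference, the arithmetic on the dropped zero (assuming the interval is
given in seconds, as in the injectargs call quoted below):

  # osd_deep_scrub_interval is expressed in seconds
  echo "scale=1; 3628800/86400" | bc   # 42.0 days (intended)
  echo "scale=1; 362880/86400"  | bc   # 4.2 days (what I briefly had)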

Which process actually decides to start scrubs: osd, mgr, or mon?

In any case I've just ensured all instances of all three are showing
the same value for osd_deep_scrub_interval.
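
In case it helps anyone else, this is roughly how I'm checking it on each
host: a quick loop over the local admin sockets (assuming the default
/var/run/ceph/*.asok paths):

  for sock in /var/run/ceph/ceph-*.asok; do
      echo -n "$sock: "
      # same 'config show' check as in my earlier mail, via the admin socket
      ceph --admin-daemon "$sock" config show | grep osd_deep_scrub_interval
  done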

I guess if we go from everything scrubbing to nothing scrubbing I'll dust
off the cron script (something like the sketch below) so things even out,
rather than just having the same pileup less frequently.
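
Roughly this, untested and from memory; it leans on the same pg dump column
voodoo as below, so the column numbers ($20/$21 for the deep-scrub stamp)
may well differ between versions:

  #!/bin/bash
  # Run nightly from cron: deep-scrub the ~1/30th of PGs with the oldest
  # deep-scrub timestamps, so they smear out over ~30 days.
  total=$(ceph pg dump 2>/dev/null | awk '/active/{n++} END{print n}')
  batch=$(( total / 30 ))
  ceph pg dump 2>/dev/null \
      | awk '/active/{print $20" "$21" "$1}' \
      | sort \
      | head -n "$batch" \
      | while read -r day time pg; do
            ceph pg deep-scrub "$pg"
        done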

Thanks,
-Jon


:On Mon, Mar 5, 2018 at 2:00 PM Gregory Farnum <gfarnum@xxxxxxxxxx> wrote:
:
:> On Mon, Mar 5, 2018 at 9:56 AM Jonathan D. Proulx <jon@xxxxxxxxxxxxx>
:> wrote:
:>
:>> Hi All,
:>>
:>> I've recently noticed my deep scrubs are EXTREMELY poorly
:>> distributed.  They are starting within the 18->06 local time start/stop
:>> window, but are not distributed over enough days, nor well distributed
:>> over the range of days they do cover.
:>>
:>> root@ceph-mon0:~# for date in `ceph pg dump | awk '/active/{print
:>> $20}'`; do date +%D -d $date; done | sort | uniq -c
:>> dumped all
:>>       1 03/01/18
:>>       6 03/03/18
:>>    8358 03/04/18
:>>    1875 03/05/18
:>>
:>> So very nearly all 10240 PGs scrubbed last night/this morning.  I've
:>> been kicking this around for a while, since I noticed poor distribution
:>> over a 7 day range when I was really pretty sure I'd changed that from
:>> the 7d default to 28d.
:>>
:>> Tried kicking it out to 42 days about a week ago with:
:>>
:>> ceph tell osd.* injectargs '--osd_deep_scrub_interval 3628800'
:>>
:>>
:>> There were many errors suggesting it could not reread the change and I'd
:>> need to restart the OSDs, but 'ceph daemon osd.0 config show | grep
:>> osd_deep_scrub_interval' showed the right value, so I let it roll for a
:>> week; but the scrubs did not spread out.
:>>
:>> So Friday I set that value in ceph.conf and did rolling restarts of
:>> all OSDs, then double-checked the running value on all daemons.
:>> Checking Sunday, the nightly deep scrubs (based on the LAST_DEEP_SCRUB
:>> voodoo above) showed near enough 1/42nd of PGs had been scrubbed
:>> Saturday night that I thought this was working.
:>>
:>> This morning I checked again and got the results above.
:>>
:>> I would expect that after changing to a 42d scrub cycle I'd see approx
:>> 1/42 of the PGs deep scrub each night until there was a roughly even
:>> distribution over the past 42 days.
:>>
:>> So which thing is broken: my config or my expectations?
:>>
:>
:> Sadly, changing the interval settings does not directly change the
:> scheduling of deep scrubs. Instead, it merely influences whether a PG will
:> get queued for scrub when it is examined as a candidate, based on how
:> out-of-date its scrub is. (That is, nothing holistically goes "I need to
:> scrub 1/n of these PGs every night"; there's a simple task that says "is
:> this PG's last scrub more than n days old?")
:>
:> Users have shared various scripts on the list for setting up a more even
:> scrub distribution by fiddling with the settings and poking at specific PGs
:> to try and smear them out over the whole time period; I'd check archives or
:> google for those. :)
:> -Greg

-- 
_______________________________________________
ceph-users mailing list
ceph-users@xxxxxxxxxxxxxx
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


