Re: Nautilus Scrub and deep-Scrub execution order

"Johannes L" <johannes.liebl@xxxxxxxx> · Tue, 15 Sep 2020 08:41:12 -0000

Robin H. Johnson wrote:
> On Mon, Sep 14, 2020 at 11:40:22AM -0000, Johannes L wrote:
> >  Hello Ceph-Users
> >  
> >  after upgrading one of our clusters to Nautilus we noticed the x pgs not
> > scrubbed/deep-scrubbed in time warnings.
> >  Through some digging we found out that it seems like the scrubbing takes place at random
> > and doesn't take the age of the last scrub/deep-scrub into consideration.
> >  I dumped the time of the last scrub with a 90 min gap in between:
> >  ceph pg dump | grep active | awk '{print $22}' | sort | uniq -c
> >  dumped all
> >     2434 2020-08-30
> >     5935 2020-08-31
> >     1782 2020-09-01
> >        2 2020-09-02
> >        2 2020-09-03
> >        5 2020-09-06
> >        3 2020-09-08
> >        5 2020-09-09
> >       17 2020-09-10
> >      259 2020-09-12
> >    26672 2020-09-13
> >    12036 2020-09-14
> >  
> >  dumped all
> >     2434 2020-08-30
> >     5933 2020-08-31
> >     1782 2020-09-01
> >        2 2020-09-02
> >        2 2020-09-03
> >        5 2020-09-06
> >        3 2020-09-08
> >        5 2020-09-09
> >       17 2020-09-10
> >       51 2020-09-12
> >    24862 2020-09-13
> >    14056 2020-09-14
> >  
> >  It is pretty obvious that the PGs that have been scrubbed a day ago have been scrubbed
> > again for some reason while ones that are 2 weeks old are basically left untouched.
> >  One way we are currently dealing with this issue is setting the osd_scrub_min_interval to
> > 72h to force the cluster to scrub the older PGs.
> >  This can't be intentional.
> >  Has anyone else seen this behavior?
> 
> Yes, this has existed for a long time; but the
> warnings are what's new.
> 
> - What's your workload? RBD/RGW/CephFS/???
> - Is there a pattern to which pools are behind?
> 
> At more than one job now, we've written some tooling that drove the
> oldest scrubs in addition to or instead of Ceph scheduling scrubs.
> 
> The one thing that absolutely stood out in that however, is some PGs
> that took much longer than others or never completed (and meant other
> PGs on those OSDs also got delayed). I never got to the bottom of why
> when I was at my last job, and it hasn't been priority enough at my
> current job for the one time we saw it (and it may have been a precursor to
> a disk failing).

We use our own interface on top of librados to talk to our cluster and store plain objects.
The workload is mostly write-once, read-many, with object sizes from 64k to 4M.

I don't see any pattern right now. It looks to me as if every PG that falls within the osd_scrub_begin_hour/osd_scrub_end_hour and osd_scrub_begin_week_day/osd_scrub_end_week_day window is pooled together and picked at random for scrubbing. In my eyes that does not make sense, since it results in a bunch of PGs being scrubbed again after a day or so while older PGs won't get touched for days.
It would make far more sense to prioritize PGs that have not been scrubbed for the longest time rather than picking them seemingly at random.
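
To make the oldest-first idea concrete, here is a minimal Python sketch. This is not how Ceph schedules scrubs internally. It only shows the ordering I would expect. It assumes each PG is a dict with the fields "pgid", "state" and "last_deep_scrub_stamp", as they appear (as far as I can tell) in the pg_stats list of `ceph pg dump --format json`, and that stamps start with "YYYY-MM-DD HH:MM:SS". The function names and the sample data are made up for illustration.

```python
from collections import Counter
from datetime import datetime


def parse_stamp(stamp):
    """Parse a scrub stamp, ignoring fractional seconds and any timezone."""
    # Assumption: the first 19 characters are 'YYYY-MM-DD HH:MM:SS'
    # (a 'T' separator is tolerated as well).
    return datetime.strptime(stamp[:19].replace("T", " "), "%Y-%m-%d %H:%M:%S")


def scrub_histogram(pg_stats, key="last_deep_scrub_stamp"):
    """Count PGs per calendar day of their last (deep-)scrub."""
    return Counter(parse_stamp(pg[key]).date().isoformat() for pg in pg_stats)


def oldest_pgs(pg_stats, count, key="last_deep_scrub_stamp"):
    """Return the pgids of the `count` active PGs with the oldest stamp."""
    active = [pg for pg in pg_stats if "active" in pg["state"]]
    active.sort(key=lambda pg: parse_stamp(pg[key]))
    return [pg["pgid"] for pg in active[:count]]


# Hypothetical sample data, only for illustration.
sample = [
    {"pgid": "1.0", "state": "active+clean",
     "last_deep_scrub_stamp": "2020-09-13 10:00:00.000000"},
    {"pgid": "1.1", "state": "active+clean",
     "last_deep_scrub_stamp": "2020-08-30 02:15:00.000000"},
    {"pgid": "1.2", "state": "active+clean",
     "last_deep_scrub_stamp": "2020-09-01 23:59:59.000000"},
    {"pgid": "1.3", "state": "peering",
     "last_deep_scrub_stamp": "2020-08-01 00:00:00.000000"},
]

for pgid in oldest_pgs(sample, count=2):
    # Dry run: print the command instead of executing it.
    print("ceph pg deep-scrub", pgid)
```

Fed with the real PG list, the printed `ceph pg deep-scrub <pgid>` commands could be issued a few at a time, which is roughly the kind of external tooling Robin describes. scrub_histogram gives the same per-day counts as the awk pipeline above without depending on a column position.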
_______________________________________________
ceph-users mailing list -- ceph-users@xxxxxxx
To unsubscribe send an email to ceph-users-leave@xxxxxxx