Re: Nautilus Scrub and deep-Scrub execution order

On Mon, Sep 14, 2020 at 11:40:22AM -0000, Johannes L wrote:
> Hello Ceph-Users
> 
> after upgrading one of our clusters to Nautilus we noticed the "x pgs not scrubbed/deep-scrubbed in time" warnings.
> Some digging suggests that scrubbing happens at random and does not take the age of the last scrub/deep-scrub into consideration.
> I dumped the last-scrub times twice, with a 90 minute gap in between:
> ceph pg dump | grep active | awk '{print $22}' | sort | uniq -c
> dumped all
>    2434 2020-08-30
>    5935 2020-08-31
>    1782 2020-09-01
>       2 2020-09-02
>       2 2020-09-03
>       5 2020-09-06
>       3 2020-09-08
>       5 2020-09-09
>      17 2020-09-10
>     259 2020-09-12
>   26672 2020-09-13
>   12036 2020-09-14
> 
> dumped all
>    2434 2020-08-30
>    5933 2020-08-31
>    1782 2020-09-01
>       2 2020-09-02
>       2 2020-09-03
>       5 2020-09-06
>       3 2020-09-08
>       5 2020-09-09
>      17 2020-09-10
>      51 2020-09-12
>   24862 2020-09-13
>   14056 2020-09-14
> 
> It is pretty obvious that PGs scrubbed only a day ago are being scrubbed again for some reason, while ones that are two weeks old are left basically untouched.
> One way we are currently dealing with this issue is setting the osd_scrub_min_interval to 72h to force the cluster to scrub the older PGs.
> This can't be intentional.
> Has anyone else seen this behavior?

Yes, this behavior has existed for a long time; the warnings are what's new.

- What's your workload? RBD/RGW/CephFS/???
- Is there a pattern to which pools are behind?

At more than one job now, we have written tooling that drives the
oldest scrubs first, either in addition to or instead of Ceph's own
scrub scheduling.
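A minimal sketch of that kind of oldest-first driver, run here against a
small hypothetical sample instead of live output (column positions in
plain-text `ceph pg dump` differ between Ceph releases, which is why the
$22 above may not match your cluster; verify with `ceph pg dump | head`
first):

```shell
#!/bin/sh
# Hypothetical sample of "<pgid> <last-deep-scrub-date>" pairs, standing
# in for parsed `ceph pg dump` output.
sample='1.a 2020-08-30
1.b 2020-09-13
1.c 2020-08-31'

# Sort PGs by last-scrub date, oldest first. On a live cluster each pgid
# would then be fed to `ceph pg deep-scrub <pgid>`, with some throttling
# between PGs so you don't saturate the OSDs.
echo "$sample" | sort -k2 | awk '{print $1}'
```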

The one thing that absolutely stood out in that, however, is that some
PGs took much longer than others or never completed (which meant other
PGs on those OSDs also got delayed). I never got to the bottom of why
at my last job, and it hasn't been a high enough priority at my current
job for the cases we saw there (it may have been a precursor to a disk
failing).
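For reference, the 72h workaround Johannes mentions can be applied
cluster-wide on Nautilus like this (a sketch; the option takes seconds):

```shell
# osd_scrub_min_interval is in seconds: 72h = 72 * 3600 = 259200
ceph config set osd osd_scrub_min_interval 259200
```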

-- 
Robin Hugh Johnson
Gentoo Linux: Dev, Infra Lead, Foundation Treasurer
E-Mail   : robbat2@xxxxxxxxxx
GnuPG FP : 11ACBA4F 4778E3F6 E4EDF38E B27B944E 34884E85
GnuPG FP : 7D0B3CEB E9B85B1F 825BCECF EE05E6F6 A48F6136

_______________________________________________
ceph-users mailing list -- ceph-users@xxxxxxx
To unsubscribe send an email to ceph-users-leave@xxxxxxx
