On 9/15/2020 4:41 AM, Johannes L wrote:
Robin H. Johnson wrote:
On Mon, Sep 14, 2020 at 11:40:22AM -0000, Johannes L wrote:
Hello Ceph-Users
After upgrading one of our clusters to Nautilus, we noticed the "x pgs not
scrubbed in time" / "x pgs not deep-scrubbed in time" warnings.
Some digging suggests that scrub scheduling is effectively random and does not
take the age of the last scrub/deep-scrub into account.
I dumped the date of the last scrub twice, 90 minutes apart:
ceph pg dump | grep active | awk '{print $22}' | sort | uniq -c
dumped all
2434 2020-08-30
5935 2020-08-31
1782 2020-09-01
2 2020-09-02
2 2020-09-03
5 2020-09-06
3 2020-09-08
5 2020-09-09
17 2020-09-10
259 2020-09-12
26672 2020-09-13
12036 2020-09-14
dumped all
2434 2020-08-30
5933 2020-08-31
1782 2020-09-01
2 2020-09-02
2 2020-09-03
5 2020-09-06
3 2020-09-08
5 2020-09-09
17 2020-09-10
51 2020-09-12
24862 2020-09-13
14056 2020-09-14
It is pretty obvious that PGs scrubbed only a day ago are being scrubbed
again for some reason, while PGs whose last scrub is two weeks old are left
basically untouched.
As a workaround we are currently raising osd_scrub_min_interval to
72h, which forces the cluster to get around to the older PGs.
This can't be intentional.
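For reference, the 72h workaround can be applied cluster-wide like this (a
sketch only; it assumes the Nautilus centralized config store, and 259200 is
simply 72h expressed in seconds):

```shell
# Raise the minimum scrub interval to 72h (259200 s) for all OSDs.
# Sketch: assumes the Nautilus-era `ceph config` interface; on older
# releases the same effect needs `ceph tell osd.* injectargs` instead.
ceph config set osd osd_scrub_min_interval 259200

# Verify that one OSD picked up the new value:
ceph config get osd.0 osd_scrub_min_interval
```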
Has anyone else seen this behavior? Yes, this has existed for a long time; the
warnings are what's new.
- What's your workload? RBD/RGW/CephFS/???
- Is there a pattern to which pools are behind?
At more than one job now, we've written tooling that drove scrubs of the
oldest PGs, either in addition to or instead of Ceph's own scrub scheduling.
The one thing that absolutely stood out was that some PGs took much
longer than others, or never completed (which meant other PGs on those
OSDs were also delayed). I never got to the bottom of why at my last job,
and at my current job it hasn't been a high enough priority for the cases
we saw (it may have been a precursor to a disk failing).
We use our own librados-based interface to talk to the cluster and store plain objects.
The workload is mostly write-once/read-many, with object sizes from 64k to 4M.
I don't see any pattern so far. It looks to me as if every PG that falls within
the osd scrub begin/end hour/weekday parameters is pooled together and picked
at random to scrub. That makes no sense to me, since it results in some PGs
being scrubbed again after only a day while older PGs aren't touched for days.
It would make far more sense to prioritize the PGs that have gone longest
without a scrub rather than picking seemingly at random.
I had a cronjob on an old ceph cluster that instructed the four PGs with
the oldest last deep-scrub time to initiate a deep-scrub:
  for strPg in $(ceph pg dump 2>&1 \
      | awk '$1 ~ /[0-9a-f]+\.[0-9a-f]+/ {print $20, $21, $1}' \
      | sort | head -4 | awk '{print $3}'); do
    ceph pg deep-scrub "$strPg"
  done
The column numbers will likely need tweaking for newer versions of Ceph,
which changed the 'ceph pg dump' output format, but the concept
should still be valid.
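A variant of the same idea that is a little easier to adapt across releases:
factor the "pick the N oldest" step into a small function that reads
"pgid date time" lines, so that only the awk field numbers (an assumption
here; they vary by Ceph release) ever need adjusting. A sketch:

```shell
#!/bin/sh
# pick_oldest N: read "pgid date time" lines on stdin and print the
# pgids of the N entries with the oldest timestamp. ISO dates sort
# correctly as plain text, so `sort` on fields 2-3 orders by age.
pick_oldest() {
  sort -k2,3 | head -n "${1:-4}" | awk '{print $1}'
}

# Intended use against a live cluster (fields 20/21 for the deep-scrub
# stamp are an assumption -- verify with `ceph pg dump | head` first):
#   ceph pg dump 2>/dev/null \
#     | awk '$1 ~ /^[0-9a-f]+\.[0-9a-f]+$/ {print $1, $20, $21}' \
#     | pick_oldest 4 \
#     | while read -r pg; do ceph pg deep-scrub "$pg"; done
```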
- Mike
_______________________________________________
ceph-users mailing list -- ceph-users@xxxxxxx
To unsubscribe send an email to ceph-users-leave@xxxxxxx