Re: Increasing number of unscrubbed PGs

Hi,

I'm still not sure why increasing the interval doesn't help (maybe there's some flag set on the PGs or something), but you could simply increase osd_max_scrubs if your OSDs are not too busy. On one customer cluster with high load during the day we restricted scrubs to the night hours, but with osd_max_scrubs = 6. What is your current value for osd_max_scrubs?
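
For reference, on that customer cluster we set it roughly like this (the exact hours and the value 6 are just what worked there, adjust to your hardware and load pattern):

# ceph config get osd osd_max_scrubs
# ceph config set osd osd_max_scrubs 6
# ceph config set osd osd_scrub_begin_hour 22
# ceph config set osd osd_scrub_end_hour 6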

Regards,
Eugen

Zitat von Burkhard Linke <Burkhard.Linke@xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx>:

Hi,


our cluster is running Pacific 16.2.10. Since the upgrade the cluster has started to report an increasing number of PGs without a timely deep-scrub:


# ceph -s
  cluster:
    id:    XXXX
    health: HEALTH_WARN
            1073 pgs not deep-scrubbed in time

  services:
    mon: 3 daemons, quorum XXX,XXX,XXX (age 10d)
    mgr: XXX(active, since 3w), standbys: XXX, XXX
    mds: 2/2 daemons up, 2 standby
    osd: 460 osds: 459 up (since 3d), 459 in (since 5d)
    rgw: 1 daemon active (1 hosts, 1 zones)

  data:
    volumes: 2/2 healthy
    pools:   16 pools, 5073 pgs
    objects: 733.76M objects, 1.1 PiB
    usage:   1.6 PiB used, 3.3 PiB / 4.9 PiB avail
    pgs:     4941 active+clean
             105  active+clean+scrubbing
             27   active+clean+scrubbing+deep


The cluster is otherwise healthy, with the exception of one failed OSD. It has been marked out and should not interfere with scrubbing. Scrubbing itself is running, but there are too few deep-scrubs. If I remember correctly we had a larger number of deep-scrubs before the last upgrade. I tried to extend the deep-scrub interval (roughly as shown below), but to no avail yet.
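
For completeness, this is roughly what I did to check and extend the interval (osd.0 and the two-week value are just examples):

# ceph config show osd.0 osd_deep_scrub_interval
# ceph config set osd osd_deep_scrub_interval 1209600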

The majority of PGs are part of a Ceph data pool (4096 of 4941 PGs), and those also make up most of the reported PGs. The pool is backed by 12 machines with 48 disks each, so there should be enough I/O capacity for running deep-scrubs. Load on these machines and disks is also pretty low.

Any hints on debugging this? The number of affected PGs has risen from 600 to over 1000 over the weekend and continues to rise...
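
For reference, the affected PGs and their oldest deep-scrub timestamps can be listed with something like the following (the JSON layout may differ slightly between releases):

# ceph health detail | grep 'not deep-scrubbed since' | head
# ceph pg dump pgs --format json 2>/dev/null | \
    jq -r '.pg_stats[] | "\(.last_deep_scrub_stamp) \(.pgid)"' | sort | head -20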


Best regards,

Burkhard Linke





_______________________________________________
ceph-users mailing list -- ceph-users@xxxxxxx
To unsubscribe send an email to ceph-users-leave@xxxxxxx



