Hi,
our cluster is running Pacific 16.2.10. Since the upgrade the cluster has
started to report an increasing number of PGs without a timely deep-scrub:
# ceph -s
  cluster:
    id:     XXXX
    health: HEALTH_WARN
            1073 pgs not deep-scrubbed in time

  services:
    mon: 3 daemons, quorum XXX,XXX,XXX (age 10d)
    mgr: XXX(active, since 3w), standbys: XXX, XXX
    mds: 2/2 daemons up, 2 standby
    osd: 460 osds: 459 up (since 3d), 459 in (since 5d)
    rgw: 1 daemon active (1 hosts, 1 zones)

  data:
    volumes: 2/2 healthy
    pools:   16 pools, 5073 pgs
    objects: 733.76M objects, 1.1 PiB
    usage:   1.6 PiB used, 3.3 PiB / 4.9 PiB avail
    pgs:     4941 active+clean
             105  active+clean+scrubbing
             27   active+clean+scrubbing+deep
The cluster is healthy otherwise, with the exception of one failed OSD.
It has been marked out and should not interfere with scrubbing.
Scrubbing itself is running, but there are too few deep-scrubs. If I
remember correctly, we had a larger number of deep scrubs running before
the last upgrade. I tried to extend the deep-scrub interval, but to no
avail yet.
The majority of PGs are part of a CephFS data pool (4096 of 4941 PGs), and
those also account for most of the reported PGs. The pool is backed by 12
machines with 48 disks each, so there should be enough I/O capacity for
running deep-scrubs. Load on these machines and disks is also pretty low.
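In case it matters, these are the scrub-related settings I still want to
double-check on one of the OSDs (osd.0 is just an arbitrary example, and
the option names are from memory, so please correct me if they changed in
Pacific):

# ceph config show osd.0 osd_max_scrubs
# ceph config show osd.0 osd_scrub_load_threshold
# ceph config show osd.0 osd_scrub_begin_hour
# ceph config show osd.0 osd_scrub_end_hour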
Any hints on debugging this? The number of affected PGs has risen from
about 600 to over 1000 during the weekend and continues to rise...
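In case it helps to narrow things down, I can list the affected PGs and
trigger individual deep-scrubs by hand like this (the pg id 20.1f below is
only a placeholder):

# ceph health detail | grep 'not deep-scrubbed since'
# ceph pg deep-scrub 20.1f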
Best regards,
Burkhard Linke