Hi,
our cluster is running Pacific 16.2.10. Since the upgrade the cluster has
started to report an increasing number of PGs without a timely deep-scrub:
# ceph -s
  cluster:
    id:     XXXX
    health: HEALTH_WARN
            1073 pgs not deep-scrubbed in time

  services:
    mon: 3 daemons, quorum XXX,XXX,XXX (age 10d)
    mgr: XXX(active, since 3w), standbys: XXX, XXX
    mds: 2/2 daemons up, 2 standby
    osd: 460 osds: 459 up (since 3d), 459 in (since 5d)
    rgw: 1 daemon active (1 hosts, 1 zones)

  data:
    volumes: 2/2 healthy
    pools:   16 pools, 5073 pgs
    objects: 733.76M objects, 1.1 PiB
    usage:   1.6 PiB used, 3.3 PiB / 4.9 PiB avail
    pgs:     4941 active+clean
             105  active+clean+scrubbing
             27   active+clean+scrubbing+deep
The cluster is healthy otherwise, with the exception of one failed OSD.
It has been marked out and should not interfere with scrubbing.
Scrubbing itself is running, but there are too few deep-scrubs. If I
remember correctly, we had a larger number of deep scrubs running before
the last upgrade. I tried to extend the deep-scrub interval, but to no
avail yet.
The majority of PGs are part of a CephFS data pool (4096 of 4941 PGs), and
those also account for most of the reported PGs. The pool is backed by 12
machines with 48 disks each, so there should be enough I/O capacity for
running deep-scrubs. Load on these machines and disks is also pretty low.
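In case it matters, these are the scrub-related settings I still want to
double-check on one of the OSDs (osd.0 is just an arbitrary example, and
the option names are from memory, so please correct me if they changed in
Pacific):

# ceph config show osd.0 osd_max_scrubs
# ceph config show osd.0 osd_scrub_load_threshold
# ceph config show osd.0 osd_scrub_begin_hour
# ceph config show osd.0 osd_scrub_end_hour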
Any hints on debugging this? The number of affected PGs has risen from
about 600 to over 1000 during the weekend and continues to rise...
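In case it helps to narrow things down, I can list the affected PGs and
trigger individual deep-scrubs by hand like this (the pg id 20.1f below is
only a placeholder):

# ceph health detail | grep 'not deep-scrubbed since'
# ceph pg deep-scrub 20.1f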
Best regards,
Burkhard Linke