PGs stuck deep-scrubbing for weeks - 16.2.9

We have two clusters: one upgraded 14.2.22 -> 16.2.7 -> 16.2.9, and
another 16.2.7 -> 16.2.9.

Both use multi-device OSDs (spinner block / SSD block.db), both serve
CephFS, and each has around 600 OSDs with a combination of replica-3 and
8+3 EC data pools. We have examples of stuck scrubbing PGs from all of
the pools.

They have generally been behind on scrubbing, which we attributed simply
to large disks (10TB) under a heavy write load, with the OSDs having
trouble keeping up. On closer inspection it appears we have many PGs
lodged in a deep scrubbing state: on one cluster for 2 weeks, and on the
other for 7 weeks. Wondering if others have been experiencing anything
similar. The only example of PGs being stuck scrubbing I have seen in
the past was related to the snaptrim PG state, but we aren't doing
anything with snapshots in these new clusters.
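For anyone wanting to check their own clusters, here is a rough sketch of
how we could have spotted these sooner. It parses the per-PG stats from
`ceph pg dump pgs --format=json` and flags PGs that are currently in a
scrubbing+deep state but whose last *completed* deep scrub is old. The
function name, the 14-day threshold, and the use of
`last_deep_scrub_stamp` as a staleness proxy are my own choices, not
official tooling:

```python
from datetime import datetime


def parse_stamp(stamp):
    """Parse a Ceph timestamp such as '2022-06-01T12:34:56.789012+0000'
    (older releases use a space instead of the 'T'). Fraction and zone
    are dropped for simplicity; Ceph stamps are UTC."""
    stamp = stamp.replace(" ", "T", 1)
    stamp = stamp.split(".")[0].split("+")[0]
    return datetime.strptime(stamp, "%Y-%m-%dT%H:%M:%S")


def find_stuck_scrubs(pg_stats, now, max_age_days=14):
    """Return (pgid, acting_primary, age_in_days) for every PG that is
    currently deep scrubbing but whose last completed deep scrub is
    older than max_age_days -- a likely sign the scrub is stuck."""
    stuck = []
    for pg in pg_stats:
        if "scrubbing+deep" not in pg["state"]:
            continue
        age = (now - parse_stamp(pg["last_deep_scrub_stamp"])).days
        if age >= max_age_days:
            stuck.append((pg["pgid"], pg.get("acting_primary"), age))
    return stuck


# Typical use (on Pacific the PG list is wrapped in a "pg_stats" key;
# adjust if your release returns a bare list):
#   import json, subprocess
#   dump = json.loads(subprocess.check_output(
#       ["ceph", "pg", "dump", "pgs", "--format=json"]))
#   for pgid, primary, age in find_stuck_scrubs(
#           dump["pg_stats"], datetime.utcnow()):
#       print(f"{pgid} primary=osd.{primary} last deep scrub {age}d ago")
```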

Granted, my cluster has been warning me with "pgs not deep-scrubbed in
time" and it's on me for not looking more closely into why. Perhaps a
separate warning of "PG stuck scrubbing for greater than 24 hours" or
similar might be helpful to an operator.

In any case, I was able to get scrubs proceeding again by restarting the
primary OSD daemon of each stuck PG. Will monitor closely for additional
stuck scrubs.
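For completeness, the remediation above boils down to finding each stuck
PG's acting primary (e.g. via `ceph pg map <pgid>`, or the
`acting_primary` field in `ceph pg dump pgs`) and bouncing that OSD. A
small helper to build the right restart command (the function name and
the `cephadm` flag are mine; both underlying commands are standard):

```python
def restart_command(primary_osd_id, cephadm=False):
    """Return the shell command to restart a PG's primary OSD daemon.
    Use the cephadm/orchestrator form on containerized deployments,
    the plain systemd unit otherwise."""
    if cephadm:
        return f"ceph orch daemon restart osd.{primary_osd_id}"
    return f"systemctl restart ceph-osd@{primary_osd_id}"


# e.g. restart_command(12)  -> "systemctl restart ceph-osd@12"
```

After the restart, `ceph pg deep-scrub <pgid>` can be used to re-kick the
deep scrub rather than waiting for the scheduler to pick the PG up again.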


Respectfully,

*Wes Dillingham*
wes@xxxxxxxxxxxxxxxxx
LinkedIn <http://www.linkedin.com/in/wesleydillingham>
_______________________________________________
ceph-users mailing list -- ceph-users@xxxxxxx
To unsubscribe send an email to ceph-users-leave@xxxxxxx


