Yes, this seems consistent with what we are experiencing. We have
definitely toggled the noscrub flags in various scenarios in the recent
past. Thanks for tracking this down and fixing it.

Respectfully,

*Wes Dillingham*
wes@xxxxxxxxxxxxxxxxx
LinkedIn <http://www.linkedin.com/in/wesleydillingham>


On Fri, Jul 15, 2022 at 10:16 PM David Orman <ormandj@xxxxxxxxxxxx> wrote:

> Apologies, the backport link should be:
> https://github.com/ceph/ceph/pull/46845
>
> On Fri, Jul 15, 2022 at 9:14 PM David Orman <ormandj@xxxxxxxxxxxx> wrote:
>
>> I think you may have hit the same bug we encountered. Cory submitted a
>> fix; see if it matches what you've encountered:
>>
>> https://github.com/ceph/ceph/pull/46727 (backport to Pacific here:
>> https://github.com/ceph/ceph/pull/46877 )
>> https://tracker.ceph.com/issues/54172
>>
>> On Fri, Jul 15, 2022 at 8:52 AM Wesley Dillingham <wes@xxxxxxxxxxxxxxxxx>
>> wrote:
>>
>>> We have two clusters: one upgraded 14.2.22 -> 16.2.7 -> 16.2.9, and
>>> another upgraded 16.2.7 -> 16.2.9.
>>>
>>> Both use a multi-device OSD layout (spinner block / SSD block.db), both
>>> serve CephFS, and each has around 600 OSDs with a combination of rep-3
>>> and 8+3 EC data pools. We have examples of stuck scrubbing PGs from all
>>> of the pools.
>>>
>>> They have generally been behind on scrubbing, which we attributed simply
>>> to large disks (10TB) with a heavy write load and the OSDs having
>>> trouble keeping up. On closer inspection it appears we have many PGs
>>> that have been lodged in a deep scrubbing state, on one cluster for 2
>>> weeks and on the other for 7 weeks. Wondering if others have been
>>> experiencing anything similar. The only example of PGs being stuck
>>> scrubbing I have seen in the past was related to the snaptrim PG state,
>>> but we aren't doing anything with snapshots in these new clusters.
>>>
>>> Granted, my cluster has been warning me with "pgs not deep-scrubbed in
>>> time" and it's on me for not looking more closely into why. Perhaps a
>>> separate warning of "PG stuck scrubbing for greater than 24 hours" or
>>> similar might be helpful to an operator.
>>>
>>> In any case, I was able to get scrubs proceeding again by restarting the
>>> primary OSD daemon of the PGs which were stuck. Will monitor closely for
>>> additional stuck scrubs.
>>>
>>> Respectfully,
>>>
>>> *Wes Dillingham*
>>> wes@xxxxxxxxxxxxxxxxx
>>> LinkedIn <http://www.linkedin.com/in/wesleydillingham>
>>> _______________________________________________
>>> ceph-users mailing list -- ceph-users@xxxxxxx
>>> To unsubscribe send an email to ceph-users-leave@xxxxxxx
>>>
_______________________________________________
ceph-users mailing list -- ceph-users@xxxxxxx
To unsubscribe send an email to ceph-users-leave@xxxxxxx
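
For anyone landing on this thread with the same symptom, the trigger and
the workaround described above map onto standard Ceph CLI commands. Below
is a rough sketch, not a definitive procedure: the PG ID (4.1ff) and OSD
ID (212) are placeholders for illustration, and which restart line applies
depends on how your daemons are deployed.

    # The bug discussed above is triggered by toggling the scrub flags, e.g.:
    ceph osd set nodeep-scrub      # pause deep scrubs cluster-wide
    ceph osd unset nodeep-scrub    # resume them; pre-fix, PGs mid-scrub could wedge here

    # Find PGs currently reporting a (deep) scrubbing state:
    ceph pg dump pgs | grep scrubbing

    # Look up a stuck PG's acting set; the first OSD listed is the primary:
    ceph pg map 4.1ff

    # Restart that primary OSD to kick the stuck scrub loose:
    ceph orch daemon restart osd.212    # cephadm-managed clusters
    systemctl restart ceph-osd@212      # package-based installs, run on the OSD's host

Comparing the deep-scrub timestamp columns of "ceph pg dump pgs" against
the current time is also a quick way to spot PGs that have been sitting in
a scrubbing state far longer than the "not deep-scrubbed in time" warning
alone would suggest.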