Yes, this seems consistent with what we are experiencing. We have
definitely toggled the noscrub flags in various scenarios in the recent
past. Thanks for tracking this down and fixing it.

Respectfully,

*Wes Dillingham*
wes@xxxxxxxxxxxxxxxxx
LinkedIn <http://www.linkedin.com/in/wesleydillingham>


On Fri, Jul 15, 2022 at 10:16 PM David Orman <ormandj@xxxxxxxxxxxx> wrote:

> Apologies, the backport link should be:
> https://github.com/ceph/ceph/pull/46845
>
> On Fri, Jul 15, 2022 at 9:14 PM David Orman <ormandj@xxxxxxxxxxxx> wrote:
>
>> I think you may have hit the same bug we encountered. Cory submitted a
>> fix; see if it matches what you've encountered:
>>
>> https://github.com/ceph/ceph/pull/46727 (backport to Pacific here:
>> https://github.com/ceph/ceph/pull/46877 )
>> https://tracker.ceph.com/issues/54172
>>
>> On Fri, Jul 15, 2022 at 8:52 AM Wesley Dillingham <wes@xxxxxxxxxxxxxxxxx>
>> wrote:
>>
>>> We have two clusters: one upgraded 14.2.22 -> 16.2.7 -> 16.2.9, and
>>> another upgraded 16.2.7 -> 16.2.9.
>>>
>>> Both use a multi-device OSD layout (spinner block / SSD block.db), both
>>> serve CephFS, and each has around 600 OSDs with a combination of rep-3
>>> and 8+3 EC data pools. We have examples of stuck scrubbing PGs from all
>>> of the pools.
>>>
>>> They have generally been behind on scrubbing, which we attributed simply
>>> to large disks (10TB) with a heavy write load and the OSDs having
>>> trouble keeping up. On closer inspection it appears we have many PGs
>>> that have been lodged in a deep scrubbing state, on one cluster for 2
>>> weeks and on the other for 7 weeks. Wondering if others have been
>>> experiencing anything similar. The only example of PGs being stuck
>>> scrubbing I have seen in the past was related to the snaptrim PG state,
>>> but we aren't doing anything with snapshots in these new clusters.
>>>
>>> Granted, my cluster has been warning me with "pgs not deep-scrubbed in
>>> time" and it's on me for not looking more closely into why. Perhaps a
>>> separate warning of "PG stuck scrubbing for greater than 24 hours" or
>>> similar might be helpful to an operator.
>>>
>>> In any case, I was able to get scrubs proceeding again by restarting the
>>> primary OSD daemon of the PGs which were stuck. Will monitor closely for
>>> additional stuck scrubs.
>>>
>>> Respectfully,
>>>
>>> *Wes Dillingham*
>>> wes@xxxxxxxxxxxxxxxxx
>>> LinkedIn <http://www.linkedin.com/in/wesleydillingham>
>>> _______________________________________________
>>> ceph-users mailing list -- ceph-users@xxxxxxx
>>> To unsubscribe send an email to ceph-users-leave@xxxxxxx
>>>
_______________________________________________
ceph-users mailing list -- ceph-users@xxxxxxx
To unsubscribe send an email to ceph-users-leave@xxxxxxx
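
For anyone landing on this thread with the same symptom, the trigger and
the workaround described above map onto standard Ceph CLI commands. Below
is a rough sketch, not a definitive procedure: the PG ID (4.1ff) and OSD
ID (212) are placeholders for illustration, and which restart line applies
depends on how your daemons are deployed.

    # The bug discussed above is triggered by toggling the scrub flags, e.g.:
    ceph osd set nodeep-scrub      # pause deep scrubs cluster-wide
    ceph osd unset nodeep-scrub    # resume them; pre-fix, PGs mid-scrub could wedge here

    # Find PGs currently reporting a (deep) scrubbing state:
    ceph pg dump pgs | grep scrubbing

    # Look up a stuck PG's acting set; the first OSD listed is the primary:
    ceph pg map 4.1ff

    # Restart that primary OSD to kick the stuck scrub loose:
    ceph orch daemon restart osd.212    # cephadm-managed clusters
    systemctl restart ceph-osd@212      # package-based installs, run on the OSD's host

Comparing the deep-scrub timestamp columns of "ceph pg dump pgs" against
the current time is also a quick way to spot PGs that have been sitting in
a scrubbing state far longer than the "not deep-scrubbed in time" warning
alone would suggest.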