We have two clusters: one 14.2.22 -> 16.2.7 -> 16.2.9, another 16.2.7 -> 16.2.9. Both use multi-disk OSDs (spinner block / SSD block.db), both serve CephFS, and each has around 600 OSDs with a combination of rep-3 and 8+3 EC data pools. We have examples of stuck scrubbing PGs from all of the pools.

They have generally been behind on scrubbing, which we attributed simply to large disks (10TB) under a heavy write load, with the OSDs just having trouble keeping up. On closer inspection it appears we have many PGs that have been lodged in a deep scrubbing state, on one cluster for 2 weeks and on another for 7 weeks. Wondering if others have been experiencing anything similar. The only example of PGs being stuck scrubbing I have seen in the past has been related to the snaptrim PG state, but we aren't doing anything with snapshots in these new clusters.

Granted, my cluster has been warning me with "pgs not deep-scrubbed in time", and it's on me for not looking more closely into why. Perhaps a separate warning of "PG stuck scrubbing for greater than 24 hours" or similar might be helpful to an operator.

In any case, I was able to get scrubs proceeding again by restarting the primary OSD daemon of each PG that was stuck. Will monitor closely for additional stuck scrubs.

Respectfully,

*Wes Dillingham*
wes@xxxxxxxxxxxxxxxxx
LinkedIn <http://www.linkedin.com/in/wesleydillingham>
_______________________________________________
ceph-users mailing list -- ceph-users@xxxxxxx
To unsubscribe send an email to ceph-users-leave@xxxxxxx
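P.S. For anyone wanting to distinguish a wedged deep scrub from a merely slow one, a rough approach is to snapshot `ceph pg dump pgs --format=json` twice, some hours apart, and flag PGs that are deep scrubbing in both samples on the same acting primary. This is a hypothetical helper, not something we ran; the field names (`pg_stats`, `pgid`, `state`, `acting_primary`) are assumptions based on typical Pacific-era JSON output and may differ by release.

```python
# Hypothetical sketch (not from the post): compare two samples of
# `ceph pg dump pgs --format=json` and report PGs that were deep
# scrubbing in both, on the same acting primary. Long-lived entries
# here are candidates for a stuck scrub rather than a slow one.

def deep_scrubbing(dump):
    """Map pgid -> acting_primary for PGs whose state includes a deep scrub."""
    return {
        pg["pgid"]: pg.get("acting_primary")
        for pg in dump.get("pg_stats", [])
        if "scrubbing+deep" in pg.get("state", "")
    }

def still_deep_scrubbing(earlier, later):
    """PGs deep scrubbing in both samples with an unchanged acting primary."""
    first, second = deep_scrubbing(earlier), deep_scrubbing(later)
    return sorted(
        (pgid, osd) for pgid, osd in first.items()
        if pgid in second and second[pgid] == osd
    )
```

Anything this reports across, say, a day-apart pair of samples would be a candidate for the workaround described above, i.e. restarting that PG's primary OSD daemon.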