In our case it was with an EC pool as well. I believe the PG state was degraded+recovering / recovery_wait, and IIRC the PGs simply sat in the recovering state without any progress (the degraded object count did not decline). A repeer of the PGs was attempted, with no success. A restart of all the OSDs for the given PGs was attempted under mclock; that didn't work. Switching to wpq for all OSDs in the given PGs did resolve the issue. This was on a 17.2.7 cluster.

Respectfully,

*Wes Dillingham*
LinkedIn <http://www.linkedin.com/in/wesleydillingham>
wes@xxxxxxxxxxxxxxxxx

On Thu, May 2, 2024 at 9:54 AM Sridhar Seshasayee <sseshasa@xxxxxxxxxx> wrote:

> > Multiple people -- including me -- have also observed backfill/recovery
> > stop completely for no apparent reason.
> >
> > In some cases poking the lead OSD for a PG with `ceph osd down` restores
> > progress; in other cases it doesn't.
> >
> > Anecdotally this *may* only happen for EC pools on HDDs, but that sample
> > size is small.
>
> Thanks for the information. We will try and reproduce this locally with EC
> pools and investigate this further.
> I will revert with a tracker for this.
> _______________________________________________
> ceph-users mailing list -- ceph-users@xxxxxxx
> To unsubscribe send an email to ceph-users-leave@xxxxxxx
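
[The troubleshooting sequence described above can be sketched roughly as below. This is an illustrative sketch, not the poster's exact commands: the PG ID (2.1f) and OSD IDs (12, 37, 45) are placeholders, the systemd unit names assume a package-based ceph-osd deployment, and note that osd_op_queue only takes effect after an OSD restart.]

```shell
# Inspect the stuck PG; look for degraded+recovering / recovery_wait
# and a degraded object count that is not declining.
# (2.1f is a placeholder PG ID -- substitute your own.)
ceph pg 2.1f query | grep '"state"'

# Attempted first: repeer the PG (did not help in this case).
ceph pg repeer 2.1f

# Attempted second: restart the acting OSDs while still on mclock
# (did not help either; unit names vary by deployment).
systemctl restart ceph-osd@12 ceph-osd@37 ceph-osd@45

# What resolved it: switch the op queue scheduler from mclock to wpq
# for the OSDs in the PG's acting set, then restart them (the option
# is not applied at runtime). Scope per OSD with "osd.12" etc., or
# use "osd" to apply to all OSDs.
ceph config set osd.12 osd_op_queue wpq
ceph config set osd.37 osd_op_queue wpq
ceph config set osd.45 osd_op_queue wpq
systemctl restart ceph-osd@12 ceph-osd@37 ceph-osd@45

# Watch for the degraded object count to start declining again.
ceph -s
```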