We have had PGs get stuck on Quincy (17.2.7). After switching to wpq,
no such problems were observed. We're using a replicated (x3) pool.
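In case it helps anyone, the switch itself is just a config change plus OSD
restarts, since osd_op_queue is only read at startup. A minimal sketch,
assuming a cluster-wide change on a cephadm deployment (adjust the restart
step for your environment):

    # set the scheduler for all OSDs (takes effect on restart)
    ceph config set osd osd_op_queue wpq
    # confirm the value stored in the config database
    ceph config get osd osd_op_queue
    # restart OSDs one at a time, e.g. under cephadm:
    ceph orch daemon restart osd.0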
On 2024-05-02 10:02, Wesley Dillingham wrote:
In our case it was with an EC pool as well. I believe the PG state was
degraded+recovering / recovery_wait, and IIRC the PGs simply sat in the
recovering state without any progress (the degraded object count did not
decline). A repeer of the PG was attempted but without success. A restart
of all the OSDs for the given PGs was attempted under mclock; that didn't
work. Switching to wpq for all OSDs in the given PGs did resolve the issue.
This was on a 17.2.7 cluster.
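Roughly, the steps were along these lines (the PG ID 2.1a and osd.12 below
are placeholders, not the actual IDs involved):

    # repeer the stuck PG (did not help in our case)
    ceph pg repeer 2.1a
    # list the OSDs in the PG's up/acting sets
    ceph pg map 2.1a
    # switch just those OSDs to wpq, then restart them
    ceph config set osd.12 osd_op_queue wpq
    ceph orch daemon restart osd.12   # or restart via systemd, depending on deployment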
Respectfully,
*Wes Dillingham*
LinkedIn <http://www.linkedin.com/in/wesleydillingham>
wes@xxxxxxxxxxxxxxxxx
On Thu, May 2, 2024 at 9:54 AM Sridhar Seshasayee <sseshasa@xxxxxxxxxx>
wrote:
Multiple people -- including me -- have also observed backfill/recovery
stop completely for no apparent reason.
In some cases poking the lead OSD for a PG with `ceph osd down` restores
progress; in other cases it doesn't.
Anecdotally this *may* only happen for EC pools on HDDs, but that sample
size is small.
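For the archives, that poke amounts to marking the PG's primary OSD down so
the PG repeers; a minimal sketch, where 2.1a and osd.12 are placeholder IDs:

    # the first OSD listed in the acting set is the primary
    ceph pg map 2.1a
    # mark that OSD down; a healthy OSD marks itself back up and the PG repeers
    ceph osd down osd.12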
Thanks for the information. We will try to reproduce this locally with EC
pools and investigate further.
I will follow up with a tracker for this.
_______________________________________________
ceph-users mailing list -- ceph-users@xxxxxxx
To unsubscribe send an email to ceph-users-leave@xxxxxxx