Re: Unable to fix 1 Inconsistent PG

Wesley Dillingham <wes@xxxxxxxxxxxxxxxxx> · Wed, 11 Oct 2023 16:38:44 -0400

If I recall correctly, when the acting set or up set of a PG changes, the
scrub information is lost. It was likely lost when you stopped osd.238 and
the sets changed.

Based on your initial post, I do not believe you need to be using the
objectstore tool at this point. Inconsistent PGs are a common occurrence
and can be repaired.

Given your most recent post, I would get osd.238 back into the cluster
unless you have reason to believe it is the failing hardware. The failing
device could be any of the OSDs in the following set (from your initial
post):
[238,106,402,266,374,498,590,627,684,73,66]

You should inspect the SMART data and dmesg on the drives and servers
supporting the above OSDs to determine which one is failing.
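
A rough sketch of how that inspection could go, assuming a release recent
enough to have "ceph device ls-by-daemon" and that smartmontools is
installed on the OSD hosts. The function names and the grep pattern are
only illustrative, and the device path is a placeholder you fill in per
host:

```shell
# OSD ids from the set quoted above.
osds="238 106 402 266 374 498 590 627 684 73 66"

# Run from an admin node: show which host and device back each OSD.
locate_osds() {
    for id in $osds; do
        echo "== osd.$id =="
        ceph osd find "$id"                  # host and CRUSH location
        ceph device ls-by-daemon "osd.$id"   # backing device(s)
    done
}

# Run as root on the OSD host; pass the backing device, e.g. /dev/sdX.
check_drive() {
    smartctl -a "$1"    # reallocated/pending sectors, media errors
    # Kernel-side read errors (pattern is a guess; adjust as needed).
    dmesg -T | grep -i -E 'i/o error|medium error'
}
```

Call locate_osds first, then run check_drive on each host for the devices
it reports.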

After you get the PG back to active+clean+inconsistent (get osd.238 back in
and let it finish its backfill), you can re-issue a manual deep-scrub of
it. Once that deep-scrub finishes, rados list-inconsistent-obj 15.f4f
should return results and implicate a single OSD with errors.

Finally you should issue the PG repair again.
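
Put together, the sequence above would look roughly like this. The
function names are just for illustration; the commands inside them are
the standard ceph and rados CLI calls, run in order, waiting for each
step to finish before moving on:

```shell
PGID="15.f4f"

# 1. Once the PG is active+clean+inconsistent, request a new deep-scrub.
rescrub_pg() { ceph pg deep-scrub "$PGID"; }

# 2. Poll until the state no longer shows scrubbing and the stamp updates.
show_scrub_state() {
    ceph pg "$PGID" query | grep -E '"state"|"last_deep_scrub_stamp"'
}

# 3. With fresh scrub information, list the objects and shards in error.
show_inconsistent() {
    rados list-inconsistent-obj "$PGID" --format=json-pretty
}

# 4. Ask the primary to repair the PG.
repair_pg() { ceph pg repair "$PGID"; }
```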

To get your manually issued scrubs and repairs to start sooner, you may
want to set the noscrub and nodeep-scrub flags until the PG is repaired.
That keeps scheduled scrubs from occupying the scrub slots on those OSDs.
Remember to unset both flags afterwards.
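
For reference, a minimal sketch of setting and clearing those flags. The
wrapper names are made up; the flags are cluster-wide, so scheduled
scrubbing stops everywhere until they are unset:

```shell
# Stop new scheduled scrubs and deep-scrubs cluster-wide.
pause_scheduled_scrubs() {
    ceph osd set noscrub
    ceph osd set nodeep-scrub
}

# Re-enable scheduled scrubbing once the PG is repaired.
resume_scheduled_scrubs() {
    ceph osd unset noscrub
    ceph osd unset nodeep-scrub
}
```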

As an aside, an osd_max_scrubs of 9 is too aggressive IMO. I would drop
that back to 3 at most.
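
Assuming you manage settings through the centralized config database
(if you still set this in ceph.conf, change it there instead), lowering
it could look like the following. The helper names are hypothetical:

```shell
# Show the value currently stored for the osd section.
show_max_scrubs() { ceph config get osd osd_max_scrubs; }

# Set the per-OSD concurrent scrub limit; defaults to 3 if no argument.
set_max_scrubs() { ceph config set osd osd_max_scrubs "${1:-3}"; }
```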

Respectfully,

*Wes Dillingham*
wes@xxxxxxxxxxxxxxxxx
LinkedIn <http://www.linkedin.com/in/wesleydillingham>

On Wed, Oct 11, 2023 at 10:51 AM Siddhit Renake <tech35.sid@xxxxxxxxx>
wrote:

> Hello Wes,
>
> Thank you for your response.
>
> brc1admin:~ # rados list-inconsistent-obj 15.f4f
> No scrub information available for pg 15.f4f
>
> brc1admin:~ # ceph osd ok-to-stop osd.238
> OSD(s) 238 are ok to stop without reducing availability or risking data,
> provided there are no other concurrent failures or interventions.
> 341 PGs are likely to be degraded (but remain available) as a result.
>
> Before I proceed with your suggested action plan, I need clarification on
> the point below.
> In order to list all objects residing on the inconsistent PG, we
> stopped the primary OSD (osd.238) and extracted the list of all objects
> residing on this OSD using ceph-objectstore-tool. We noticed that when
> we stop the OSD (osd.238) using systemctl, the RGW gateways continuously
> restart, which impacts our S3 service availability. This was observed
> twice when we stopped osd.238 for general maintenance activity with
> ceph-objectstore-tool. How can we ensure that stopping and marking out
> osd.238 (the primary OSD of the inconsistent PG) does not impact RGW
> service availability?
> _______________________________________________
> ceph-users mailing list -- ceph-users@xxxxxxx
> To unsubscribe send an email to ceph-users-leave@xxxxxxx