In my experience, "No scrub information available for pg 11.2b5 / error 2: (2) No such file or directory" is the output you get from that command when the up or acting OSD set has changed since the last deep scrub. Have you tried running a deep scrub on the PG (ceph pg deep-scrub 11.2b5) and then running "rados list-inconsistent-obj 11.2b5" again? I do recognize that a pg repair also performs a deep scrub as part of its work, but perhaps a deep scrub alone will help with your attempt to run rados list-inconsistent-obj.
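Roughly the sequence I have in mind, as an untested sketch (PG ID taken from your output; a deep scrub of a PG in a pool that size may take a while to be scheduled and to complete, so wait for the scrub stamps to update before re-running the listing):

    # kick off a deep scrub of the problem PG
    ceph pg deep-scrub 11.2b5

    # wait for it to finish -- the last_deep_scrub_stamp in the PG stats should move forward
    ceph pg 11.2b5 query | grep -i scrub_stamp

    # then retry the inconsistency listing (json-pretty just makes it easier to read)
    rados list-inconsistent-obj 11.2b5 --format=json-pretty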
Respectfully,

*Wes Dillingham*
wes@xxxxxxxxxxxxxxxxx
LinkedIn <http://www.linkedin.com/in/wesleydillingham>


On Tue, May 10, 2022 at 8:52 AM Robert Appleyard - STFC UKRI <rob.appleyard@xxxxxxxxxx> wrote:

> Hi,
>
> We've got an outstanding issue with one of our Ceph clusters here at RAL.
> The cluster is 'Echo', our 40PB cluster. We found an object from an 8+3 EC
> RGW pool in the failed_repair state. We aren't sure how the object got
> into this state, but it doesn't appear to be a case of correlated drive
> failure (the rest of the PG is fine). However, the detail of how we got
> into this state isn't our focus; it's how to get the PG back to a clean
> state.
>
> The object in question (for our purposes, named OBJNAME) is from a RadosGW
> data pool. It presented initially as a PG in the failed_repair state.
> Repeated attempts to get the PG to repair failed. At that point we
> contacted the user who owns the data and determined that the data in
> question was also stored elsewhere, so we could safely delete the object.
> We did that with "radosgw-admin object rm OBJNAME", and confirmed that the
> object is gone with various approaches (radosgw-admin object stat, rados
> ls --pgid PGID | grep OBJNAME).
>
> So far, so good. Except that, even after the object was deleted and in
> spite of many instructions to repair, the placement group is still in the
> state active+clean+inconsistent+failed_repair, and the cluster won't go to
> HEALTH_OK. Here's what the log from one of these repair attempts looks
> like (from the log on the primary OSD):
>
> 2022-05-08 16:23:43.898 7f79d3872700 0 log_channel(cluster) log [DBG] : 11.2b5 repair starts
> 2022-05-08 16:51:38.807 7f79d3872700 -1 log_channel(cluster) log [ERR] : 11.2b5 shard 1899(8) soid 11:ad45a433:::OBJNAME:head : candidate had an ec size mismatch
> 2022-05-08 16:51:38.807 7f79d3872700 -1 log_channel(cluster) log [ERR] : 11.2b5 shard 1911(7) soid 11:ad45a433:::OBJNAME:head : candidate had an ec size mismatch
> 2022-05-08 16:51:38.807 7f79d3872700 -1 log_channel(cluster) log [ERR] : 11.2b5 shard 2842(10) soid 11:ad45a433:::OBJNAME:head : candidate had an ec size mismatch
> 2022-05-08 16:51:38.807 7f79d3872700 -1 log_channel(cluster) log [ERR] : 11.2b5 shard 3256(6) soid 11:ad45a433:::OBJNAME:head : candidate had an ec size mismatch
> 2022-05-08 16:51:38.807 7f79d3872700 -1 log_channel(cluster) log [ERR] : 11.2b5 shard 3399(5) soid 11:ad45a433:::OBJNAME:head : candidate had an ec size mismatch
> 2022-05-08 16:51:38.807 7f79d3872700 -1 log_channel(cluster) log [ERR] : 11.2b5 shard 3770(9) soid 11:ad45a433:::OBJNAME:head : candidate had an ec size mismatch
> 2022-05-08 16:51:38.807 7f79d3872700 -1 log_channel(cluster) log [ERR] : 11.2b5 shard 5206(3) soid 11:ad45a433:::OBJNAME:head : candidate had an ec size mismatch
> 2022-05-08 16:51:38.807 7f79d3872700 -1 log_channel(cluster) log [ERR] : 11.2b5 shard 6047(4) soid 11:ad45a433:::OBJNAME:head : candidate had an ec size mismatch
> 2022-05-08 16:51:38.807 7f79d3872700 -1 log_channel(cluster) log [ERR] : 11.2b5 soid 11:ad45a433:::OBJNAME:head : failed to pick suitable object info
> 2022-05-08 19:03:12.690 7f79d3872700 -1 log_channel(cluster) log [ERR] : 11.2b5 repair 11 errors, 0 fixed
>
> Looking for inconsistent objects in the PG doesn't report anything odd
> about this object. Right now we get this rather odd output, but we aren't
> sure whether it's a red herring:
>
> [root@ceph-adm1 ~]# rados list-inconsistent-obj 11.2b5
> No scrub information available for pg 11.2b5
> error 2: (2) No such file or directory
>
> We don't get this output from this command on any other PG that we've
> tried.
>
> So what next? To reiterate, this isn't about data recovery; it's about
> getting the cluster back to a healthy state. I should also note that this
> issue doesn't seem to be impacting the cluster beyond making that PG show
> up as being in a bad state.
>
> Rob Appleyard
>
> This email and any attachments are intended solely for the use of the
> named recipients. If you are not the intended recipient you must not use,
> disclose, copy or distribute this email or any of its attachments and
> should notify the sender immediately and delete this email from your
> system. UK Research and Innovation (UKRI) has taken every reasonable
> precaution to minimise risk of this email or any attachments containing
> viruses or malware but the recipient should carry out its own virus and
> malware checks before opening the attachments. UKRI does not accept any
> liability for any losses or damages which the recipient may sustain due to
> presence of any viruses.

_______________________________________________
ceph-users mailing list -- ceph-users@xxxxxxx
To unsubscribe send an email to ceph-users-leave@xxxxxxx