In my experience, "No scrub information available for pg 11.2b5 / error 2: (2) No such file or directory" is the output you get from that command when the up or acting OSD set has changed since the last deep scrub. Have you tried running a deep scrub on the PG (ceph pg deep-scrub 11.2b5) and then running "rados list-inconsistent-obj 11.2b5" again? I do recognize that a pg repair also performs a deep scrub as part of its work, but perhaps a deep scrub alone will help with your attempt to run rados list-inconsistent-obj.
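Roughly the sequence I have in mind, as an untested sketch (PG ID taken from your output; a deep scrub of a PG in a pool that size may take a while to be scheduled and to complete, so wait for the scrub stamps to update before re-running the listing):

    # kick off a deep scrub of the problem PG
    ceph pg deep-scrub 11.2b5

    # wait for it to finish -- the last_deep_scrub_stamp in the PG stats should move forward
    ceph pg 11.2b5 query | grep -i scrub_stamp

    # then retry the inconsistency listing (json-pretty just makes it easier to read)
    rados list-inconsistent-obj 11.2b5 --format=json-pretty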
Respectfully,

*Wes Dillingham*
wes@xxxxxxxxxxxxxxxxx
LinkedIn <http://www.linkedin.com/in/wesleydillingham>


On Tue, May 10, 2022 at 8:52 AM Robert Appleyard - STFC UKRI <rob.appleyard@xxxxxxxxxx> wrote:

> Hi,
>
> We've got an outstanding issue with one of our Ceph clusters here at RAL.
> The cluster is 'Echo', our 40PB cluster. We found an object from an 8+3 EC
> RGW pool in the failed_repair state. We aren't sure how the object got
> into this state, but it doesn't appear to be a case of correlated drive
> failure (the rest of the PG is fine). However, the detail of how we got
> into this state isn't our focus; it's how to get the PG back to a clean
> state.
>
> The object in question (for our purposes, named OBJNAME) is from a RadosGW
> data pool. It presented initially as a PG in the failed_repair state.
> Repeated attempts to get the PG to repair failed. At that point we
> contacted the user who owns the data and determined that the data in
> question was also stored elsewhere, so we could safely delete the object.
> We did that with "radosgw-admin object rm OBJNAME", and confirmed that the
> object is gone with various approaches (radosgw-admin object stat, rados
> ls --pgid PGID | grep OBJNAME).
>
> So far, so good. Except that, even after the object was deleted and in
> spite of many instructions to repair, the placement group is still in the
> state active+clean+inconsistent+failed_repair, and the cluster won't go to
> HEALTH_OK. Here's what the log from one of these repair attempts looks
> like (from the log on the primary OSD):
>
> 2022-05-08 16:23:43.898 7f79d3872700 0 log_channel(cluster) log [DBG] : 11.2b5 repair starts
> 2022-05-08 16:51:38.807 7f79d3872700 -1 log_channel(cluster) log [ERR] : 11.2b5 shard 1899(8) soid 11:ad45a433:::OBJNAME:head : candidate had an ec size mismatch
> 2022-05-08 16:51:38.807 7f79d3872700 -1 log_channel(cluster) log [ERR] : 11.2b5 shard 1911(7) soid 11:ad45a433:::OBJNAME:head : candidate had an ec size mismatch
> 2022-05-08 16:51:38.807 7f79d3872700 -1 log_channel(cluster) log [ERR] : 11.2b5 shard 2842(10) soid 11:ad45a433:::OBJNAME:head : candidate had an ec size mismatch
> 2022-05-08 16:51:38.807 7f79d3872700 -1 log_channel(cluster) log [ERR] : 11.2b5 shard 3256(6) soid 11:ad45a433:::OBJNAME:head : candidate had an ec size mismatch
> 2022-05-08 16:51:38.807 7f79d3872700 -1 log_channel(cluster) log [ERR] : 11.2b5 shard 3399(5) soid 11:ad45a433:::OBJNAME:head : candidate had an ec size mismatch
> 2022-05-08 16:51:38.807 7f79d3872700 -1 log_channel(cluster) log [ERR] : 11.2b5 shard 3770(9) soid 11:ad45a433:::OBJNAME:head : candidate had an ec size mismatch
> 2022-05-08 16:51:38.807 7f79d3872700 -1 log_channel(cluster) log [ERR] : 11.2b5 shard 5206(3) soid 11:ad45a433:::OBJNAME:head : candidate had an ec size mismatch
> 2022-05-08 16:51:38.807 7f79d3872700 -1 log_channel(cluster) log [ERR] : 11.2b5 shard 6047(4) soid 11:ad45a433:::OBJNAME:head : candidate had an ec size mismatch
> 2022-05-08 16:51:38.807 7f79d3872700 -1 log_channel(cluster) log [ERR] : 11.2b5 soid 11:ad45a433:::OBJNAME:head : failed to pick suitable object info
> 2022-05-08 19:03:12.690 7f79d3872700 -1 log_channel(cluster) log [ERR] : 11.2b5 repair 11 errors, 0 fixed
>
> Looking for inconsistent objects in the PG doesn't report anything odd
> about this object. Right now we get this rather odd output, but we aren't
> sure whether it's a red herring:
>
> [root@ceph-adm1 ~]# rados list-inconsistent-obj 11.2b5
> No scrub information available for pg 11.2b5
> error 2: (2) No such file or directory
>
> We don't get this output from this command on any other PG that we've
> tried.
>
> So what next? To reiterate, this isn't about data recovery; it's about
> getting the cluster back to a healthy state. I should also note that this
> issue doesn't seem to be impacting the cluster beyond making that PG show
> up as being in a bad state.
>
> Rob Appleyard
>
> This email and any attachments are intended solely for the use of the
> named recipients. If you are not the intended recipient you must not use,
> disclose, copy or distribute this email or any of its attachments and
> should notify the sender immediately and delete this email from your
> system. UK Research and Innovation (UKRI) has taken every reasonable
> precaution to minimise risk of this email or any attachments containing
> viruses or malware but the recipient should carry out its own virus and
> malware checks before opening the attachments. UKRI does not accept any
> liability for any losses or damages which the recipient may sustain due to
> presence of any viruses.

_______________________________________________
ceph-users mailing list -- ceph-users@xxxxxxx
To unsubscribe send an email to ceph-users-leave@xxxxxxx