Mine repaired themselves after a regular deep scrub. Weird that I couldn't trigger one manually.
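(For anyone reading this in the archive: a quick way to see whether a manually requested deep scrub ever actually ran is to watch the scrub stamps on the PG. A minimal sketch, using pg 49.11c from further down this thread as the example; the stamps only change once a scrub completes.)

$ ceph pg deep-scrub 49.11c
$ ceph pg 49.11c query | grep -E 'last_scrub_stamp|last_deep_scrub_stamp'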
On 30 April 2018 at 14:23, David Turner <drakonstein@xxxxxxxxx> wrote:
My 3 inconsistent PGs finally decided to run automatic scrubs, and now 2 of the 3 will allow me to run deep-scrubs and repairs on them. The deep-scrub did not show any new information about the objects other than that they were missing in one of the copies. Running a repair fixed the inconsistency.

On Tue, Apr 24, 2018 at 4:53 PM David Turner <drakonstein@xxxxxxxxx> wrote:

Neither the issue I created nor Michael's [1] ticket that it was rolled into are getting any traction. How are y'all faring with your clusters? I've had 3 PGs inconsistent with 5 scrub errors for a few weeks now. I assumed that the third PG was just like the first 2 in that it couldn't be scrubbed, but I just checked the last scrub timestamp of the 3 PGs and the third one is able to run scrubs. I'm going to increase the logging on it after I finish a round of maintenance we're performing on some OSDs. Hopefully I'll find something more about these objects.

On Fri, Apr 6, 2018 at 12:30 PM David Turner <drakonstein@xxxxxxxxx> wrote:

I'm using filestore. I think the root cause is something getting stuck in the code, so I went ahead and created a [1] bug tracker for this. Hopefully it gets some traction, as I'm not particularly looking forward to messing with deleting PGs with the ceph-objectstore-tool in production.

On Fri, Apr 6, 2018 at 11:40 AM Michael Sudnick <michael.sudnick@xxxxxxxxx> wrote:

I've tried a few more things to get a deep-scrub going on my PG. I tried instructing the involved OSDs to scrub all their PGs, and it looks like that didn't do it.

Do you have any documentation on the ceph-objectstore-tool? What I've found online talks about filestore and not bluestore.

On 6 April 2018 at 09:27, David Turner <drakonstein@xxxxxxxxx> wrote:

I'm running into this exact same situation. I'm running 12.2.2 and I have an EC PG with a scrub error. It has the same output for [1] rados list-inconsistent-obj as mentioned before. This is the [2] full health detail. This is the [3] excerpt from the log of the deep-scrub that marked the PG inconsistent. The scrub happened while the PG was starting up after using ceph-objectstore-tool to split its filestore subfolders, using a script that I've used for months without any side effects.

I have tried quite a few things to get this PG to deep-scrub or repair, but to no avail; it will not do anything. I set every OSD's osd_max_scrubs to 0 in the cluster, waited for all scrubbing and deep scrubbing to finish, then increased it to 1 on the 11 OSDs for this PG before issuing a deep-scrub, and it will sit there for over an hour without deep-scrubbing. My current test is to set osd_max_scrubs to 1 on all OSDs, increase it to 4 on all of the OSDs for this PG, and then issue the repair... but similarly nothing happens. Each time I issue the deep-scrub or repair, the output correctly says 'instructing pg 145.2e3 on osd.234 to repair', but nothing shows up in the log for the OSD and the PG state stays 'active+clean+inconsistent'.

My next step, unless anyone has a better idea, is to find the exact copy of the PG with the missing object, use ceph-objectstore-tool to back up that copy of the PG, and remove it. Then starting the OSD back up should backfill the full copy of the PG and be healthy again.

[1] $ rados list-inconsistent-obj 145.2e3
No scrub information available for pg 145.2e3
error 2: (2) No such file or directory

[2] $ ceph health detail
HEALTH_ERR 1 scrub errors; Possible data damage: 1 pg inconsistent
OSD_SCRUB_ERRORS 1 scrub errors
PG_DAMAGED Possible data damage: 1 pg inconsistent
    pg 145.2e3 is active+clean+inconsistent, acting [234,132,33,331,278,217,55,358,79,3,24]

[3] 2018-04-04 15:24:53.603380 7f54d1820700  0 log_channel(cluster) log [DBG] : 145.2e3 deep-scrub starts
2018-04-04 17:32:37.916853 7f54d1820700 -1 log_channel(cluster) log [ERR] : 145.2e3s0 deep-scrub 1 missing, 0 inconsistent objects
2018-04-04 17:32:37.916865 7f54d1820700 -1 log_channel(cluster) log [ERR] : 145.2e3 deep-scrub 1 errors
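(For reference, a rough sketch of the scrub-forcing attempt and the last-resort export-and-remove step described above. None of this is from the thread itself: osd.234 is just the primary from the acting set, the data path is the default location, and the exact ceph-objectstore-tool flags, including the s0 shard suffix for the EC pg and whether --op remove wants --force, should be checked against the installed version. As far as I know the tool is driven the same way for a bluestore OSD as for filestore, since it works on the stopped OSD's --data-path.)

$ ceph tell 'osd.*' injectargs '--osd_max_scrubs 0'    # quiesce scrubbing cluster-wide
$ ceph tell osd.234 injectargs '--osd_max_scrubs 1'    # repeat for each OSD in the acting set
$ ceph pg deep-scrub 145.2e3
$ ceph pg repair 145.2e3

# If the PG still refuses to scrub: back up the suspect copy, drop it, and let it backfill.
$ systemctl stop ceph-osd@234
$ ceph-objectstore-tool --data-path /var/lib/ceph/osd/ceph-234 --pgid 145.2e3s0 --op export --file /root/145.2e3s0.export
$ ceph-objectstore-tool --data-path /var/lib/ceph/osd/ceph-234 --pgid 145.2e3s0 --op remove
$ systemctl start ceph-osd@234
$ ceph pg 145.2e3 query | grep '"state"'    # should pass through backfilling and return to active+clean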
On Mon, Apr 2, 2018 at 4:51 PM Michael Sudnick <michael.sudnick@xxxxxxxxx> wrote:

Hi Kjetil,

I've tried to get the pg scrubbing/deep scrubbing and nothing seems to be happening. I've tried it a few times over the last few days. My cluster is recovering from a failed disk (which was probably the reason for the inconsistency); do I need to wait for the cluster to heal before repair/deep scrub works?

-Michael

On 2 April 2018 at 14:13, Kjetil Joergensen <kjetil@xxxxxxxxxxxx> wrote:

Hi,

Scrub or deep-scrub the pg; that should in theory get you back to list-inconsistent-obj spitting out what's wrong. Then mail that info to the list.

-KJ

On Sun, Apr 1, 2018 at 9:17 AM, Michael Sudnick <michael.sudnick@xxxxxxxxx> wrote:

Hello,

I have a small cluster with an inconsistent pg. I've tried ceph pg repair multiple times with no luck. rados list-inconsistent-obj 49.11c returns:
# rados list-inconsistent-obj 49.11c
No scrub information available for pg 49.11c
error 2: (2) No such file or directory

I'm a bit at a loss here as to what to do to recover. That pg is part of a cephfs_data pool with compression set to force/snappy.

Does anyone have any suggestions?

-Michael
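(A minimal sketch of the sequence Kjetil suggests above, assuming the deep scrub actually runs; list-inconsistent-obj only has something to report once a scrub has completed and recorded its findings.)

$ ceph pg deep-scrub 49.11c
# wait for the deep scrub to finish, then:
$ rados list-inconsistent-obj 49.11c --format=json-pretty
$ ceph pg repair 49.11c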
--
Kjetil Joergensen <kjetil@xxxxxxxxxxxx>
SRE, Medallia Inc
_______________________________________________
ceph-users mailing list
ceph-users@xxxxxxxxxxxxxx
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com