You could try a 'rados get' and then a 'rados put' on the object to start with.
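
As a rough, untested sketch (assuming the object really does live in the
cephfs_data pool, as worked out further down the thread, and using the object
name reported by the scrub), the round trip would look something like:

# read the object out and write the same bytes straight back; a full rewrite
# should regenerate the object-info ("_") xattr that osd.67 is missing
rados -p cephfs_data get 100000ea8bb.00000045 /tmp/100000ea8bb.00000045
rados -p cephfs_data put 100000ea8bb.00000045 /tmp/100000ea8bb.00000045

# then deep-scrub the PG again and check whether the inconsistency clears
ceph pg deep-scrub 1.65
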
On Thu, Nov 15, 2018 at 4:07 AM K.C. Wong <kcwong@xxxxxxxxxxx> wrote:
>
> So, I've issued the deep-scrub command (and the repair command)
> and nothing seems to happen.
> Unrelated to this issue, I have to take down some OSDs to prepare
> a host for RMA. One of them happens to be in the replication
> group for this PG. So, a scrub happened indirectly. I now have
> this from "ceph -s":
>
>     cluster 374aed9e-5fc1-47e1-8d29-4416f7425e76
>      health HEALTH_ERR
>             1 pgs inconsistent
>             18446 scrub errors
>      monmap e1: 3 mons at {mgmt01=10.0.1.1:6789/0,mgmt02=10.1.1.1:6789/0,mgmt03=10.2.1.1:6789/0}
>             election epoch 252, quorum 0,1,2 mgmt01,mgmt02,mgmt03
>       fsmap e346: 1/1/1 up {0=mgmt01=up:active}, 2 up:standby
>      osdmap e40248: 120 osds: 119 up, 119 in
>             flags sortbitwise,require_jewel_osds
>       pgmap v22025963: 3136 pgs, 18 pools, 18975 GB data, 214 Mobjects
>             59473 GB used, 287 TB / 345 TB avail
>                 3120 active+clean
>                   15 active+clean+scrubbing+deep
>                    1 active+clean+inconsistent
>
> That's a lot of scrub errors:
>
> HEALTH_ERR 1 pgs inconsistent; 18446 scrub errors
> pg 1.65 is active+clean+inconsistent, acting [62,67,33]
> 18446 scrub errors
>
> Now, "rados list-inconsistent-obj 1.65" returns a *very* long JSON
> output. Here's a very small snippet; the errors look the same across
> the board:
>
> {
>   "object": {
>     "name": "100000ea8bb.00000045",
>     "nspace": "",
>     "locator": "",
>     "snap": "head",
>     "version": 59538
>   },
>   "errors": ["attr_name_mismatch"],
>   "union_shard_errors": ["oi_attr_missing"],
>   "selected_object_info": "1:a70dc1cc:::100000ea8bb.00000045:head(2897'59538 client.4895965.0:462007 dirty|data_digest|omap_digest s 4194304 uv 59538 dd f437a612 od ffffffff alloc_hint [0 0])",
>   "shards": [
>     {
>       "osd": 33,
>       "errors": [],
>       "size": 4194304,
>       "omap_digest": "0xffffffff",
>       "data_digest": "0xf437a612",
>       "attrs": [
>         {"name": "_", "value": "EAgNAQAABAM1AA...", "Base64": true},
>         {"name": "snapset", "value": "AgIZAAAAAQAAAA...", "Base64": true}
>       ]
>     },
>     {
>       "osd": 62,
>       "errors": [],
>       "size": 4194304,
>       "omap_digest": "0xffffffff",
>       "data_digest": "0xf437a612",
>       "attrs": [
>         {"name": "_", "value": "EAgNAQAABAM1AA...", "Base64": true},
>         {"name": "snapset", "value": "AgIZAAAAAQAAAA...", "Base64": true}
>       ]
>     },
>     {
>       "osd": 67,
>       "errors": ["oi_attr_missing"],
>       "size": 4194304,
>       "omap_digest": "0xffffffff",
>       "data_digest": "0xf437a612",
>       "attrs": []
>     }
>   ]
> }
>
> Clearly, on osd.67, the "attrs" array is empty. The question is,
> how do I fix this?
>
> Many thanks in advance,
>
> -kc
>
> K.C. Wong
> kcwong@xxxxxxxxxxx
> M: +1 (408) 769-8235
>
> -----------------------------------------------------
> Confidentiality Notice:
> This message contains confidential information. If you are not the
> intended recipient and received this message in error, any use or
> distribution is strictly prohibited. Please also notify us
> immediately by return e-mail, and delete this message from your
> computer system. Thank you.
> -----------------------------------------------------
>
> 4096R/B8995EDE E527 CBE8 023E 79EA 8BBB 5C77 23A6 92E9 B899 5EDE
>
> hkps://hkps.pool.sks-keyservers.net
>
> On Nov 11, 2018, at 10:58 PM, Brad Hubbard <bhubbard@xxxxxxxxxx> wrote:
>
> On Mon, Nov 12, 2018 at 4:21 PM Ashley Merrick <singapore@xxxxxxxxxxxxxx> wrote:
> >
> > You'd need to run "ceph pg deep-scrub 1.65" first
> >
> Right, thanks Ashley. That's what the "Note that you may have to do a
> deep scrub to populate the output" part of my answer meant, but
> perhaps I needed to go further?
>
> The system has a record of a scrub error from a previous scan, but
> subsequent activity in the cluster has invalidated the specifics. You
> need to run another scrub to get the specific information for this PG
> at this point in time (the information does not remain valid
> indefinitely and may therefore need to be refreshed, depending on
> circumstances).
>
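
In other words, something along these lines (a sketch only; the --format flag
is just there to make the JSON readable and can be dropped if your rados build
doesn't accept it):

# kick off a fresh deep scrub of the PG
ceph pg deep-scrub 1.65

# wait for it to finish -- the timestamp below should move forward once done
ceph pg 1.65 query | grep last_deep_scrub_stamp

# only then will the inconsistency report be populated again
rados list-inconsistent-obj 1.65 --format=json-pretty
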
> On Mon, Nov 12, 2018 at 2:20 PM K.C. Wong <kcwong@xxxxxxxxxxx> wrote:
>
> Hi Brad,
>
> I got the following:
>
> [root@mgmt01 ~]# ceph health detail
> HEALTH_ERR 1 pgs inconsistent; 1 scrub errors
> pg 1.65 is active+clean+inconsistent, acting [62,67,47]
> 1 scrub errors
> [root@mgmt01 ~]# rados list-inconsistent-obj 1.65
> No scrub information available for pg 1.65
> error 2: (2) No such file or directory
> [root@mgmt01 ~]# rados list-inconsistent-snapset 1.65
> No scrub information available for pg 1.65
> error 2: (2) No such file or directory
>
> Rather odd output, I'd say; not that I understand what
> that means. I also tried rados list-inconsistent-pg:
>
> [root@mgmt01 ~]# rados lspools
> rbd
> cephfs_data
> cephfs_metadata
> .rgw.root
> default.rgw.control
> default.rgw.data.root
> default.rgw.gc
> default.rgw.log
> ctrl-p
> prod
> corp
> camp
> dev
> default.rgw.users.uid
> default.rgw.users.keys
> default.rgw.buckets.index
> default.rgw.buckets.data
> default.rgw.buckets.non-ec
> [root@mgmt01 ~]# for i in $(rados lspools); do rados list-inconsistent-pg $i; done
> []
> ["1.65"]
> []
> []
> []
> []
> []
> []
> []
> []
> []
> []
> []
> []
> []
> []
> []
> []
>
> So, that'd put the inconsistency in the cephfs_data pool.
>
> Thank you for your help,
>
> -kc
>
> K.C. Wong
> kcwong@xxxxxxxxxxx
> M: +1 (408) 769-8235
>
> -----------------------------------------------------
> Confidentiality Notice:
> This message contains confidential information. If you are not the
> intended recipient and received this message in error, any use or
> distribution is strictly prohibited. Please also notify us
> immediately by return e-mail, and delete this message from your
> computer system. Thank you.
> -----------------------------------------------------
>
> 4096R/B8995EDE E527 CBE8 023E 79EA 8BBB 5C77 23A6 92E9 B899 5EDE
>
> hkps://hkps.pool.sks-keyservers.net
>
> On Nov 11, 2018, at 5:43 PM, Brad Hubbard <bhubbard@xxxxxxxxxx> wrote:
>
> What does "rados list-inconsistent-obj <pg>" say?
>
> Note that you may have to do a deep scrub to populate the output.
> On Mon, Nov 12, 2018 at 5:10 AM K.C. Wong <kcwong@xxxxxxxxxxx> wrote:
>
> Hi folks,
>
> I would appreciate any pointer as to how I can resolve a
> PG stuck in the "active+clean+inconsistent" state. This has
> resulted in HEALTH_ERR status for the last 5 days with no
> end in sight. The state got triggered when one of the drives
> in the PG returned an I/O error. I've since replaced the failed
> drive.
>
> I'm running Jewel (out of centos-release-ceph-jewel) on
> CentOS 7. I've tried "ceph pg repair <pg>" and it didn't seem
> to do anything. I've tried even more drastic measures, such as
> comparing all the files (using filestore) under that PG_head
> on all 3 copies and then nuking the outlier. Nothing worked.
>
> Many thanks,
>
> -kc
>
> K.C. Wong
> kcwong@xxxxxxxxxxx
> M: +1 (408) 769-8235
>
> -----------------------------------------------------
> Confidentiality Notice:
> This message contains confidential information. If you are not the
> intended recipient and received this message in error, any use or
> distribution is strictly prohibited. Please also notify us
> immediately by return e-mail, and delete this message from your
> computer system. Thank you.
> -----------------------------------------------------
> 4096R/B8995EDE E527 CBE8 023E 79EA 8BBB 5C77 23A6 92E9 B899 5EDE
> hkps://hkps.pool.sks-keyservers.net
>
> _______________________________________________
> ceph-users mailing list
> ceph-users@xxxxxxxxxxxxxx
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>
> --
> Cheers,
> Brad
>
> _______________________________________________
> ceph-users mailing list
> ceph-users@xxxxxxxxxxxxxx
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com

--
Cheers,
Brad
_______________________________________________
ceph-users mailing list
ceph-users@xxxxxxxxxxxxxx
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
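
Coming back to the file-level comparison mentioned in the original message:
since the shard error is oi_attr_missing, what differs between the copies is
the xattrs rather than the file contents. A rough, untested way to confirm
what the scrub is reporting (assuming default filestore paths and an
XFS-backed filestore; the exact escaped filename on disk will differ):

# on the host carrying osd.67, locate the object's file under the PG's
# head directory
find /var/lib/ceph/osd/ceph-67/current/1.65_head/ -name '*100000ea8bb.00000045*'

# list that file's xattrs (same path as printed by the find above); on the
# healthy replicas this should show "ceph._" (the object info) and
# "ceph.snapset", while on osd.67 the "ceph._" entry would be the one missing
attr -l <path printed by the find above>
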