You could try a 'rados get' and then a 'rados put' on the object to start with.
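
As a rough, untested sketch (assuming the object really does live in the
cephfs_data pool, as worked out further down the thread, and using the object
name reported by the scrub), the round trip would look something like:

# read the object out and write the same bytes straight back; a full rewrite
# should regenerate the object-info ("_") xattr that osd.67 is missing
rados -p cephfs_data get 100000ea8bb.00000045 /tmp/100000ea8bb.00000045
rados -p cephfs_data put 100000ea8bb.00000045 /tmp/100000ea8bb.00000045

# then deep-scrub the PG again and check whether the inconsistency clears
ceph pg deep-scrub 1.65
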
On Thu, Nov 15, 2018 at 4:07 AM K.C. Wong <kcwong@xxxxxxxxxxx> wrote:
>
> So, I've issued the deep-scrub command (and the repair command)
> and nothing seems to happen.
> Unrelated to this issue, I have to take down some OSDs to prepare
> a host for RMA. One of them happens to be in the replication
> group for this PG. So, a scrub happened indirectly. I now have
> this from "ceph -s":
>
>     cluster 374aed9e-5fc1-47e1-8d29-4416f7425e76
>      health HEALTH_ERR
>             1 pgs inconsistent
>             18446 scrub errors
>      monmap e1: 3 mons at {mgmt01=10.0.1.1:6789/0,mgmt02=10.1.1.1:6789/0,mgmt03=10.2.1.1:6789/0}
>             election epoch 252, quorum 0,1,2 mgmt01,mgmt02,mgmt03
>       fsmap e346: 1/1/1 up {0=mgmt01=up:active}, 2 up:standby
>      osdmap e40248: 120 osds: 119 up, 119 in
>             flags sortbitwise,require_jewel_osds
>       pgmap v22025963: 3136 pgs, 18 pools, 18975 GB data, 214 Mobjects
>             59473 GB used, 287 TB / 345 TB avail
>                 3120 active+clean
>                   15 active+clean+scrubbing+deep
>                    1 active+clean+inconsistent
>
> That's a lot of scrub errors:
>
> HEALTH_ERR 1 pgs inconsistent; 18446 scrub errors
> pg 1.65 is active+clean+inconsistent, acting [62,67,33]
> 18446 scrub errors
>
> Now, "rados list-inconsistent-obj 1.65" returns a *very* long JSON
> output. Here's a very small snippet; the errors look the same across
> the board:
>
> {
>   "object": {
>     "name": "100000ea8bb.00000045",
>     "nspace": "",
>     "locator": "",
>     "snap": "head",
>     "version": 59538
>   },
>   "errors": ["attr_name_mismatch"],
>   "union_shard_errors": ["oi_attr_missing"],
>   "selected_object_info": "1:a70dc1cc:::100000ea8bb.00000045:head(2897'59538 client.4895965.0:462007 dirty|data_digest|omap_digest s 4194304 uv 59538 dd f437a612 od ffffffff alloc_hint [0 0])",
>   "shards": [
>     {
>       "osd": 33,
>       "errors": [],
>       "size": 4194304,
>       "omap_digest": "0xffffffff",
>       "data_digest": "0xf437a612",
>       "attrs": [
>         {"name": "_", "value": "EAgNAQAABAM1AA...", "Base64": true},
>         {"name": "snapset", "value": "AgIZAAAAAQAAAA...", "Base64": true}
>       ]
>     },
>     {
>       "osd": 62,
>       "errors": [],
>       "size": 4194304,
>       "omap_digest": "0xffffffff",
>       "data_digest": "0xf437a612",
>       "attrs": [
>         {"name": "_", "value": "EAgNAQAABAM1AA...", "Base64": true},
>         {"name": "snapset", "value": "AgIZAAAAAQAAAA...", "Base64": true}
>       ]
>     },
>     {
>       "osd": 67,
>       "errors": ["oi_attr_missing"],
>       "size": 4194304,
>       "omap_digest": "0xffffffff",
>       "data_digest": "0xf437a612",
>       "attrs": []
>     }
>   ]
> }
>
> Clearly, on osd.67, the "attrs" array is empty. The question is,
> how do I fix this?
>
> Many thanks in advance,
>
> -kc
>
> K.C. Wong
> kcwong@xxxxxxxxxxx
> M: +1 (408) 769-8235
>
> -----------------------------------------------------
> Confidentiality Notice:
> This message contains confidential information. If you are not the
> intended recipient and received this message in error, any use or
> distribution is strictly prohibited. Please also notify us
> immediately by return e-mail, and delete this message from your
> computer system. Thank you.
> -----------------------------------------------------
>
> 4096R/B8995EDE E527 CBE8 023E 79EA 8BBB 5C77 23A6 92E9 B899 5EDE
>
> hkps://hkps.pool.sks-keyservers.net
>
> On Nov 11, 2018, at 10:58 PM, Brad Hubbard <bhubbard@xxxxxxxxxx> wrote:
>
> On Mon, Nov 12, 2018 at 4:21 PM Ashley Merrick <singapore@xxxxxxxxxxxxxx> wrote:
> >
> > You'd need to run "ceph pg deep-scrub 1.65" first
> >
> Right, thanks Ashley. That's what the "Note that you may have to do a
> deep scrub to populate the output" part of my answer meant, but
> perhaps I needed to go further?
>
> The system has a record of a scrub error from a previous scan, but
> subsequent activity in the cluster has invalidated the specifics. You
> need to run another scrub to get the specific information for this PG
> at this point in time (the information does not remain valid
> indefinitely and may therefore need to be refreshed, depending on
> circumstances).
>
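
In other words, something along these lines (a sketch only; the --format flag
is just there to make the JSON readable and can be dropped if your rados build
doesn't accept it):

# kick off a fresh deep scrub of the PG
ceph pg deep-scrub 1.65

# wait for it to finish -- the timestamp below should move forward once done
ceph pg 1.65 query | grep last_deep_scrub_stamp

# only then will the inconsistency report be populated again
rados list-inconsistent-obj 1.65 --format=json-pretty
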
> On Mon, Nov 12, 2018 at 2:20 PM K.C. Wong <kcwong@xxxxxxxxxxx> wrote:
>
> Hi Brad,
>
> I got the following:
>
> [root@mgmt01 ~]# ceph health detail
> HEALTH_ERR 1 pgs inconsistent; 1 scrub errors
> pg 1.65 is active+clean+inconsistent, acting [62,67,47]
> 1 scrub errors
> [root@mgmt01 ~]# rados list-inconsistent-obj 1.65
> No scrub information available for pg 1.65
> error 2: (2) No such file or directory
> [root@mgmt01 ~]# rados list-inconsistent-snapset 1.65
> No scrub information available for pg 1.65
> error 2: (2) No such file or directory
>
> Rather odd output, I'd say; not that I understand what
> that means. I also tried rados list-inconsistent-pg:
>
> [root@mgmt01 ~]# rados lspools
> rbd
> cephfs_data
> cephfs_metadata
> .rgw.root
> default.rgw.control
> default.rgw.data.root
> default.rgw.gc
> default.rgw.log
> ctrl-p
> prod
> corp
> camp
> dev
> default.rgw.users.uid
> default.rgw.users.keys
> default.rgw.buckets.index
> default.rgw.buckets.data
> default.rgw.buckets.non-ec
> [root@mgmt01 ~]# for i in $(rados lspools); do rados list-inconsistent-pg $i; done
> []
> ["1.65"]
> []
> []
> []
> []
> []
> []
> []
> []
> []
> []
> []
> []
> []
> []
> []
> []
>
> So, that'd put the inconsistency in the cephfs_data pool.
>
> Thank you for your help,
>
> -kc
>
> K.C. Wong
> kcwong@xxxxxxxxxxx
> M: +1 (408) 769-8235
>
> -----------------------------------------------------
> Confidentiality Notice:
> This message contains confidential information. If you are not the
> intended recipient and received this message in error, any use or
> distribution is strictly prohibited. Please also notify us
> immediately by return e-mail, and delete this message from your
> computer system. Thank you.
> -----------------------------------------------------
>
> 4096R/B8995EDE E527 CBE8 023E 79EA 8BBB 5C77 23A6 92E9 B899 5EDE
>
> hkps://hkps.pool.sks-keyservers.net
>
> On Nov 11, 2018, at 5:43 PM, Brad Hubbard <bhubbard@xxxxxxxxxx> wrote:
>
> What does "rados list-inconsistent-obj <pg>" say?
>
> Note that you may have to do a deep scrub to populate the output.
> On Mon, Nov 12, 2018 at 5:10 AM K.C. Wong <kcwong@xxxxxxxxxxx> wrote:
>
> Hi folks,
>
> I would appreciate any pointer as to how I can resolve a
> PG stuck in the "active+clean+inconsistent" state. This has
> resulted in HEALTH_ERR status for the last 5 days with no
> end in sight. The state got triggered when one of the drives
> in the PG returned an I/O error. I've since replaced the failed
> drive.
>
> I'm running Jewel (out of centos-release-ceph-jewel) on
> CentOS 7. I've tried "ceph pg repair <pg>" and it didn't seem
> to do anything. I've tried even more drastic measures, such as
> comparing all the files (using filestore) under that PG_head
> on all 3 copies and then nuking the outlier. Nothing worked.
>
> Many thanks,
>
> -kc
>
> K.C. Wong
> kcwong@xxxxxxxxxxx
> M: +1 (408) 769-8235
>
> -----------------------------------------------------
> Confidentiality Notice:
> This message contains confidential information. If you are not the
> intended recipient and received this message in error, any use or
> distribution is strictly prohibited. Please also notify us
> immediately by return e-mail, and delete this message from your
> computer system. Thank you.
> -----------------------------------------------------
> 4096R/B8995EDE E527 CBE8 023E 79EA 8BBB 5C77 23A6 92E9 B899 5EDE
> hkps://hkps.pool.sks-keyservers.net
>
> _______________________________________________
> ceph-users mailing list
> ceph-users@xxxxxxxxxxxxxx
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>
> --
> Cheers,
> Brad
>
> _______________________________________________
> ceph-users mailing list
> ceph-users@xxxxxxxxxxxxxx
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com

--
Cheers,
Brad
_______________________________________________
ceph-users mailing list
ceph-users@xxxxxxxxxxxxxx
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
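
Coming back to the file-level comparison mentioned in the original message:
since the shard error is oi_attr_missing, what differs between the copies is
the xattrs rather than the file contents. A rough, untested way to confirm
what the scrub is reporting (assuming default filestore paths and an
XFS-backed filestore; the exact escaped filename on disk will differ):

# on the host carrying osd.67, locate the object's file under the PG's
# head directory
find /var/lib/ceph/osd/ceph-67/current/1.65_head/ -name '*100000ea8bb.00000045*'

# list that file's xattrs (same path as printed by the find above); on the
# healthy replicas this should show "ceph._" (the object info) and
# "ceph.snapset", while on osd.67 the "ceph._" entry would be the one missing
attr -l <path printed by the find above>
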