Re: How to repair active+clean+inconsistent?

"K.C. Wong" <kcwong@xxxxxxxxxxx> · Wed, 14 Nov 2018 10:07:08 -0800

So, I’ve issued the deep-scrub command (and the repair command) and nothing seems to happen.
Unrelated to this issue, I have to take down some OSDs to prepare
a host for RMA. One of them happens to be in the replication
group for this PG. So, a scrub happened indirectly. I now have
this from “ceph -s”:

    cluster 374aed9e-5fc1-47e1-8d29-4416f7425e76
     health HEALTH_ERR
            1 pgs inconsistent
            18446 scrub errors
     monmap e1: 3 mons at {mgmt01=10.0.1.1:6789/0,mgmt02=10.1.1.1:6789/0,mgmt03=10.2.1.1:6789/0}
            election epoch 252, quorum 0,1,2 mgmt01,mgmt02,mgmt03
      fsmap e346: 1/1/1 up {0=mgmt01=up:active}, 2 up:standby
     osdmap e40248: 120 osds: 119 up, 119 in
            flags sortbitwise,require_jewel_osds
      pgmap v22025963: 3136 pgs, 18 pools, 18975 GB data, 214 Mobjects
            59473 GB used, 287 TB / 345 TB avail
                3120 active+clean
                  15 active+clean+scrubbing+deep
                   1 active+clean+inconsistent

That’s a lot of scrub errors:

HEALTH_ERR 1 pgs inconsistent; 18446 scrub errors
pg 1.65 is active+clean+inconsistent, acting [62,67,33]
18446 scrub errors

Now, “rados list-inconsistent-obj 1.65” returns a *very* long JSON
output. Here’s a very small snippet; the errors look the same across all objects:

{
  "object":{
    "name":"100000ea8bb.00000045",
    "nspace":"",
    "locator":"",
    "snap":"head",
    "version":59538
  },
  "errors":["attr_name_mismatch"],
  "union_shard_errors":["oi_attr_missing"],
  "selected_object_info":"1:a70dc1cc:::100000ea8bb.00000045:head(2897'59538 client.4895965.0:462007 dirty|data_digest|omap_digest s 4194304 uv 59538 dd f437a612 od ffffffff alloc_hint [0 0])",
  "shards":[
    {
      "osd":33,
      "errors":[],
      "size":4194304,
      "omap_digest":"0xffffffff",
      "data_digest":"0xf437a612",
      "attrs":[
        {"name":"_",
         "value":"EAgNAQAABAM1AA...",
         "Base64":true},
        {"name":"snapset",
         "value":"AgIZAAAAAQAAAA...",
         "Base64":true}
      ]
    },
    {
      "osd":62,
      "errors":[],
      "size":4194304,
      "omap_digest":"0xffffffff",
      "data_digest":"0xf437a612",
      "attrs":[
        {"name":"_",
         "value":"EAgNAQAABAM1AA...",
         "Base64":true},
        {"name":"snapset",
         "value":"AgIZAAAAAQAAAA...",
         "Base64":true}
      ]
    },
    {
      "osd":67,
      "errors":["oi_attr_missing"],
      "size":4194304,
      "omap_digest":"0xffffffff",
      "data_digest":"0xf437a612",
      "attrs":[]
    }
  ]
}

Clearly, on osd.67, the “attrs” array is empty. The question is,
how do I fix this?
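
Since the JSON is too long to read by eye, a small script can confirm whether every error really points at the same shard. The following is only a sketch, not a repair procedure: the function name summarize_inconsistencies is made up for illustration, and it assumes the report is either a single object entry like the snippet above, a list of such entries, or a wrapper with an "inconsistents" list (the exact top-level layout may differ between Ceph releases).

```python
import json
from collections import Counter


def summarize_inconsistencies(report):
    """Count shard-level errors per (osd, error) pair.

    `report` may be a JSON string or already-parsed data. The accepted
    top-level layouts are an assumption, not a documented contract.
    """
    data = json.loads(report) if isinstance(report, str) else report
    if isinstance(data, dict) and "inconsistents" in data:
        entries = data["inconsistents"]   # wrapper object holding a list
    elif isinstance(data, dict):
        entries = [data]                  # a single object entry
    else:
        entries = data                    # a bare list of entries

    counts = Counter()
    for entry in entries:
        for shard in entry.get("shards", []):
            for err in shard.get("errors", []):
                counts[(shard["osd"], err)] += 1
    return counts


# Trimmed-down version of the entry quoted above (attrs omitted).
sample = {
    "object": {"name": "100000ea8bb.00000045", "snap": "head"},
    "errors": ["attr_name_mismatch"],
    "union_shard_errors": ["oi_attr_missing"],
    "shards": [
        {"osd": 33, "errors": []},
        {"osd": 62, "errors": []},
        {"osd": 67, "errors": ["oi_attr_missing"]},
    ],
}

for (osd, err), n in sorted(summarize_inconsistencies(sample).items()):
    print(f"osd.{osd}: {err} x{n}")
```

If the full report shows all errors as oi_attr_missing on a single OSD, that at least narrows the problem to one replica.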

Many thanks in advance,

-kc

K.C. Wong
kcwong@xxxxxxxxxxx
M: +1 (408) 769-8235

-----------------------------------------------------
Confidentiality Notice:
This message contains confidential information. If you are not the
intended recipient and received this message in error, any use or
distribution is strictly prohibited. Please also notify us
immediately by return e-mail, and delete this message from your
computer system. Thank you.
-----------------------------------------------------
4096R/B8995EDE  E527 CBE8 023E 79EA 8BBB  5C77 23A6 92E9 B899 5EDE
hkps://hkps.pool.sks-keyservers.net

On Nov 11, 2018, at 10:58 PM, Brad Hubbard <bhubbard@xxxxxxxxxx> wrote:

On Mon, Nov 12, 2018 at 4:21 PM Ashley Merrick <singapore@xxxxxxxxxxxxxx> wrote:

You need to run "ceph pg deep-scrub 1.65" first

Right, thanks Ashley. That's what the "Note that you may have to do a
deep scrub to populate the output." part of my answer meant but
perhaps I needed to go further?

The system has a record of a scrub error on a previous scan but
subsequent activity in the cluster has invalidated the specifics. You
need to run another scrub to get the specific information for this pg
at this point in time (the information does not remain valid
indefinitely and therefore may need to be renewed depending on
circumstances).
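
The sequence described above (request a fresh deep scrub, then ask for the inconsistency details once they exist) could be scripted roughly as follows. This is a sketch under assumptions: the helper names run_cli and refresh_scrub_report are hypothetical, the polling interval is arbitrary, and it assumes rados exits non-zero while it still reports "No scrub information available". The deep-scrub command only queues a request; the OSD decides when the scrub actually runs.

```python
import subprocess
import time


def run_cli(argv):
    """Run a command and return (exit code, stdout)."""
    proc = subprocess.run(argv, capture_output=True, text=True)
    return proc.returncode, proc.stdout


def refresh_scrub_report(pgid, run=run_cli, attempts=30, delay=60.0,
                         sleep=time.sleep):
    """Request a deep scrub of `pgid`, then poll for the scrub details.

    Returns the list-inconsistent-obj output, or None if nothing became
    available within `attempts` polls.
    """
    run(["ceph", "pg", "deep-scrub", pgid])
    for _ in range(attempts):
        code, out = run(["rados", "list-inconsistent-obj", pgid,
                         "--format=json-pretty"])
        if code == 0:
            return out
        sleep(delay)  # scrub not finished (or not started) yet
    return None


# Example call on a node with admin access (not executed here):
#   report = refresh_scrub_report("1.65")
```

The run and sleep parameters exist only so the logic can be exercised without a cluster.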

On Mon, Nov 12, 2018 at 2:20 PM K.C. Wong <kcwong@xxxxxxxxxxx> wrote:

Hi Brad,

I got the following:

[root@mgmt01 ~]# ceph health detail
HEALTH_ERR 1 pgs inconsistent; 1 scrub errors
pg 1.65 is active+clean+inconsistent, acting [62,67,47]
1 scrub errors
[root@mgmt01 ~]# rados list-inconsistent-obj 1.65
No scrub information available for pg 1.65
error 2: (2) No such file or directory
[root@mgmt01 ~]# rados list-inconsistent-snapset 1.65
No scrub information available for pg 1.65
error 2: (2) No such file or directory

Rather odd output, I’d say; not that I understand what
that means. I also tried rados list-inconsistent-pg:

[root@mgmt01 ~]# rados lspools
rbd
cephfs_data
cephfs_metadata
.rgw.root
default.rgw.control
default.rgw.data.root
default.rgw.gc
default.rgw.log
ctrl-p
prod
corp
camp
dev
default.rgw.users.uid
default.rgw.users.keys
default.rgw.buckets.index
default.rgw.buckets.data
default.rgw.buckets.non-ec
[root@mgmt01 ~]# for i in $(rados lspools); do rados list-inconsistent-pg $i; done
[]
["1.65"]
[]
[]
[]
[]
[]
[]
[]
[]
[]
[]
[]
[]
[]
[]
[]
[]

So, that’d put the inconsistency in the cephfs_data pool.
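
Counting lines of [] by hand to find the matching pool is error-prone, so the pairing can be made explicit. A minimal sketch, assuming the outputs are collected in the same order as the pool names; the function name pools_with_inconsistent_pgs is made up for illustration, and the sample data is just the first three pools from the listing above.

```python
import json


def pools_with_inconsistent_pgs(pool_names, outputs):
    """Pair each pool with the JSON list printed for it; keep non-empty ones."""
    result = {}
    for name, line in zip(pool_names, outputs):
        pgs = json.loads(line)
        if pgs:
            result[name] = pgs
    return result


pools = ["rbd", "cephfs_data", "cephfs_metadata"]
outputs = ["[]", '["1.65"]', "[]"]
print(pools_with_inconsistent_pgs(pools, outputs))
```

The PG id itself carries the same information: the part before the dot is the numeric pool id.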

Thank you for your help,

-kc

On Nov 11, 2018, at 5:43 PM, Brad Hubbard <bhubbard@xxxxxxxxxx> wrote:

What does "rados list-inconsistent-obj <pg>" say?

Note that you may have to do a deep scrub to populate the output.

On Mon, Nov 12, 2018 at 5:10 AM K.C. Wong <kcwong@xxxxxxxxxxx> wrote:

Hi folks,

I would appreciate any pointer as to how I can resolve a
PG stuck in “active+clean+inconsistent” state. This has
resulted in HEALTH_ERR status for the last 5 days with no
end in sight. The state was triggered when one of the drives
in the PG returned an I/O error. I’ve since replaced the failed
drive.

I’m running Jewel (out of centos-release-ceph-jewel) on
CentOS 7. I’ve tried “ceph pg repair <pg>” and it didn’t seem
to do anything. I’ve tried even more drastic measures such as
comparing all the files (using filestore) under that PG_head
on all 3 copies and then nuking the outlier. Nothing worked.

Many thanks,

-kc

_______________________________________________
ceph-users mailing list
ceph-users@xxxxxxxxxxxxxx
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com

--
Cheers,
Brad