Plus do this as well:

# rados list-inconsistent-obj ${PG ID}

(A rough, untested consolidation of Brad's xattr/ceph-dencoder steps, plus a quick chunk-count sanity check, is appended at the bottom of this mail.)

On Fri, Dec 23, 2016 at 7:08 PM, Brad Hubbard <bhubbard@xxxxxxxxxx> wrote:
> Could you also try this?
>
> $ attr -l ./DIR_1/DIR_F/DIR_3/DIR_9/DIR_8/1000187bb70.00000009__head_EED893F1__6
>
> Take note of any of ceph._, ceph._@1, ceph._@2, etc.
>
> For me on my test cluster it looks like this.
>
> $ attr -l dev/osd1/current/0.3_head/benchmark\\udata\\urskikr.localdomain\\u16952\\uobject99__head_2969453B__0
> Attribute "cephos.spill_out" has a 2 byte value for
> dev/osd1/current/0.3_head/benchmark\udata\urskikr.localdomain\u16952\uobject99__head_2969453B__0
> Attribute "ceph._" has a 250 byte value for
> dev/osd1/current/0.3_head/benchmark\udata\urskikr.localdomain\u16952\uobject99__head_2969453B__0
> Attribute "ceph.snapset" has a 31 byte value for
> dev/osd1/current/0.3_head/benchmark\udata\urskikr.localdomain\u16952\uobject99__head_2969453B__0
> Attribute "ceph._@1" has a 53 byte value for
> dev/osd1/current/0.3_head/benchmark\udata\urskikr.localdomain\u16952\uobject99__head_2969453B__0
> Attribute "selinux" has a 37 byte value for
> dev/osd1/current/0.3_head/benchmark\udata\urskikr.localdomain\u16952\uobject99__head_2969453B__0
>
> Then dump out ceph._ to a file and append all ceph._@X attributes like so.
>
> $ attr -q -g ceph._ dev/osd1/current/0.3_head/benchmark\\udata\\urskikr.localdomain\\u16952\\uobject99__head_2969453B__0 > /tmp/attr1
> $ attr -q -g ceph._@1 dev/osd1/current/0.3_head/benchmark\\udata\\urskikr.localdomain\\u16952\\uobject99__head_2969453B__0 >> /tmp/attr1
>
> Note the ">>" on the second command to append the output, not
> overwrite. Do this for each ceph._@X attribute.
>
> Then display the file as an object_info_t structure and check the size value.
>
> $ bin/ceph-dencoder type object_info_t import /tmp/attr1 decode dump_json
> {
>     "oid": {
>         "oid": "benchmark_data_rskikr.localdomain_16952_object99",
>         "key": "",
>         "snapid": -2,
>         "hash": 694764859,
>         "max": 0,
>         "pool": 0,
>         "namespace": ""
>     },
>     "version": "9'19",
>     "prior_version": "0'0",
>     "last_reqid": "client.4110.0:100",
>     "user_version": 19,
>     "size": 4194304,
>     "mtime": "2016-12-23 19:13:57.012681",
>     "local_mtime": "2016-12-23 19:13:57.032306",
>     "lost": 0,
>     "flags": 52,
>     "snaps": [],
>     "truncate_seq": 0,
>     "truncate_size": 0,
>     "data_digest": 2293522445,
>     "omap_digest": 4294967295,
>     "expected_object_size": 4194304,
>     "expected_write_size": 4194304,
>     "alloc_hint_flags": 53,
>     "watchers": {}
> }
>
> Depending on the output, one method for fixing this may be to use a
> binary editing technique such as laid out in
> https://www.spinics.net/lists/ceph-devel/msg16519.html to set the size
> value to zero. Your target value is 1c0000.
>
> $ printf '%x\n' 1835008
> 1c0000
>
> Make sure you check it is right before injecting it back in with "attr -s".
>
> What version is this? Did you look for a similar bug on the tracker?
>
> HTH.
>
> --
> Cheers,
> Brad
>
> On Fri, Dec 23, 2016 at 4:27 PM, Shinobu Kinjo <skinjo@xxxxxxxxxx> wrote:
>> Would you be able to execute ``ceph pg ${PG ID} query`` against that
>> particular PG?
>>
>> On Wed, Dec 21, 2016 at 11:44 PM, Andras Pataki
>> <apataki@xxxxxxxxxxxxxxxxxxxx> wrote:
>>> Yes, size = 3, and I have checked that all three replicas are the same
>>> zero-length object on the disk.  I think some metadata info is mismatching
>>> what the OSD log refers to as "object info size".  But I'm not sure what to
>>> do about it.  pg repair does not fix it.
>>> In fact, the file this object corresponds to in CephFS is shorter, so this
>>> chunk shouldn't even exist, I think (details are in the original email).
>>> Although I may be understanding the situation wrong ...
>>>
>>> Andras
>>>
>>>
>>> On 12/21/2016 07:17 AM, Mehmet wrote:
>>>
>>> Hi Andras,
>>>
>>> I am not the most experienced user, but I guess you could have a look at
>>> this object on each related OSD for the PG, compare them, and delete the
>>> differing object. I assume you have size = 3.
>>>
>>> Then run pg repair again.
>>>
>>> But be careful; IIRC the replica will be recovered from the primary PG.
>>>
>>> HTH
>>>
>>> On 20 December 2016 22:39:44 CET, Andras Pataki
>>> <apataki@xxxxxxxxxxxxxxxxxxxx> wrote:
>>>>
>>>> Hi cephers,
>>>>
>>>> Any ideas on how to proceed on the inconsistencies below?  At the moment
>>>> our ceph setup has 5 of these - in all cases it seems like some zero-length
>>>> objects that match across the three replicas, but do not match the object
>>>> info size.  I tried running pg repair on one of them, but it didn't repair
>>>> the problem:
>>>>
>>>> 2016-12-20 16:24:40.870307 7f3e1a4b1700  0 log_channel(cluster) log [INF] : 6.92c repair starts
>>>> 2016-12-20 16:27:06.183186 7f3e1a4b1700 -1 log_channel(cluster) log [ERR] : repair 6.92c 6:34932257:::1000187bbb5.00000009:head on disk size (0) does not match object info size (3014656) adjusted for ondisk to (3014656)
>>>> 2016-12-20 16:27:35.885496 7f3e17cac700 -1 log_channel(cluster) log [ERR] : 6.92c repair 1 errors, 0 fixed
>>>>
>>>> Any help/hints would be appreciated.
>>>>
>>>> Thanks,
>>>>
>>>> Andras
>>>>
>>>>
>>>> On 12/15/2016 10:13 AM, Andras Pataki wrote:
>>>>
>>>> Hi everyone,
>>>>
>>>> Yesterday scrubbing turned up an inconsistency in one of our placement
>>>> groups.  We are running ceph 10.2.3, using CephFS and RBD for some VM
>>>> images.
>>>>
>>>> [root@hyperv017 ~]# ceph -s
>>>>     cluster d7b33135-0940-4e48-8aa6-1d2026597c2f
>>>>      health HEALTH_ERR
>>>>             1 pgs inconsistent
>>>>             1 scrub errors
>>>>             noout flag(s) set
>>>>      monmap e15: 3 mons at {hyperv029=10.4.36.179:6789/0,hyperv030=10.4.36.180:6789/0,hyperv031=10.4.36.181:6789/0}
>>>>             election epoch 27192, quorum 0,1,2 hyperv029,hyperv030,hyperv031
>>>>       fsmap e17181: 1/1/1 up {0=hyperv029=up:active}, 2 up:standby
>>>>      osdmap e342930: 385 osds: 385 up, 385 in
>>>>             flags noout
>>>>       pgmap v37580512: 34816 pgs, 5 pools, 673 TB data, 198 Mobjects
>>>>             1583 TB used, 840 TB / 2423 TB avail
>>>>                34809 active+clean
>>>>                    4 active+clean+scrubbing+deep
>>>>                    2 active+clean+scrubbing
>>>>                    1 active+clean+inconsistent
>>>>   client io 87543 kB/s rd, 671 MB/s wr, 23 op/s rd, 2846 op/s wr
>>>>
>>>> # ceph pg dump | grep inconsistent
>>>> 6.13f1  4692  0  0  0  0  16057314767  3087  3087  active+clean+inconsistent  2016-12-14 16:49:48.391572  342929'41011  342929:43966  [158,215,364]  158  [158,215,364]  158  342928'40540  2016-12-14 16:49:48.391511  342928'40540  2016-12-14 16:49:48.391511
>>>>
>>>> I tried a couple of other deep scrubs on pg 6.13f1 but got repeated
>>>> errors.  In the OSD logs:
>>>>
>>>> 2016-12-14 16:48:07.733291 7f3b56e3a700 -1 log_channel(cluster) log [ERR] : deep-scrub 6.13f1 6:8fc91b77:::1000187bb70.00000009:head on disk size (0) does not match object info size (1835008) adjusted for ondisk to (1835008)
>>>>
>>>> I looked at the objects on the 3 OSDs on their respective hosts and they
>>>> are the same, zero-length files:
>>>>
>>>> # cd ~ceph/osd/ceph-158/current/6.13f1_head
>>>> # find . -name *1000187bb70* -ls
>>>> 669738    0 -rw-r--r--   1 ceph     ceph    0 Dec 13 17:00 ./DIR_1/DIR_F/DIR_3/DIR_9/DIR_8/1000187bb70.00000009__head_EED893F1__6
>>>>
>>>> # cd ~ceph/osd/ceph-215/current/6.13f1_head
>>>> # find . -name *1000187bb70* -ls
>>>> 539815647    0 -rw-r--r--   1 ceph     ceph    0 Dec 13 17:00 ./DIR_1/DIR_F/DIR_3/DIR_9/DIR_8/1000187bb70.00000009__head_EED893F1__6
>>>>
>>>> # cd ~ceph/osd/ceph-364/current/6.13f1_head
>>>> # find . -name *1000187bb70* -ls
>>>> 1881432215    0 -rw-r--r--   1 ceph     ceph    0 Dec 13 17:00 ./DIR_1/DIR_F/DIR_3/DIR_9/DIR_8/1000187bb70.00000009__head_EED893F1__6
>>>>
>>>> At the time of the write, there wasn't anything unusual going on as far as
>>>> I can tell (no hardware/network issues, all processes were up, etc).
>>>>
>>>> This pool is a CephFS data pool, and the corresponding file (inode hex
>>>> 1000187bb70, decimal 1099537300336) looks like this:
>>>>
>>>> # ls -li chr4.tags.tsv
>>>> 1099537300336 -rw-r--r-- 1 xichen xichen 14469915 Dec 13 17:01 chr4.tags.tsv
>>>>
>>>> Reading the file is also ok (no errors, right number of bytes):
>>>> # cat chr4.tags.tsv > /dev/null
>>>> # wc chr4.tags.tsv
>>>>   592251  2961255 14469915 chr4.tags.tsv
>>>>
>>>> We are using the standard 4MB block size for CephFS, and if I interpret
>>>> this right, this is the 9th chunk, so there shouldn't be any data (or even
>>>> a 9th chunk), since the file is only 14MB.  Should I run pg repair on this?
>>>> Any ideas on how this could come about?  Any other recommendations?
>>>>
>>>> Thanks,
>>>>
>>>> Andras
>>>> apataki@xxxxxxxxxxx
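
As promised above, here is a rough, untested sketch that just strings Brad's steps together for one on-disk replica: it dumps ceph._ plus any ceph._@N continuation attributes in order and decodes the result as an object_info_t so you can eyeball the "size" field. The object path and output file are placeholders taken from this thread; on a source build the dencoder may be bin/ceph-dencoder as in Brad's example. Treat it as a starting point only.

#!/bin/bash
# Sketch only, untested.  Run on the OSD host from the PG's _head directory.
# OBJ is the on-disk object file from the scrub error (placeholder path).
OBJ="./DIR_1/DIR_F/DIR_3/DIR_9/DIR_8/1000187bb70.00000009__head_EED893F1__6"
OUT=/tmp/attr1

# ceph._ first, then ceph._@1, ceph._@2, ... appended in order
attr -q -g ceph._ "$OBJ" > "$OUT"
n=1
while attr -l "$OBJ" | grep -qF "\"ceph._@$n\""; do
    attr -q -g "ceph._@$n" "$OBJ" >> "$OUT"
    n=$((n + 1))
done

# decode and check the "size" field against what scrub reports
ceph-dencoder type object_info_t import "$OUT" decode dump_json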
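
And a quick back-of-the-envelope check of Andras's point that this object sits past the end of the file, using only the numbers from his mail (14469915-byte file, default 4 MiB CephFS object size; the .00000009 suffix is a zero-based object index):

#!/bin/bash
# numbers taken from the thread, nothing cluster-specific here
FILESIZE=14469915                  # bytes, from "ls -li chr4.tags.tsv"
OBJSIZE=$((4 * 1024 * 1024))       # 4194304, the default CephFS object size

LAST=$(( (FILESIZE - 1) / OBJSIZE ))    # zero-based index of the last object
printf 'last object index: %d (suffix %08x), %d objects in total\n' \
    "$LAST" "$LAST" $((LAST + 1))
# -> last object index: 3 (suffix 00000003), 4 objects in total

So only 1000187bb70.00000000 through .00000003 should be backing that inode, which fits Andras's suspicion that the .00000009 object shouldn't be there at all. Whether the right fix is Brad's binary edit of the object-info size or something else, I'll leave to people who know this code better.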