Re: Inconsistent PG won't repair

Hi Lincoln,
Yes, the object is 0 bytes on all OSDs, and it has the same filesystem
date/time on each. Before I removed the rbd image (migrated the disk to
a different pool) it was 4MB on all the OSDs and the md5 checksum was
the same everywhere, so it seems that only the metadata is inconsistent.
Thanks for your suggestion. I looked into this, thinking I might be able
to delete the object (since it's empty anyway), but I just get "file not
found":
~$ rados stat rbd_data.19cdf512ae8944a.000000000001bb56 --pool=tier3-rbd-3X
 error stat-ing
tier3-rbd-3X/rbd_data.19cdf512ae8944a.000000000001bb56: (2) No such
file or directory
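
In case it helps anyone searching the archives later, the
ceph-objectstore-tool invocation asked about below would look roughly
like this. This is only a sketch: the OSD id, data/journal paths, pool
id, and clone id (148d2, from the deep-scrub error in this thread) are
taken from the messages here and must be adapted; the tool operates on
a *stopped* OSD, and removing clone metadata is not easily reversible,
so back up the OSD data first.

```shell
# Sketch only -- adapt osd id, paths, pgid, object name and clone id.
OSD_PATH=/var/lib/ceph/osd/ceph-23

# The OSD must be stopped so the object store is quiescent.
sudo stop ceph-osd id=23          # or: systemctl stop ceph-osd@23

# Confirm the object's identity in the PG first.
sudo ceph-objectstore-tool --data-path "$OSD_PATH" \
    --journal-path "$OSD_PATH/journal" \
    --pgid 3.f05 --op list rbd_data.19cdf512ae8944a.000000000001bb56

# Remove the metadata entry for the missing clone (id 148d2).
sudo ceph-objectstore-tool --data-path "$OSD_PATH" \
    --journal-path "$OSD_PATH/journal" \
    'rbd_data.19cdf512ae8944a.000000000001bb56' \
    remove-clone-metadata 148d2

sudo start ceph-osd id=23
```

After doing this on each OSD in the acting set, a deep-scrub of the PG
would show whether the inconsistency has cleared.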

Regards,
Rich

On 21 October 2017 at 04:32, Lincoln Bryant <lincolnb@xxxxxxxxxxxx> wrote:
> Hi Rich,
>
> Is the object inconsistent and 0-bytes on all OSDs?
>
> We ran into a similar issue on Jewel, where an object was empty across the board but had inconsistent metadata. Ultimately it was resolved by doing a "rados get" and then a "rados put" on the object. *However*, that was a last-ditch effort after I couldn't get any other repair option to work, and I have no idea if it will cause any issues down the road :)
>
> --Lincoln
>
>> On Oct 20, 2017, at 10:16 AM, Richard Bade <hitrich@xxxxxxxxx> wrote:
>>
>> Hi Everyone,
>> In our cluster running 0.94.10 we had a pg pop up as inconsistent
>> during scrub. Previously when this has happened running ceph pg repair
>> [pg_num] has resolved the problem. This time the repair runs but it
>> remains inconsistent.
>> ~$ ceph health detail
>> HEALTH_ERR 1 pgs inconsistent; 2 scrub errors; noout flag(s) set
>> pg 3.f05 is active+clean+inconsistent, acting [171,23,131]
>> 1 scrub errors
>>
>> The error in the logs is:
>> cstor01 ceph-mon: osd.171 10.233.202.21:6816/12694 45 : deep-scrub
>> 3.f05 3/68ab5f05/rbd_data.19cdf512ae8944a.000000000001bb56/snapdir
>> expected clone 3/68ab5f05/rbd_data.19cdf512ae8944a.000000000001bb56/148d2
>>
>> Now, I've tried several things to resolve this. I've tried stopping
>> each of the OSDs in turn and running a repair. I've located the rbd
>> image and removed it to empty out the object. The object is now zero
>> bytes but still inconsistent. I've also tried stopping each OSD,
>> removing the object, and starting the OSD again. It correctly
>> identifies the object as missing, and repair fixes that, but the PG
>> still remains inconsistent.
>> I've run out of ideas.
>> The object is now zero bytes:
>> ~$ find /var/lib/ceph/osd/ceph-23/current/3.f05_head/ -name
>> "*19cdf512ae8944a.000000000001bb56*" -ls
>> 537598582      0 -rw-r--r--   1 root     root            0 Oct 21
>> 03:54 /var/lib/ceph/osd/ceph-23/current/3.f05_head/DIR_5/DIR_0/DIR_F/DIR_5/DIR_B/rbd\\udata.19cdf512ae8944a.000000000001bb56__snapdir_68AB5F05__3
>>
>> How can I resolve this? Is there some way to remove the empty object
>> completely? I saw reference to ceph-objectstore-tool which has some
>> options to remove-clone-metadata but I don't know how to use this.
>> Will using this to remove the mentioned 148d2 expected clone resolve
>> this? Or would this do the opposite as it would seem that it can't
>> find that clone?
>> Documentation on this tool is sparse.
>>
>> Any help here would be appreciated.
>>
>> Regards,
>> Rich
>> _______________________________________________
>> ceph-users mailing list
>> ceph-users@xxxxxxxxxxxxxx
>> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>


