Re: Inconsistent PG's, repair ineffective

David Zafman <david.zafman@xxxxxxxxxxx> · Wed, 22 May 2013 15:07:26 -0700

You need to find out where the third copy is, deliberately corrupt it so that it is detectably bad, and then let repair overwrite it with the data from a good copy.

$ ceph pg map 19.1b

You should see something like this:
osdmap e158 pg 19.1b (19.1b) -> up [13, 22, xx] acting [13, 22, xx]

The osd xx that is NOT 13 or 22 has the corrupted copy. Connect to the node that hosts that osd.
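
A minimal sketch of picking that OSD out of the pg map line, assuming the output format shown above. The helper name third_osd is hypothetical (it is not a ceph command), and the known-good OSD ids are passed in by hand:

```shell
#!/bin/sh
# Sketch: print the members of a PG's acting set that are not known-good.
# third_osd is a hypothetical helper, not part of ceph.
# $1 is one line of `ceph pg map` output; the remaining args are good OSD ids.
third_osd() {
    line=$1
    shift
    # Keep only the text inside the brackets after "acting", split on commas.
    acting=$(printf '%s\n' "$line" \
        | sed -n 's/.*acting \[\([^]]*\)\].*/\1/p' \
        | tr -d ' ' | tr ',' ' ')
    for osd in $acting; do
        skip=0
        for good in "$@"; do
            [ "$osd" = "$good" ] && skip=1
        done
        [ "$skip" -eq 0 ] && echo "$osd"
    done
    return 0
}

# Example with the placeholder line from this message; on a live cluster
# pass "$(ceph pg map 19.1b)" as the first argument instead.
third_osd 'osdmap e158 pg 19.1b (19.1b) -> up [13, 22, xx] acting [13, 22, xx]' 13 22
```

With the placeholder line this prints the literal xx; on a real cluster it would print the numeric id of the remaining replica.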

In the mount for osd xx, find the object named "rb.0.6989.2ae8944a.00000000005b":

$ find /var/lib/ceph/osd/ceph-xx -name 'rb.0.6989.2ae8944a.00000000005b*' -ls
201326612    4 -rw-r--r--   1 root     root          255 May 22 14:11 /var/lib/ceph/osd/ceph-xx/current/19.1b_head/rb.0.6989.2ae8944a.00000000005b__head_XXXXXXXX__0

I would stop osd xx first.  In this case the file is 255 bytes long.  To make sure this bad copy isn't used, let's make the file 1 byte longer.

$ truncate -s 256 /var/lib/ceph/osd/ceph-xx/current/19.1b_head/rb.0.6989.2ae8944a.00000000005b__head_XXXXXXXX__0
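
The same step can be sketched without hard-coding the size, assuming GNU truncate is available. The helper name grow_by_one is hypothetical, and the demonstration runs on a throwaway temporary file rather than on OSD data:

```shell
#!/bin/sh
# Sketch: make a file exactly one byte longer so its size no longer matches
# the good replicas. grow_by_one is a hypothetical helper name; on a real
# OSD, run it only while that OSD is stopped.
grow_by_one() {
    file=$1
    size=$(wc -c < "$file") || return 1
    size=$((size + 0))      # normalise any padding wc adds on some platforms
    # truncate extends the file with a zero byte when the new size is larger.
    truncate -s "$((size + 1))" "$file" || return 1
    echo "$file: $size -> $((size + 1)) bytes"
}

# Demonstration on a throwaway 255-byte file, not on real OSD data.
tmp=$(mktemp)
head -c 255 /dev/zero > "$tmp"
grow_by_one "$tmp"
rm -f "$tmp"
```

Reading the current size first avoids accidentally shrinking the object if it turns out not to be 255 bytes on your cluster.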

Restart osd xx.  I'm not sure what command does that on your platform.
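
For the platform named further down in this thread (Cuttlefish on CentOS 6), the OSDs are presumably managed by the sysvinit ceph script; that is an assumption about the install, so check it before relying on it. A minimal sketch that only prints the stop and start commands for review, using a hypothetical helper named osd_service_cmd:

```shell
#!/bin/sh
# Sketch: build the sysvinit command for stopping or starting one OSD.
# osd_service_cmd is a hypothetical helper; it only prints the command
# so it can be reviewed before being run on the OSD's host.
# Assumes a sysvinit-managed ceph install (an assumption, not from the post).
osd_service_cmd() {
    action=$1   # "stop" or "start"
    id=$2       # OSD id, e.g. the xx from the pg map output
    case $action in
        stop|start)
            echo "service ceph $action osd.$id"
            ;;
        *)
            echo "usage: osd_service_cmd stop|start <id>" >&2
            return 1
            ;;
    esac
}

osd_service_cmd stop xx
osd_service_cmd start xx
```

Printing rather than executing keeps the sketch harmless on hosts that use a different init system.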

Verify that all OSDs are running.  The output should show all osds up and in.
$ ceph -s | grep osdmap
osdmap e6: 6 osds: 6 up, 6 in
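
A minimal sketch of turning that check into a yes/no test, assuming the osdmap summary line has the format shown above. The helper name all_osds_up_in is hypothetical:

```shell
#!/bin/sh
# Sketch: succeed only if the osdmap summary shows every OSD both up and in.
# all_osds_up_in is a hypothetical helper; $1 is the "osdmap" line of `ceph -s`.
all_osds_up_in() {
    # Pull "<total> osds: <up> up, <in> in" out as three numbers.
    set -- $(printf '%s\n' "$1" | sed -n \
        's/.*: \([0-9][0-9]*\) osds: \([0-9][0-9]*\) up, \([0-9][0-9]*\) in.*/\1 \2 \3/p')
    # Fail if the line did not parse, or if the counts differ.
    [ $# -eq 3 ] && [ "$1" -eq "$2" ] && [ "$1" -eq "$3" ]
}

# Example with the line from this message; on a live cluster pass
# "$(ceph -s | grep osdmap)" instead.
if all_osds_up_in 'osdmap e6: 6 osds: 6 up, 6 in'; then
    echo "all OSDs up and in"
else
    echo "some OSDs are down or out"
fi
```
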

$ ceph pg repair 19.1b
instructing pg 19.1b on osd.13 to repair

David Zafman
Senior Developer
http://www.inktank.com

On May 21, 2013, at 3:39 PM, John Nielsen <lists@xxxxxxxxxxxx> wrote:

> I've checked, all the disks are fine and the cluster is healthy except for the inconsistent objects.
> 
> How would I go about manually repairing?
> 
> On May 21, 2013, at 3:26 PM, David Zafman <david.zafman@xxxxxxxxxxx> wrote:
> 
>> 
>> I can't reproduce this on v0.61-2.  Could the disks for osd.13 & osd.22 be unwritable?
>> 
>> In your case it looks like the 3rd replica is probably the bad one, since osd.13 and osd.22 report the same digest.  You probably want to manually repair the 3rd replica.
>> 
>> David Zafman
>> Senior Developer
>> http://www.inktank.com
>> 
>> 
>> 
>> 
>> On May 21, 2013, at 6:45 AM, John Nielsen <lists@xxxxxxxxxxxx> wrote:
>> 
>>> Cuttlefish on CentOS 6, ceph-0.61.2-0.el6.x86_64.
>>> 
>>> On May 21, 2013, at 12:13 AM, David Zafman <david.zafman@xxxxxxxxxxx> wrote:
>>> 
>>>> 
>>>> What version of ceph are you running?
>>>> 
>>>> David Zafman
>>>> Senior Developer
>>>> http://www.inktank.com
>>>> 
>>>> On May 20, 2013, at 9:14 AM, John Nielsen <lists@xxxxxxxxxxxx> wrote:
>>>> 
>>>>> Some scrub errors showed up on our cluster last week. We had some issues with host stability a couple weeks ago; my guess is that errors were introduced at that point and a recent background scrub detected them. I was able to clear most of them via "ceph pg repair", but several remain. Based on some other posts, I'm guessing that they won't repair because it is the primary copy that has the error. All of our pools are set to size 3 so there _ought_ to be a way to verify and restore the correct data, right?
>>>>> 
>>>>> Below is some log output about one of the problem PGs. Can anyone suggest a way to fix the inconsistencies?
>>>>> 
>>>>> 2013-05-20 10:07:54.529582 osd.13 10.20.192.111:6818/20919 3451 : [ERR] 19.1b osd.13: soid 507ada1b/rb.0.6989.2ae8944a.00000000005b/5//19 digest 4289025870 != known digest 4190506501
>>>>> 2013-05-20 10:07:54.529585 osd.13 10.20.192.111:6818/20919 3452 : [ERR] 19.1b osd.22: soid 507ada1b/rb.0.6989.2ae8944a.00000000005b/5//19 digest 4289025870 != known digest 4190506501
>>>>> 2013-05-20 10:07:54.606034 osd.13 10.20.192.111:6818/20919 3453 : [ERR] 19.1b repair 0 missing, 1 inconsistent objects
>>>>> 2013-05-20 10:07:54.606066 osd.13 10.20.192.111:6818/20919 3454 : [ERR] 19.1b repair 2 errors, 2 fixed
>>>>> 2013-05-20 10:07:55.034221 osd.13 10.20.192.111:6818/20919 3455 : [ERR] 19.1b osd.13: soid 507ada1b/rb.0.6989.2ae8944a.00000000005b/5//19 digest 4289025870 != known digest 4190506501
>>>>> 2013-05-20 10:07:55.034224 osd.13 10.20.192.111:6818/20919 3456 : [ERR] 19.1b osd.22: soid 507ada1b/rb.0.6989.2ae8944a.00000000005b/5//19 digest 4289025870 != known digest 4190506501
>>>>> 2013-05-20 10:07:55.113230 osd.13 10.20.192.111:6818/20919 3457 : [ERR] 19.1b deep-scrub 0 missing, 1 inconsistent objects
>>>>> 2013-05-20 10:07:55.113235 osd.13 10.20.192.111:6818/20919 3458 : [ERR] 19.1b deep-scrub 2 errors
>>>>> 
>>>>> Thanks,
>>>>> 
>>>>> JN
>>>>> 
>>>>> _______________________________________________
>>>>> ceph-users mailing list
>>>>> ceph-users@xxxxxxxxxxxxxx
>>>>> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
