Sorry, I don't know the answer to that one; I've looked over the docs, and since it seems to be
the primary copy of the pg which is down, it will not auto-repair. I guess your sdc = osd.38?
So I believe your cluster now needs to find a new osd to host the primary copy of that pg.
I'm guessing you can do that by marking your bad osd out, letting it replicate what it can,
then marking it down and seeing if that fixes it. But this is just me guessing - I would wait
for someone else to chime in, or try in IRC.
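Something along these lines might do it (untested, and I'm only assuming sdc really maps to
osd.38 - the first command is there to confirm that on the host):

mount | grep sdc       # on the host, confirm which osd data directory actually sits on sdc
ceph pg 1.73c query    # see which osds the pg maps to and what state it reports
ceph osd out 38        # mark the bad osd out so data stops mapping to it; recovery should start
ceph -w                # watch recovery until the pg counts settle
ceph osd down 38       # then mark it down once it has replicated what it can

If the object still shows as unfound once recovery settles, ceph pg 1.73c list_missing should
list it, and the docs describe ceph pg 1.73c mark_unfound_lost revert as a last resort - but
read up on that before running it.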
/Martin

On Fri, Feb 22, 2013 at 2:18 PM, femi anjorin <femi.anjorin@xxxxxxxxx> wrote:
> Hi Martin,
>
> You are perfectly right. I had checked the pg num earlier... and found the host.
>
> I did dmesg on the host... one of the drives is already reporting errors, with this log:
>
> sd 0:0:2:0: [sdc] Unhandled sense code
> sd 0:0:2:0: [sdc] Result: hostbyte=DID_OK driverbyte=DRIVER_SENSE
> sd 0:0:2:0: [sdc] Sense Key : Hardware Error [current]
> sd 0:0:2:0: [sdc] Add. Sense: Internal target failure
> sd 0:0:2:0: [sdc] CDB: Read(16): 88 00 00 00 00 03 80 01 18 20 00 00 00 08 00 00
> XFS (sdc): I/O error occurred: meta-data dev sdc block 0x380011820 ("xfs_trans_read_buf") error 121 buf count 4096
> XFS (sdc): I/O error occurred: meta-data dev sdc block 0x380011820 ("xfs_trans_read_buf") error 121 buf count 4096
> XFS (sdc): I/O error occurred: meta-data dev sdc block 0x380011820 ("xfs_trans_read_buf") error 121 buf count 4096
> XFS (sdc): I/O error occurred: meta-data dev sdc block 0x380011820 ("xfs_trans_read_buf") error 121 buf count 4096
> sd 0:0:2:0: [sdc] Unhandled sense code
> sd 0:0:2:0: [sdc] Result: hostbyte=DID_OK driverbyte=DRIVER_SENSE
> sd 0:0:2:0: [sdc] Sense Key : Hardware Error [current]
> sd 0:0:2:0: [sdc] Add. Sense: Internal target failure
> sd 0:0:2:0: [sdc] CDB: Read(16): 88 00 00 00 00 03 80 01 18 20 00 00 00 08 00 00
>
> Then I checked the consistency of the drive on Linux and it was OK...
>
> Then I went back to Ceph and did the following:
>
> # ceph health details | more
> HEALTH_ERR 1 pgs inconsistent; recovery recovering 5 o/s, 4307B/s; 1 scrub errors
> pg 1.73c is active+clean+inconsistent, acting [38,68]
> recovery recovering 5 o/s, 4307B/s
> 1 scrub errors
>
> # ceph pg repair 1.73c
> instructing pg 1.73c on osd.38 to repair
>
> # ceph -w
>    health HEALTH_ERR 1 pgs inconsistent; 1 pgs stuck unclean; recovery 1/4240325 degraded (0.000%); 1 scrub errors
>    monmap e1: 3 mons at {a=172.16.0.25:6789/0,b=172.16.0.24:6789/0,c=172.16.0.27:6789/0}, election epoch 38, quorum 0,1,2 a,b,c
>    osdmap e1020: 96 osds: 96 up, 96 in
>    pgmap v10240: 12416 pgs: 12415 active+clean, 1 active+inconsistent; 10738 MB data, 1009 GB used, 674 TB / 675 TB avail; 1/4240325 degraded (0.000%)
>    mdsmap e35: 1/1/1 up {0=b=up:active}, 1 up:standby
>
> 2013-02-22 14:04:36.338239 osd.38 [ERR] 1.73c missing primary copy of 9d7a673c/100001b306c.00000000/head//1, unfound
>
> Summary: the pg won't repair... what do you suggest?
>
> Regards,
> Femi.
>
> On Fri, Feb 22, 2013 at 1:26 PM, Martin B Nielsen <martin@xxxxxxxxxxx> wrote:
>> Hi Femi,
>>
>> I just had a few of those as well - it turned out to be a disk going bad,
>> and it eventually died ~12h after those errors turned up.
>>
>> While it was ongoing I fixed it by first finding the pg in question with:
>>
>> ceph pg dump | grep inconsistent
>>
>> That should give you a pg id; then I did a deep scrub of it:
>>
>> ceph pg deep-scrub <pg_id>
>>
>> I watched the logs and found that it was inconsistent. I checked dmesg
>> and syslog and found a disk had reported a bad block via SMART. I then
>> repaired it with:
>>
>> ceph pg repair <pg_id>
>>
>> I verified with another deep-scrub afterwards.
>>
>> More info here: http://eu.ceph.com/docs.raw/ref/wip-3072/control/#pg-subsystem
>>
>> /Martin
>>
>> On Fri, Feb 22, 2013 at 1:18 PM, femi anjorin <femi.anjorin@xxxxxxxxx> wrote:
>>> Hi,
>>>
>>> Please, how should I solve this error?
>>>
>>> # ceph health
>>> HEALTH_ERR 1 pgs inconsistent, 1 scrub errors
>>>
>>> I just want to take the cluster back to a clean state.
>>>
>>> Regards.
>>> _______________________________________________
>>> ceph-users mailing list
>>> ceph-users@xxxxxxxxxxxxxx
>>> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
_______________________________________________
ceph-users mailing list
ceph-users@xxxxxxxxxxxxxx
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com