Re: HDD bad sector, pg inconsistent, no object remapping

Mihály Árva-Tóth <mihaly.arva-toth@xxxxxxxxxxxxxxxxxxxxxx> · Tue, 19 Nov 2013 11:06:35 +0100

Hello David and Chris,

Thank you for your replies in this thread. 

>> The automatic repair should handle getting an EIO during read of the object replica. 

I think that when the OSD tries to read an object from the primary disk and hits the bad sector, the controller does not respond with EIO but with something else.

If you can tell me how to debug the response code, I will try to find out what it is.
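
One way to see which error code a read actually returns is to read the suspect object file directly from the OSD's filesystem and record the errno of any failure. The following is only a minimal Python sketch under stated assumptions, not anything taken from Ceph: the function name `probe_read_errors` is hypothetical, and it assumes the object's file under the OSD data directory has already been located by hand.

```python
import errno
import os


def probe_read_errors(path, chunk_size=4096):
    """Read `path` chunk by chunk and collect (offset, errno name) for failed reads.

    Returns an empty list if every chunk was readable.
    """
    failures = []
    # Use raw file-descriptor reads so each chunk maps to one pread() call.
    fd = os.open(path, os.O_RDONLY)
    try:
        size = os.fstat(fd).st_size
        offset = 0
        while offset < size:
            try:
                os.pread(fd, chunk_size, offset)
            except OSError as exc:
                # Translate the numeric errno (e.g. 5) into its name (e.g. "EIO").
                name = errno.errorcode.get(exc.errno, str(exc.errno))
                failures.append((offset, name))
            offset += chunk_size
    finally:
        os.close(fd)
    return failures
```

Calling `probe_read_errors()` on the object's file should give an entry such as `(offset, "EIO")` if the kernel reports an I/O error for that region. Note that data already in the page cache can be returned without touching the disk, so a clean result right after an earlier read is not conclusive.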

Thank you,
Mihaly

2013/11/19 David Zafman <david.zafman@xxxxxxxxxxx>

I looked at the code.  The automatic repair should handle getting an EIO during read of the object replica.  It does NOT require removing the object as I said before, so it doesn’t matter which copy has bad sectors.  It will copy from a good replica to the primary, if necessary.  By default a deep-scrub which would catch this case is performed weekly.  A repair must be initiated by administrative action.
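
Since the repair has to be started by an administrator, a small helper can list the inconsistent PGs and prepare the repair commands. This is a hedged Python sketch, not part of Ceph: the helper names are hypothetical, the sample text is illustrative rather than captured from a real cluster, and the assumed line format of `ceph health detail` ("pg <pgid> is <state>, acting [...]") may differ between releases.

```python
import re

# Matches lines such as "pg 2.1f is active+clean+inconsistent, acting [0,3,5]".
PG_LINE = re.compile(r"^pg (\S+) is (\S+), acting \[([\d,]*)\]")


def inconsistent_pgs(health_detail):
    """Return (pgid, acting OSD ids) for every PG whose state includes 'inconsistent'."""
    found = []
    for line in health_detail.splitlines():
        match = PG_LINE.match(line.strip())
        if match and "inconsistent" in match.group(2).split("+"):
            osds = [int(x) for x in match.group(3).split(",") if x]
            found.append((match.group(1), osds))
    return found


def repair_commands(pgids):
    """Build the 'ceph pg repair' command line for each PG id (does not run them)."""
    return [["ceph", "pg", "repair", pgid] for pgid in pgids]


# Illustrative sample only; assumed format, not real cluster output.
SAMPLE = """HEALTH_ERR 1 pgs inconsistent; 1 scrub errors
pg 2.1f is active+clean+inconsistent, acting [0,3,5]
1 scrub errors
"""

for pgid, osds in inconsistent_pgs(SAMPLE):
    print(pgid, osds)
```

The commands are only built, not executed, so the administrator can review which PGs would be repaired before running anything.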

When replicas differ due to comparison of checksums, we currently don’t have a way to determine which copy(s) are corrupt.  This is where a manual intervention may be necessary if the administrator can determine which copy(s) are bad.

David Zafman

Senior Developer

http://www.inktank.com

On Nov 18, 2013, at 1:11 PM, Chris Dunlop <chris@xxxxxxxxxxxx> wrote:

> OK, that's good (as far as it goes, being a manual process).
>
> So then, back to what I think was Mihály's original issue:
>
>> pg repair or deep-scrub can not fix this issue. But if I
>> understand correctly, the osd has to know it can not retrieve
>> the object from osd.0 and it needs to be replicated to another
>> osd because there are not 3 working replicas now.
>
> Given a bad checksum and/or read error tells ceph that an object
> is corrupt, it would seem to be a natural step to then have ceph
> automatically use another good-checksum copy, and even rewrite
> the corrupt object, either in normal operation or under a scrub
> or repair.
>
> Is there a reason this isn't done, apart from lack of tuits?
>
> Cheers,
>
> Chris
>
>
> On Mon, Nov 18, 2013 at 11:43:46AM -0800, David Zafman wrote:
>>
>> No, you wouldn’t need to re-replicate the whole disk for a single bad sector.  The way to deal with that if the object is on the primary is to remove the file manually from the OSD’s filesystem and perform a repair of the PG that holds that object.  This will copy the object back from one of the replicas.
>>
>> David
>>
>> On Nov 17, 2013, at 10:46 PM, Chris Dunlop <chris@xxxxxxxxxxxx> wrote:
>>
>>> Hi David,
>>>
>>> On Fri, Nov 15, 2013 at 10:00:37AM -0800, David Zafman wrote:
>>>>
>>>> Replication does not occur until the OSD is “out.”  This creates a new mapping in the cluster of where the PGs should be and thus data begins to move and/or create sufficient copies.  This scheme lets you control how and when you want the replication to occur.  If you have plenty of space and you aren’t going to replace the drive immediately, just mark the OSD “down” AND “out”.  If you are going to replace the drive immediately, set the “noout” flag.  Take the OSD “down” and replace the drive.  Assuming it is mounted in the same place as the bad drive, bring the OSD back up.  This will replicate exactly the same PGs the bad drive held back to the replacement drive.  As was stated before, don’t forget to “ceph osd unset noout”.
>>>>
>>>> Keep in mind that in the case of a machine that has a hardware failure and takes OSD(s) down there is an automatic timeout which will mark them “out” for unattended operation.  Unless you are monitoring the cluster 24/7 you should have enough disk space available to handle failures.
>>>>
>>>> Related info in:
>>>>
>>>> http://ceph.com/docs/master/rados/troubleshooting/troubleshooting-osd/
>>>>
>>>> David Zafman
>>>> Senior Developer
>>>> http://www.inktank.com
>>>
>>>
>>> Are you saying, if a disk suffers from a bad sector in an object
>>> for which it's primary, and for which good data exists on other
>>> replica PGs, there's no way for ceph to recover other than by
>>> (re-)replicating the whole disk?
>>>
>>> I.e., even if the disk is able to remap the bad sector using a
>>> spare, so the disk is ok (albeit missing a sector's worth of
>>> object data), the only way to recover is to basically blow away
>>> all the data on that disk and start again, replicating
>>> everything back to the disk (or to other disks)?
>>>
>>> Cheers,
>>>
>>> Chris.

-- 
Best regards,
Mihály Árva-Tóth

System Engineer

Virtual Call Center GmbH
Address: 23-33  Csalogány Street, Budapest 1027, Hungary
Tel: +36 1 999 7400
Mobile: +36 30 473 9256

Fax: +36 1 999 7401
E-mail: mihaly.arva-toth@xxxxxxxxxxxxxxxxxxxxxx
Web: www.virtual-call-center.eu

_______________________________________________
ceph-users mailing list
ceph-users@xxxxxxxxxxxxxx
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com