Re: Disk failures

Hi,

Wed, 8 Jun 2016 at 21:35, Gandalf Corvotempesta <gandalf.corvotempesta@xxxxxxxxx> wrote:
2016-06-08 20:49 GMT+02:00 Krzysztof Nowicki <krzysztof.a.nowicki@xxxxxxxxx>:
> From my own experience with failing HDDs I've seen cases where the drive
> was failing silently at first. This manifested itself in repeated deep
> scrub failures. Correct me if I'm wrong here, but Ceph keeps checksums of
> the data being written, and if that data is read back corrupted on one of
> the OSDs this will be detected by scrub and reported as an inconsistency.
> In such cases automatic repair should be sufficient, as with the checksums
> it is possible to tell which copy is correct. The OSD will not be removed
> automatically; it is up to the cluster administrator to get suspicious if
> such inconsistencies occur repeatedly and remove the OSD in question.

Ok, but could this lead to data corruption? What would happen to the client
if a write fails?
If a write fails due to an I/O error on the underlying HDD, the OSD daemon will most likely abort.
If a write succeeds but the data gets corrupted by a silent HDD failure, you will have corrupted data on that OSD. I'm not sure whether Ceph verifies the checksums on read, but if it doesn't, the data read back by the client could be corrupted if the corruption happened on the primary OSD for that PG.
The behaviour could also be affected by the filesystem the OSD is running on. For example, BTRFS is known for keeping data checksums, in which case reading corrupted data will fail at the filesystem level and the OSD will just see an I/O error.
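For reference, when a deep scrub does catch such a mismatch, the repair workflow is roughly the following (a rough sketch from memory; the PG id 1.2f is just a made-up example):

    ceph health detail                                       # lists the PGs flagged as inconsistent
    rados list-inconsistent-obj 1.2f --format=json-pretty    # Jewel and later: show what differs between the replicas
    ceph pg repair 1.2f                                      # rewrite the bad copy from the ones Ceph considers good

The repair only fixes that one PG, so if the same OSD keeps turning up in these reports, that is a good hint the drive underneath is on its way out.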

> When the drive fails more severely and causes I/O failures, the effect
> will most likely be an abort of the OSD daemon, which causes the relevant
> OSD to go down. The cause of the abort can be determined by examining the
> logs.

In this case, healing and rebalancing are done automatically, right?
If I want a replica count of 3 and one OSD fails, will the objects stored on
that OSD be automatically moved and replicated across the cluster to keep my
replica requirement?
Yes, this is correct. 
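To be a bit more precise: the failed OSD is first marked "down", and once mon_osd_down_out_interval expires (600 seconds by default, if I remember correctly) it is also marked "out"; only then does CRUSH remap the affected PGs and backfilling recreate the missing replicas on other OSDs. You can follow the process with the usual commands (nothing here is specific to your setup):

    ceph osd tree   # shows which OSDs are down/out
    ceph -w         # follow recovery/backfill progress live
    ceph pg stat    # summary of degraded/misplaced objects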

> In any case SMART is your best friend and it is strongly advised to run
> smartd in order to get early warnings.

Yes, but SMART is not always reliable.
True, but it doesn't hurt to have it running anyway.
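For what it's worth, a smartd.conf entry along these lines has worked well for me (the device, schedule and mail address are only placeholders, adjust to taste):

    /dev/sda -a -o on -S on -s (S/../.././02|L/../../6/03) -m root@localhost

That monitors all attributes, enables automatic offline data collection and attribute autosave, runs a short self-test every night at 02:00 and a long one every Saturday at 03:00, and mails warnings to root.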

All modern RAID controllers are able to read the whole disk (or disks)
looking for bad sectors or inconsistencies; the SMART extended test doesn't
do this.
Strange. From what I understand, the extended SMART test actually goes over each sector and tests it for readability.
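You can start one by hand and check the outcome with the standard smartctl invocations (/dev/sda is just an example device):

    smartctl -t long /dev/sda       # start the extended self-test in the background
    smartctl -l selftest /dev/sda   # show the result, including the LBA of the first read error if any
    smartctl -A /dev/sda            # reallocated/pending sector counters are worth watching too

If the long test completes without errors, the drive has at least managed to read every sector once.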

Regards
Chris 
_______________________________________________
ceph-users mailing list
ceph-users@xxxxxxxxxxxxxx
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
