2016-06-08 20:49 GMT+02:00 Krzysztof Nowicki <krzysztof.a.nowicki@xxxxxxxxx>:
> From my own experience with failing HDDs I've seen cases where the drive
> was failing silently at first. This manifested itself in repeated deep
> scrub failures. Correct me if I'm wrong here, but Ceph keeps checksums of
> the data being written, and if that data is read back corrupted on one of
> the OSDs this will be detected by scrub and reported as an inconsistency.
> In such cases automatic repair should be sufficient, since with the
> checksums it is possible to tell which copy is correct. In that case the
> OSD will not be removed automatically, and it's up to the cluster
> administrator to get suspicious if such inconsistencies occur repeatedly
> and remove the OSD in question.

OK, but could this lead to data corruption? What would happen to the client
if a write fails?

> When the drive fails more severely and causes IO errors, the effect will
> most likely be an abort of the OSD daemon, which causes the relevant OSD
> to go down. The cause of the abort can be determined by examining the
> logs.

In this case, healing and rebalancing are done automatically, right? If I
want a replica count of 3 and one OSD fails, will the objects stored on
that OSD be automatically moved and replicated across the cluster to keep
my replica requirement?

> In any case SMART is your best friend and it is strongly advised to run
> smartd in order to get early warnings.

Yes, but SMART is not always reliable. All modern RAID controllers are able
to read the whole disk (or disks) looking for bad sectors or
inconsistencies; the SMART extended test doesn't do this.
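
If I understand the scrub mechanism correctly, handling an inconsistency
would look roughly like the following (just a sketch; the PG id 2.5 is a
placeholder for whatever 'ceph health detail' actually reports):

  # list the PGs that deep scrub flagged as inconsistent
  ceph health detail | grep inconsistent

  # inspect which objects/shards differ inside that PG
  rados list-inconsistent-obj 2.5 --format=json-pretty

  # ask the primary OSD to repair the PG from the good copies
  ceph pg repair 2.5

Is that the right way to handle it, or does repair need more care than
that?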
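
On the rebalancing question, my understanding so far is that a down OSD is
only marked "out" (and its data re-replicated elsewhere) after
mon_osd_down_out_interval expires, which I believe defaults to 600 seconds;
please correct me if that's wrong. Something like this (osd.12 is a
placeholder id):

  # see which OSDs are down
  ceph osd tree

  # mark the failed OSD out immediately instead of waiting
  # for mon_osd_down_out_interval to expire
  ceph osd out 12

  # watch backfill/recovery restore the replica count
  ceph -w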
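
As for smartd, this is roughly the kind of smartd.conf line I had in mind
for early warnings (the schedule is only an example: short self-test daily
at 02:00, long self-test every Saturday at 03:00):

  /dev/sda -a -o on -S on -s (S/../.././02|L/../../6/03) -m root@localhost

It still doesn't give me the same confidence as a full patrol read done by
a RAID controller, hence my question.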