Re: Disk failures

Hello,

On Wed, 08 Jun 2016 20:26:56 +0000 Krzysztof Nowicki wrote:

> Hi,
> 
> On Wed, 8 Jun 2016 at 21:35 Gandalf Corvotempesta <
> gandalf.corvotempesta@xxxxxxxxx> wrote:
> 
> > 2016-06-08 20:49 GMT+02:00 Krzysztof Nowicki <
> > krzysztof.a.nowicki@xxxxxxxxx>:
> > > From my own experience with failing HDDs I've seen cases where the
> > > drive was failing silently at first. This manifested itself in
> > > repeated deep scrub failures. Correct me if I'm wrong here, but Ceph
> > > keeps checksums of the data being written, and in case that data is
> > > read back corrupted on one of the OSDs this will be detected by
> > > scrub and reported as an inconsistency. In such cases automatic
> > > repair should be sufficient, as with the checksums it is possible to
> > > tell which copy is correct. The OSD will not be removed
> > > automatically; it is up to the cluster administrator to get
> > > suspicious if such inconsistencies occur repeatedly and to remove
> > > the OSD in question.
> >
> > OK, but could this lead to data corruption? What would happen to the
> > client if a write fails?
> >
> If a write fails due to an IO error on the underlying HDD the OSD daemon
> will most likely abort.
Indeed it will.
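
You will see it flagged down and the reason in the OSD log, along these
lines (the OSD id and log path are just examples, assuming default
locations):

    ceph -s
    ceph osd tree | grep down
    less /var/log/ceph/ceph-osd.12.log   # look for the assert/abort backtrace
    dmesg                                # the kernel logs the underlying I/O errors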

> In case a write succeeds but gets corrupted by a silent HDD failure you
> will have corrupted data on this OSD. I'm not sure if Ceph verifies the
> checksums upon read, but if it doesn't then the data read back by the
> client could be corrupted in case the corruption happened on the primary
> OSD for that PG.
That.

Ceph currently doesn't do any (relevant) checksumming at all, so if a
PRIMARY PG suffers from bit-rot this will go undetected until the next
deep-scrub.

This is one of the longest-standing and gravest open issues with Ceph
and is supposed to be addressed by bluestore (which currently doesn't
have checksum-verified reads either).
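
In practice that means you find out about it from scrub errors and then
decide how to act, along the lines of (the PG id here is just an
example):

    ceph health detail               # lists PGs marked inconsistent
    rados list-inconsistent-obj 2.5  # Jewel and later, shows the affected objects
    ceph pg deep-scrub 2.5           # re-verify that PG
    ceph pg repair 2.5               # see the caveat below

Keep in mind that on filestore "ceph pg repair" simply copies the
primary's version over the replicas, so make sure the primary isn't the
corrupted copy before running it.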


> The behaviour could also be affected by the filesystem the OSD is
> running on. For example BTRFS is known for keeping data checksums, and
> in such a case reading corrupted data will fail at the filesystem level
> and the OSD will just see an IO error.
> 
Correct.

However, BTRFS (and ZFS) as a filestore for Ceph do open other cans of
worms.
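
If you do run OSDs on BTRFS you can at least see those checksum errors
at the filesystem level, for example (the mount point is just an example
for a default setup):

    btrfs device stats /var/lib/ceph/osd/ceph-0
    btrfs scrub start -B -d /var/lib/ceph/osd/ceph-0
    dmesg | grep -i 'csum failed'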

Regards,

Christian
> >
> > > When the drive fails more severely and causes IO failures then the
> > > effect will most likely be an abort of the OSD daemon which causes
> > > the relevant OSD to go down. The cause of the abort can be
> > > determined by examining the logs.
> >
> > In this case, healing and rebalancing are done automatically, right?
> > If I want replica 3 and one OSD fails, would the objects stored on
> > that OSD be automatically moved and replicated across the cluster to
> > keep my replica requirement?
> >
> Yes, this is correct.
> 
> >
> > > In any case SMART is your best friend and it is strongly advised to
> > > run smartd in order to get early warnings.
> >
> > Yes, but SMART is not always reliable.
> >
> True, but it won't harm to have it running anyway.
> 
> >
> > All modern RAID controllers are able to read the whole disk (or disks)
> > looking for bad sectors or inconsistencies; the SMART extended test
> > doesn't do this.
> >
> Strange. From what I understood the extended SMART test actually goes
> over each sector and tests it for readability.
> 
> Regards
> Chris
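
Re the SMART extended test above: it is indeed meant to read the entire
disk surface. The easy way to get both periodic tests and early warnings
is smartd; a line like the following in /etc/smartd.conf (the schedule
is just an example) runs a short self-test daily at 02:00 and a long one
every Saturday at 03:00:

    DEVICESCAN -a -o on -S on -s (S/../.././02|L/../../6/03) -m root

You can also start one by hand with "smartctl -t long /dev/sdX" and
check the result afterwards with "smartctl -l selftest /dev/sdX".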


-- 
Christian Balzer        Network/Systems Engineer                
chibi@xxxxxxx   	Global OnLine Japan/Rakuten Communications
http://www.gol.com/
_______________________________________________
ceph-users mailing list
ceph-users@xxxxxxxxxxxxxx
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com



