Hello,

On Wed, 08 Jun 2016 20:26:56 +0000 Krzysztof Nowicki wrote:

> Hi,
>
> On Wed, 8 Jun 2016 at 21:35 Gandalf Corvotempesta
> <gandalf.corvotempesta@xxxxxxxxx> wrote:
>
> > 2016-06-08 20:49 GMT+02:00 Krzysztof Nowicki
> > <krzysztof.a.nowicki@xxxxxxxxx>:
> > > From my own experience with failing HDDs I've seen cases where the
> > > drive was failing silently at first. This manifested itself in
> > > repeated deep-scrub failures. Correct me if I'm wrong here, but Ceph
> > > keeps checksums of the data being written, and if that data is read
> > > back corrupted on one of the OSDs this will be detected by scrub and
> > > reported as an inconsistency. In such cases automatic repair should
> > > be sufficient, since with the checksums it is possible to tell which
> > > copy is correct. The OSD will not be removed automatically; it is up
> > > to the cluster administrator to get suspicious if such
> > > inconsistencies occur repeatedly and to remove the OSD in question.
> >
> > OK, but could this lead to data corruption? What would happen to the
> > client if a write fails?
> >
> If a write fails due to an IO error on the underlying HDD the OSD
> daemon will most likely abort.

Indeed it will.

> In case a write succeeds but gets corrupted by a silent HDD failure,
> you will have corrupted data on this OSD. I'm not sure whether Ceph
> verifies checksums on read, but if it doesn't then the data read back
> by the client could be corrupted if the corruption happened on the
> primary OSD for that PG.

That.
Ceph currently doesn't do any (relevant) checksumming at all, so if a
PRIMARY PG suffers from bit-rot this will go undetected until the next
deep-scrub.
This is one of the oldest and gravest outstanding issues with Ceph and
is supposed to be addressed by BlueStore (which currently doesn't have
checksum-verified reads either).

> The behaviour could also be affected by the filesystem the OSD is
> running on. For example BTRFS is known for keeping data checksums, and
> in that case reading corrupted data will fail at the filesystem level
> and the OSD will just see an IO error.
>
Correct.
However, BTRFS (and ZFS) as a filestore for Ceph open other cans of
worms.

See the PS below for a few command-level notes on the scrub, recovery
and SMART points raised in this thread.

Regards,

Christian

> > > When the drive fails more severely and causes IO failures, the
> > > effect will most likely be an abort of the OSD daemon, which causes
> > > the relevant OSD to go down. The cause of the abort can be
> > > determined by examining the logs.
> >
> > In this case, healing and rebalancing is done automatically, right?
> > If I want replica 3 and one OSD fails, would the objects stored on
> > that OSD be automatically moved and replicated across the cluster to
> > keep my replica requirement?
> >
> Yes, this is correct.
>
> > > In any case SMART is your best friend and it is strongly advised to
> > > run smartd in order to get early warnings.
> >
> > Yes, but SMART is not always reliable.
> >
> True, but it won't harm to have it running anyway.
>
> > All modern RAID controllers are able to read the whole disk (or
> > disks) looking for bad sectors or inconsistencies; the SMART extended
> > test doesn't do this.
> >
> Strange. From what I understood, the extended SMART test actually goes
> over each sector and tests it for readability.
> > Regards
> Chris

--
Christian Balzer        Network/Systems Engineer
chibi@xxxxxxx           Global OnLine Japan/Rakuten Communications
http://www.gol.com/
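
PS: A few command-level notes on the points above. This is from memory
and meant as a rough sketch rather than gospel, so check it against the
documentation for your release; the PG id, pool and device names below
are made-up examples.

When a deep-scrub does flag an inconsistency on a Jewel-era cluster, the
usual drill looks roughly like this (PG 3.4 is a placeholder):

  ceph health detail                     # lists the inconsistent PGs
  rados list-inconsistent-obj 3.4 --format=json-pretty
  grep ERR /var/log/ceph/ceph-osd.*.log  # which OSD/shard produced the errors
  ceph pg repair 3.4

Keep in mind that with filestore a repair will, as far as I recall,
simply push the primary's copy back out to the replicas, which is
exactly why the missing checksum verification is such a sore point:
figure out which copy is the bad one before letting it "repair".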
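
On the "will it heal itself with replica 3" question: yes, provided the
pool size really is 3 and there are enough OSDs/hosts left in the
failure domain to re-replicate into. What I'd glance at (pool name "rbd"
is just an example):

  ceph osd pool get rbd size       # should report size: 3
  ceph osd pool get rbd min_size
  ceph osd tree                    # spot the down OSD and its host
  ceph -w                          # watch backfill/recovery progress

By default a dead OSD is only marked out (and re-replication starts)
after "mon osd down out interval", 300 seconds, so don't expect data
movement the instant the daemon dies.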
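
And on the SMART extended test: as far as I know the long self-test does
read the entire surface; what it won't do is fix anything, it only
reports. Kicking one off by hand and letting smartd schedule them looks
roughly like this (device name and schedule are placeholders):

  smartctl -t long /dev/sda        # start the extended self-test
  smartctl -l selftest /dev/sda    # check the result once it has finished

  # /etc/smartd.conf: monitor all disks, short test daily at 02:00,
  # long test Saturdays at 03:00, mail warnings to root
  DEVICESCAN -a -o on -S on -s (S/../.././02|L/../../6/03) -m root

That, plus keeping an eye on the reallocated/pending sector counts,
catches a good share of drives before they take an OSD down with them.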