On Mon, Nov 23, 2015 at 4:15 PM, Lionel Bouton <lionel-subscription@xxxxxxxxxxx> wrote:
Hi,
On 23/11/2015 at 18:37, Jose Tavares wrote:
> Yes, but with SW-RAID, when a block that was read does not match its checksum, the device falls out of the array
I don't think so. Under normal circumstances a device only falls out of an md array if it doesn't answer IO queries after a timeout (md arrays only read from the smallest subset of devices needed to get the data; they don't verify redundancy on the fly, for performance reasons). This may not be the case when you explicitly ask an array to perform a check, though (I don't have any first-hand check failure coming to mind).
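For reference, here is a minimal sketch of how such an explicit check can be triggered and its result read back through the md sysfs interface; the sysfs attribute names are the standard md ones, but the array name (md0) and the use of Python rather than a shell one-liner are assumptions about your setup:

#!/usr/bin/env python3
# Minimal sketch: trigger an explicit consistency check on an md array and
# report the mismatch count afterwards. Assumes the array is md0 (adjust
# MD_DEV) and that the script runs as root.
import time

MD_DEV = "md0"  # assumption: change to your array
SYSFS = "/sys/block/%s/md" % MD_DEV

def write_attr(name, value):
    with open("%s/%s" % (SYSFS, name), "w") as f:
        f.write(value)

def read_attr(name):
    with open("%s/%s" % (SYSFS, name)) as f:
        return f.read().strip()

# Ask md to read every member and compare the redundant copies.
write_attr("sync_action", "check")

# Wait for the check to finish.
while read_attr("sync_action") != "idle":
    time.sleep(10)

# mismatch_cnt counts sectors whose copies did not agree during the check.
print("mismatch_cnt:", read_attr("mismatch_cnt"))

Note that a non-zero mismatch_cnt only tells you the copies differ somewhere, not which copy is the correct one, which is exactly the limitation discussed below.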
>, and the data is read again from the other devices in the array. The problem is that in SW-RAID1 we don't have the bad blocks isolated. The disks can be synchronized again as the write operation is not tested. The problem (device falling out of the array) will happen again if we try to read any other data written over the bad block.

With consumer-level SATA drives, bad blocks are handled internally nowadays: the drive remaps bad sectors to a reserve area by trying to copy their content there (this might fail, and md might not have the opportunity to correct the error: it doesn't use checksums, so it can't tell which drive has unaltered data, only which one doesn't answer IO queries).
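That internal remapping activity is visible in the drive's SMART attributes. A rough sketch of watching the relevant counters, assuming smartmontools is installed, the member drive is /dev/sda, and the script runs as root:

#!/usr/bin/env python3
# Rough sketch: print the SMART attributes that reflect a drive's internal
# bad-sector remapping. Assumes smartmontools is installed and the drive to
# inspect is /dev/sda (adjust DEVICE); run as root.
import subprocess

DEVICE = "/dev/sda"  # assumption: the md member you want to watch

out = subprocess.check_output(["smartctl", "-A", DEVICE]).decode()

for line in out.splitlines():
    # Attribute 5 counts sectors already remapped to the reserve area,
    # attribute 197 counts sectors currently pending reallocation.
    if "Reallocated_Sector_Ct" in line or "Current_Pending_Sector" in line:
        print(line)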
Hmm, suppose the drive is unable to remap bad blocks internally: when you write data to the drive, it also writes a checksum of that data in hardware. When you read the data back, the drive compares it against the checksum written previously. If that check fails, the drive resets and the SW-RAID drops the drive. This is how SATA drives work..
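When that happens, md marks the member as faulty. A trivial sketch for spotting a kicked member in /proc/mdstat (where faulty devices carry an "(F)" flag):

#!/usr/bin/env python3
# Trivial sketch: report md members that have been kicked out as faulty.
# /proc/mdstat marks such devices with an "(F)" flag, e.g. "sdb1[1](F)".
with open("/proc/mdstat") as f:
    for line in f:
        if "(F)" in line:
            print("faulty member in:", line.strip())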
> My new question regarding Ceph is whether it isolates the bad sectors where it found bad data while scrubbing, or will there always be a replica of something over a known bad block..?
Ceph OSDs don't know about bad sectors; they delegate IO to the filesystems beneath them. Some filesystems can recover from data corrupted on one drive (ZFS or BTRFS when using redundancy at their level), and those same filesystems will refuse to hand the Ceph OSD the data when they detect corruption and have no redundancy of their own to repair it with. Ceph detects this (usually during scrubs), and a manual Ceph repair will then rewrite data over the corrupted copy (at that point, if the underlying drive has detected a bad sector, it will not reuse it).
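As an illustration of that scrub-and-repair workflow, a minimal sketch, assuming the ceph CLI is configured for your cluster (the parsing of the health output is an assumption about its exact wording):

#!/usr/bin/env python3
# Minimal sketch: find placement groups that scrubbing flagged as
# inconsistent and ask Ceph to repair them. Assumes the ceph CLI is
# configured for the cluster.
import re
import subprocess

detail = subprocess.check_output(["ceph", "health", "detail"]).decode()

# Lines such as "pg 2.1f is active+clean+inconsistent, acting [0,3]"
# carry the ids of PGs that a scrub found inconsistent.
inconsistent = re.findall(r"pg (\S+) is [^\n]*inconsistent", detail)

for pgid in inconsistent:
    print("repairing", pgid)
    # "ceph pg repair" asks the OSDs to rewrite the inconsistent objects.
    subprocess.check_call(["ceph", "pg", "repair", pgid])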
Just forget about the hardware bad block remapping list. It got filled as soon as we started to use the drive .. :)
> > I also saw that Ceph uses metrics when capturing data from disks. When a disk is resetting or has problems, its metrics are going to be bad and the cluster will rank this OSD badly. But I didn't see any way of sending alerts or anything like that. SW-RAID has its mdadm monitor that alerts when things go bad. Do I have to be looking at the Ceph logs all the time to see when things go bad?
I'm not aware of any OSD "ranking".
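On the alerting part of the question, one option is a small polling script in the spirit of "mdadm --monitor". A minimal sketch, where the polling interval, the alert address and the local "mail" command are all assumptions:

#!/usr/bin/env python3
# Minimal sketch of an alerting loop comparable to "mdadm --monitor":
# poll the cluster health and send a mail when it is not HEALTH_OK.
# The interval, the recipient and the local "mail" command are assumptions.
import subprocess
import time

ALERT_ADDR = "root@localhost"  # assumption: where alerts should go

while True:
    health = subprocess.check_output(["ceph", "health"]).decode().strip()
    if not health.startswith("HEALTH_OK"):
        # Send the health summary by mail (requires a working local MTA).
        p = subprocess.Popen(["mail", "-s", "ceph health alert", ALERT_ADDR],
                             stdin=subprocess.PIPE)
        p.communicate(health.encode())
    time.sleep(300)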
Lionel
Does "weight" means the same?