Re: CEPH over SW-RAID


On Mon, Nov 23, 2015 at 5:26 PM, Lionel Bouton <lionel-subscription@xxxxxxxxxxx> wrote:
On 23/11/2015 19:58, Jose Tavares wrote:


On Mon, Nov 23, 2015 at 4:15 PM, Lionel Bouton <lionel-subscription@xxxxxxxxxxx> wrote:
Hi,

On 23/11/2015 18:37, Jose Tavares wrote:
> Yes, but with SW-RAID, when we have a block that was read and does not match its checksum, the device falls out of the array

I don't think so. Under normal circumstances a device only falls out of an md array if it doesn't answer IO queries after a timeout (md arrays only read from the smallest subset of devices needed to get the data; for performance reasons they don't verify redundancy on the fly). This may not be the case when you explicitly ask an array to perform a check, though (no first-hand check failure comes to mind).
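For reference, an explicit check can be requested through sysfs, after which md reports how many sectors disagreed between members. The following is a minimal Python sketch, not a monitoring tool. It assumes the array is md0 and that the `sync_action` and `mismatch_cnt` files exist under /sys/block/md0/md; the helper names are made up for illustration, and writing to sysfs requires root.

```python
from pathlib import Path


def request_check(md_dir):
    """Ask md to read all members and compare their copies (needs root)."""
    Path(md_dir, "sync_action").write_text("check\n")


def check_is_running(md_dir):
    """Return True while sync_action still reports a running check."""
    return Path(md_dir, "sync_action").read_text().strip() == "check"


def read_mismatch_count(md_dir):
    """Return the mismatch count recorded by the last check or repair."""
    return int(Path(md_dir, "mismatch_cnt").read_text().strip())


# Hypothetical usage on a real host (assumption: the array is md0):
#   md_dir = "/sys/block/md0/md"
#   request_check(md_dir)
#   ... wait until check_is_running(md_dir) is False ...
#   print(read_mismatch_count(md_dir))
```

A non-zero count only says that the copies differ somewhere; as noted above, md has no checksum to tell which copy is the good one.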

>, and the data is read again from the other devices in the array. The problem is that in SW-RAID1 we don't have the bad blocks isolated. The disks can be synchronized again as the write operation is not tested. The problem (device falling out of the array) will happen again if we try to read any other data written over the bad block.

With consumer-level SATA drives, bad blocks are handled internally nowadays: the drives remap bad sectors to a reserve by trying to copy their content there (this might fail and md might not have the opportunity to correct the error: it doesn't use checksums so it can't tell which drive has unaltered data, only which one doesn't answer IO queries).


Hmm, suppose the drive is unable to remap bad blocks internally. When you write data to the drive, it also writes the data checksum in hardware.

One weak data checksum which is not available to the kernel, yes. Filesystems and applications on top of them may use stronger checksums and handle read problems that the drives can't detect themselves.

When you read the data, the drive compares it to the checksum that was written previously. If the comparison fails, the drive will reset and the SW-RAID will drop the drive. This is how SATA drives work.

If it fails, AFAIK from past experience the drive doesn't reset by itself: the kernel driver in charge of the device receives an IO error and retries the IO several times. One of those later attempts might succeed (errors aren't always repeatable). Eventually, after a timeout, the driver will try to reset the interface with the drive and the drive itself (the kernel doesn't know where the problem is, only that it didn't get the result it was expecting).
While this happens, I believe the filesystem/md/lvm/... stack can receive an IO error (the timeout at their level might not be the same as the timeout at the device level). So some errors can be masked from md and some can percolate through. In the latter case, yes, the md array will drop the device.

 


> My new question regarding Ceph is whether it isolates the bad sectors where it found bad data when scrubbing, or whether there will always be a replica of something over a known bad block?
Ceph OSDs don't know about bad sectors; they delegate IO to the filesystems beneath them. Some filesystems can recover from corrupted data on one drive (ZFS or BTRFS when using redundancy at their level), and the same filesystems will refuse to give the Ceph OSD data when they detect corruption on non-redundant filesystems. Ceph detects this (usually during scrubs), and a manual Ceph repair will then rewrite data over the corrupted data (at that point, if the underlying drive detected a bad sector, it will not reuse it).
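The scrub-then-manual-repair flow described here can be sketched in Python. This is an illustration only: it assumes `ceph health detail` prints one line per affected placement group in the form `pg <id> is <state>, acting [...]`, which may differ between Ceph releases. The function names are hypothetical, and the sketch only builds the `ceph pg repair` commands without running them.

```python
import re
import subprocess

# Assumed line format: "pg 0.6 is active+clean+inconsistent, acting [0,1,2]"
PG_LINE = re.compile(r"^\s*pg (\S+) is (\S+), acting")


def inconsistent_pgs(health_detail):
    """Return the ids of PGs whose state includes 'inconsistent'."""
    pgs = []
    for line in health_detail.splitlines():
        match = PG_LINE.match(line)
        if match and "inconsistent" in match.group(2).split("+"):
            pgs.append(match.group(1))
    return pgs


def repair_commands(pg_ids):
    """Build one 'ceph pg repair <pgid>' command per PG, without running it."""
    return [["ceph", "pg", "repair", pg_id] for pg_id in pg_ids]


def fetch_health_detail():
    """Run 'ceph health detail' on a cluster node and return its output."""
    result = subprocess.run(
        ["ceph", "health", "detail"],
        capture_output=True, text=True, check=True,
    )
    return result.stdout
```

Keeping the repair step as a list of commands for an operator to review matches the point above that the repair is a manual action.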

Just forget about the hardware bad block remap list. It got filled as soon as we started to use the drive... :)

Then you can move this drive to the trash pile/ask for a replacement. It is basically unusable.

Why?
1 (or more) out of every 8 drives I see has the remap list full...
If you isolate the rest using software, you can continue to use the drive. There are no performance issues, etc.


 


 

> > I also saw that Ceph uses some metrics when capturing data from disks. When the disk is resetting or has problems, its metrics are going to be bad and the cluster will rank this OSD badly. But I didn't see any way of sending alerts or anything like that. SW-RAID has its mdadm monitor that alerts when things go bad. Do I have to keep looking at the Ceph logs all the time to see when things go bad?
I'm not aware of any osd "ranking".

Lionel


Does "weight" mean the same?

There are 2 weights I'm aware of: the crush weight for an OSD and the temporary OSD weight. The first is the basic weight used by crush to choose how to split your data (an OSD with a weight of 2 is expected to get roughly twice the amount of data of an OSD with a weight of 1 on a normal Ceph cluster). The second is used for temporary adjustments when an OSD gets temporarily overused (typically during cluster-wide rebalancing) and is reset when the OSD rejoins the cluster (marked in).

Neither of these weights has anything to do with the OSD underlying device health ("being bad").
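To make the crush weight proportion concrete, here is a small Python sketch. It is an idealized model under stated assumptions: it only considers crush weights, ignores the temporary OSD weight, and ignores the pseudo-random placement that makes real clusters only roughly proportional. The function name and the OSD names are hypothetical.

```python
def expected_share(crush_weights):
    """Return the idealized fraction of data each OSD should receive.

    crush_weights maps an OSD name to its crush weight. Real clusters
    only approximate these fractions.
    """
    total = sum(crush_weights.values())
    return {osd: weight / total for osd, weight in crush_weights.items()}


# Illustrative weights: osd.1 has twice the crush weight of the others.
shares = expected_share({"osd.0": 1.0, "osd.1": 2.0, "osd.2": 1.0})
for osd, share in shares.items():
    print(f"{osd}: {share:.0%}")
```

With these weights, osd.1 is expected to hold about twice the data of osd.0, independently of how healthy the disk behind either OSD is.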

Lionel

I don't know where I read about it... Maybe when I read about scrubbing.
 

_______________________________________________
ceph-users mailing list
ceph-users@xxxxxxxxxxxxxx
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
