Re: CEPH over SW-RAID

On Mon, Nov 23, 2015 at 4:07 PM, Jan Schermer <jan@xxxxxxxxxxx> wrote:
So I assume we _are_ talking about bit-rot?

> On 23 Nov 2015, at 18:37, Jose Tavares <jat@xxxxxxxxxxxx> wrote:
>
> Yes, but with SW-RAID, when a block is read and does not match its
> checksum, the device falls out of the array, and the data is read
> again from the other devices in the array.

That's not true. SW-RAID reads data from one drive only. Comparison of the data on the different drives only happens when a check is executed, and that doesn't help with bit-rot one bit :-) (The same goes for various SANs and arrays, but those usually employ additional CRCs for the data, so their effective BER is orders of magnitude lower.)
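A minimal sketch of how to observe this, assuming an array named md0, root privileges, and the standard md sysfs paths:

#!/usr/bin/env python3
# Hypothetical sketch: trigger an md "check" pass and report the mismatch
# count afterwards. Assumes an array named md0 and root privileges.
import pathlib
import time

MD = pathlib.Path("/sys/block/md0/md")

# Ask the md layer to read and compare all members; a "check" only counts
# mismatches, it does not rewrite anything ("repair" would).
(MD / "sync_action").write_text("check\n")

# Wait until the pass finishes (sync_action returns to "idle").
while (MD / "sync_action").read_text().strip() != "idle":
    time.sleep(10)

# mismatch_cnt is the number of sectors whose copies disagreed.
print("mismatched sectors:", (MD / "mismatch_cnt").read_text().strip())

Note that even when mismatch_cnt is non-zero, md has no way to tell which copy is the good one.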

 
SW-RAID reads data from one drive at a time, but the drive itself checksums its data in hardware.

"In daily business, your hard disk does write a checksum and some ECC information for every sector being written, and verifies this data during a read operation."

"If the disk is out of replacement sectors..." .. This is the most common scenario we see these days .. so, the OS must deal with this bad blocks.


 
> The problem is that in SW-RAID1
> we don't have the bad blocks isolated. The disks can be synchronized again, as
> the write operation is not verified. The problem (the device falling out of the
> array) will happen again if we try to read any other data written over the
> bad block.

Not true either. Bit-rot happens not (only) when the data gets written wrong, but when it is read: if you read the same block long enough, you will get wrong data once every $BER bits on average. Rewriting the data doesn't help.
(It's a bit different with some SSDs that don't refresh their blocks, so rewriting/refreshing them might help.)
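Illustrative arithmetic (the 10^14 figure is a typical vendor spec for consumer drives, not a number from this thread):

# Expected unrecoverable read errors for a given read volume at a
# vendor-spec BER of 1 error per 1e14 bits read.
ber = 1 / 1e14                 # errors per bit read
bytes_read = 10 * 1e12         # e.g. 10 TB of reads
expected_errors = ber * bytes_read * 8
print(f"expected unrecoverable read errors: {expected_errors:.2f}")
# ~0.8 per 10 TB read: reread the same data often enough and you will
# eventually get bad data back, regardless of what was written.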

>
> My new question regarding Ceph is whether it isolates the bad sectors where it
> found bad data when scrubbing, or whether there will always be a replica of
> something sitting on a known bad block?
>
> I also saw that Ceph uses some metrics when capturing data from disks. When
> a disk is resetting or having problems, its metrics are going to be bad and
> the cluster will rank this OSD badly. But I didn't see any way of sending
> alerts or anything like that. SW-RAID has its mdadm monitor that alerts
> when things go bad. Do I have to be watching the Ceph logs all the time
> to see when things go bad?

You should graph every drive and look for anomalies. Ceph only detects a problem when the drive is already close to unusable (typically, the ceph-osd process itself blocks for tens of seconds).
Ceph is not really good when it comes to latency SLAs, no matter how hard you try, but that's usually sufficient.
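There is no built-in equivalent of mdadm --monitor, so alerting has to be built on top. A minimal sketch, assuming the ceph CLI is available on the admin host; send_alert() is a placeholder for your mail/pager integration:

#!/usr/bin/env python3
# Hypothetical alerting sketch: poll `ceph health` and alert on any
# transition away from HEALTH_OK.
import subprocess
import time

def send_alert(msg: str) -> None:
    print("ALERT:", msg)  # placeholder: wire up mail/pager here

last = "HEALTH_OK"
while True:
    health = subprocess.run(
        ["ceph", "health"], capture_output=True, text=True
    ).stdout.strip()
    if health != "HEALTH_OK" and health != last:
        send_alert(health)
    last = health
    time.sleep(60)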

>
> Thanks.
> Jose Tavares
>
> On Mon, Nov 23, 2015 at 3:19 PM, Robert LeBlanc <robert@xxxxxxxxxxxxx>
> wrote:
>
>> Most people run their clusters with no RAID for the data disks (some
>> will run RAID for the journals, but we don't). We use the scrub
>> mechanism to find data inconsistencies, and we use three copies to do
>> RAID across hosts/racks, etc. Unless you have a specific need, it is best
>> to forgo Linux SW RAID, and even HW RAID, with Ceph.
>> ----------------
>> Robert LeBlanc
>> PGP Fingerprint 79A2 9CA4 6CC4 45DD A904  C70E E654 3BB2 FA62 B9F1
>>
>>
>> On Mon, Nov 23, 2015 at 10:09 AM, Jose Tavares  wrote:
>>> Hi guys ...
>>>
>>> Is there any advantage in running CEPH over a Linux SW-RAID to avoid data
>>> corruption due to disk bad blocks?
>>>
>>> Can we just rely on the scrubbing feature of CEPH? Can we live without an
>>> underlying layer that keeps hardware problems from being passed up to CEPH?
>>>
>>> I have a setup where I put one OSD per node on a 2-disk RAID-1
>>> setup. Is that a good option, or would it be better to have 2 OSDs, one on
>>> each disk? If I had one OSD per disk, I would have to increase the
>>> number of replicas to guarantee enough replicas if one node goes down.
>>>
>>> Thanks a lot.
>>> Jose Tavares
>>>
>>
>>
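A hedged sketch of the three-copy setup Robert describes above; the pool name "rbd" and the min_size value are illustrative assumptions, not from this thread:

#!/usr/bin/env python3
# Hypothetical sketch: keep three replicas of every object and let CRUSH
# spread them across hosts. Pool name "rbd" is an assumption.
import subprocess

def ceph(*args: str) -> str:
    return subprocess.run(
        ["ceph", *args], capture_output=True, text=True, check=True
    ).stdout

ceph("osd", "pool", "set", "rbd", "size", "3")      # three copies
ceph("osd", "pool", "set", "rbd", "min_size", "2")  # serve I/O with >= 2 up
print(ceph("osd", "pool", "get", "rbd", "size"))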


_______________________________________________
ceph-users mailing list
ceph-users@xxxxxxxxxxxxxx
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
