Re: CEPH over SW-RAID


So I assume we _are_ talking about bit-rot?

> On 23 Nov 2015, at 18:37, Jose Tavares <jat@xxxxxxxxxxxx> wrote:
> 
> Yes, but with SW-RAID, when we have a block that was read and does not
> match its checksum, the device falls out of the array, and the data is read
> again from the other devices in the array.

That's not true. SW-RAID reads data from one drive only. Comparison of the data on the different drives only happens when a check is executed, and that doesn't help with bit-rot one bit :-) (The same goes for various SANs and arrays, but those usually employ additional CRCs for the data, so their effective bit error rates are orders of magnitude better.)
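
To make that concrete: a read-and-compare pass on an md array only happens when you explicitly request one, typically by writing "check" to the array's sync_action attribute in sysfs. A rough Python sketch of that (my own illustration, not anything built into md tooling; assumes an array at /dev/md0 and needs root, and the 30-second poll interval is just a placeholder):

import time

MD = "/sys/block/md0/md"

def read_attr(name):
    with open(MD + "/" + name) as f:
        return f.read().strip()

def write_attr(name, value):
    with open(MD + "/" + name, "w") as f:
        f.write(value)

write_attr("sync_action", "check")          # start a read-and-compare pass
while read_attr("sync_action") != "idle":   # wait for md to finish the check
    time.sleep(30)

print("mismatch_cnt:", read_attr("mismatch_cnt"))

A non-zero mismatch_cnt only tells you that the copies disagree somewhere; md still has no way of knowing which copy is the correct one.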

> The problem is that in SW-RAID1
> we don't have the bad blocks isolated. The disks can be synchronized again as
> the write operation is not tested. The problem (the device falling out of the
> array) will happen again if we try to read any other data written over the
> bad block.

Not true either. Bit-rot happens not (only) when the data gets written wrong, but when it is read. If you read the same block for long enough, you will get wrong data once every $BER_bits. Rewriting the data doesn't help.
(It's a bit different with some SSDs that don't refresh blocks, so rewriting/refreshing them might help.)
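
For a feel of the numbers (my own back-of-the-envelope figures, not anything from the thread): a common consumer-HDD datasheet spec is one unrecoverable read error per 1e14 bits read, so on the order of 10 TB of reads already gives you close to one expected error:

BER = 1e-14                # unrecoverable read errors per bit; typical consumer-HDD spec
bytes_read = 10 * 10**12   # roughly 10 TB of reads

expected_errors = BER * bytes_read * 8   # 8 bits per byte
print(expected_errors)                   # ~0.8 expected bad reads

Rewriting the block doesn't change that read-side rate, which is why you need checksums plus a second copy to repair from.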

> 
> My new question regarding Ceph is whether it isolates these bad sectors where it
> found bad data when scrubbing, or whether there will always be a replica of
> something sitting on a known bad block?
> 
> I also saw that Ceph uses some metrics when collecting data from the disks. When
> a disk is resetting or having problems, its metrics are going to be bad and
> the cluster will rank this OSD badly. But I didn't see any way of sending
> alerts or anything like that. SW-RAID has its mdadm monitor that alerts
> when things go bad. Do I have to be watching the Ceph logs all the time
> to see when things go wrong?

You should graph every drive and look for anomalies. Ceph only detects a problem when the drive is already all but unusable (the ceph-osd process itself typically blocks for tens of seconds).
Ceph is not really good when it comes to latency SLAs no matter how much you try, but that level of detection is usually sufficient.
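
If you want mdadm-monitor-style alerting, you currently have to build it around the CLI yourself. A rough sketch of the idea in Python (my own example, not a built-in Ceph feature; the 500 ms threshold and plain print() are placeholders, and the JSON field names are from the releases I've looked at, so double-check them against your version):

import json
import subprocess

def ceph_json(*args):
    out = subprocess.check_output(("ceph",) + args + ("--format", "json"))
    return json.loads(out.decode())

# Overall cluster state: anything other than HEALTH_OK is worth an alert.
health = subprocess.check_output(["ceph", "health"]).decode().strip()
if not health.startswith("HEALTH_OK"):
    print("ALERT: cluster health is", health)

# Per-OSD commit/apply latencies as reported by "ceph osd perf".
perf = ceph_json("osd", "perf")
for osd in perf.get("osd_perf_infos", []):
    stats = osd.get("perf_stats", {})
    commit = stats.get("commit_latency_ms", 0)
    apply_ = stats.get("apply_latency_ms", 0)
    if commit > 500 or apply_ > 500:
        print("ALERT: osd.%s commit=%sms apply=%sms" % (osd.get("id"), commit, apply_))

Run that from cron, or feed the same numbers into whatever graphing and alerting you already have, and you get roughly the equivalent of mdadm --monitor for the cluster.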

> 
> Thanks.
> Jose Tavares
> 
> On Mon, Nov 23, 2015 at 3:19 PM, Robert LeBlanc <robert@xxxxxxxxxxxxx>
> wrote:
> 
>> Most people run their clusters with no RAID for the data disks (some
>> will run RAID for the journals, but we don't). We use the scrub
>> mechanism to find data inconsistencies, and we use three copies to do
>> RAID over hosts/racks, etc. Unless you have a specific need, it is best
>> to forgo Linux SW RAID, and even HW RAID, with Ceph.
>> - ----------------
>> Robert LeBlanc
>> PGP Fingerprint 79A2 9CA4 6CC4 45DD A904  C70E E654 3BB2 FA62 B9F1
>> 
>> 
>> On Mon, Nov 23, 2015 at 10:09 AM, Jose Tavares  wrote:
>>> Hi guys ...
>>> 
>>> Is there any advantage in running CEPH over Linux SW-RAID to avoid data
>>> corruption due to disk bad blocks?
>>> 
>>> Can we just rely on the scrubbing feature of CEPH? Can we live without an
>>> underlying layer that keeps hardware problems from being passed up to CEPH?
>>> 
>>> I have a setup where I put one OSD per node, and I have a 2-disk RAID-1
>>> setup. Is this a good option, or would it be better if I had 2 OSDs, one on
>>> each disk? If I had one OSD per disk, I would have to increase the number of
>>> replicas to guarantee enough replicas if one node goes down.
>>> 
>>> Thanks a lot.
>>> Jose Tavares
>>> 
>>> _______________________________________________
>>> ceph-users mailing list
>>> ceph-users@xxxxxxxxxxxxxx
>>> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>>> 
>> 
> _______________________________________________
> ceph-users mailing list
> ceph-users@xxxxxxxxxxxxxx
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com

_______________________________________________
ceph-users mailing list
ceph-users@xxxxxxxxxxxxxx
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


