Re: CEPH over SW-RAID

On Mon, Nov 23, 2015 at 4:07 PM, Jan Schermer <jan@xxxxxxxxxxx> wrote:
So I assume we _are_ talking about bit-rot?

> On 23 Nov 2015, at 18:37, Jose Tavares <jat@xxxxxxxxxxxx> wrote:
>
> Yes, but with SW-RAID, when a block is read and does not match its
> checksum, the device falls out of the array, and the data is read
> again from the other devices in the array.

That's not true. SW-RAID reads data from one drive only. Comparison of the data on the different drives only happens when a check is executed, and that doesn't help with bit-rot one bit :-) (The same goes for various SANs and arrays, but those usually employ additional CRCs for the data, so their effective BER is orders of magnitude lower.)
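A minimal sketch of how to observe this, assuming an array named md0, root privileges, and the standard md sysfs paths:

#!/usr/bin/env python3
# Hypothetical sketch: trigger an md "check" pass and report the mismatch
# count afterwards. Assumes an array named md0 and root privileges.
import pathlib
import time

MD = pathlib.Path("/sys/block/md0/md")

# Ask the md layer to read and compare all members; a "check" only counts
# mismatches, it does not rewrite anything ("repair" would).
(MD / "sync_action").write_text("check\n")

# Wait until the pass finishes (sync_action returns to "idle").
while (MD / "sync_action").read_text().strip() != "idle":
    time.sleep(10)

# mismatch_cnt is the number of sectors whose copies disagreed.
print("mismatched sectors:", (MD / "mismatch_cnt").read_text().strip())

Note that even when mismatch_cnt is non-zero, md has no way to tell which copy is the good one.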

 
SW-RAID reads data from one drive at a time, but the drive itself checksums its data in hardware.

"In daily business, your hard disk does write a checksum and some ECC information for every sector being written, and verifies this data during a read operation."

"If the disk is out of replacement sectors..." .. This is the most common scenario we see these days .. so, the OS must deal with this bad blocks.


 
> The problem is that in SW-RAID1
> we don't have the bad blocks isolated. The disks can be synchronized again, as
> the write operation is not verified. The problem (the device falling out of the
> array) will happen again if we try to read any other data written over the
> bad block.

Not true either. Bit-rot happens not (only) when the data gets written wrong, but when it is read: if you read the same block long enough, you will get wrong data once every $BER bits on average. Rewriting the data doesn't help.
(It's a bit different with some SSDs that don't refresh their blocks, so rewriting/refreshing them might help.)
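Illustrative arithmetic (the 10^14 figure is a typical vendor spec for consumer drives, not a number from this thread):

# Expected unrecoverable read errors for a given read volume at a
# vendor-spec BER of 1 error per 1e14 bits read.
ber = 1 / 1e14                 # errors per bit read
bytes_read = 10 * 1e12         # e.g. 10 TB of reads
expected_errors = ber * bytes_read * 8
print(f"expected unrecoverable read errors: {expected_errors:.2f}")
# ~0.8 per 10 TB read: reread the same data often enough and you will
# eventually get bad data back, regardless of what was written.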

>
> My new question regarding Ceph is whether it isolates the bad sectors where it
> found bad data when scrubbing, or whether there will always be a replica of
> something sitting on a known bad block?
>
> I also saw that Ceph uses some metrics when capturing data from disks. When
> a disk is resetting or having problems, its metrics are going to be bad and
> the cluster will rank this OSD badly. But I didn't see any way of sending
> alerts or anything like that. SW-RAID has its mdadm monitor that alerts
> when things go bad. Do I have to be watching the Ceph logs all the time
> to see when things go bad?

You should graph every drive and look for anomalies. Ceph only detects a problem when the drive is already close to unusable (typically, the ceph-osd process itself blocks for tens of seconds).
Ceph is not really good when it comes to latency SLAs, no matter how hard you try, but that's usually sufficient.
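There is no built-in equivalent of mdadm --monitor, so alerting has to be built on top. A minimal sketch, assuming the ceph CLI is available on the admin host; send_alert() is a placeholder for your mail/pager integration:

#!/usr/bin/env python3
# Hypothetical alerting sketch: poll `ceph health` and alert on any
# transition away from HEALTH_OK.
import subprocess
import time

def send_alert(msg: str) -> None:
    print("ALERT:", msg)  # placeholder: wire up mail/pager here

last = "HEALTH_OK"
while True:
    health = subprocess.run(
        ["ceph", "health"], capture_output=True, text=True
    ).stdout.strip()
    if health != "HEALTH_OK" and health != last:
        send_alert(health)
    last = health
    time.sleep(60)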

>
> Thanks.
> Jose Tavares
>
> On Mon, Nov 23, 2015 at 3:19 PM, Robert LeBlanc <robert@xxxxxxxxxxxxx>
> wrote:
>
>> Most people run their clusters with no RAID for the data disks (some
>> will run RAID for the journals, but we don't). We use the scrub
>> mechanism to find data inconsistencies, and we use three copies to do
>> RAID across hosts/racks, etc. Unless you have a specific need, it is best
>> to forgo Linux SW RAID, and even HW RAID, with Ceph.
>> ----------------
>> Robert LeBlanc
>> PGP Fingerprint 79A2 9CA4 6CC4 45DD A904  C70E E654 3BB2 FA62 B9F1
>>
>>
>> On Mon, Nov 23, 2015 at 10:09 AM, Jose Tavares  wrote:
>>> Hi guys ...
>>>
>>> Is there any advantage in running CEPH over a Linux SW-RAID to avoid data
>>> corruption due to disk bad blocks?
>>>
>>> Can we just rely on the scrubbing feature of CEPH? Can we live without an
>>> underlying layer that keeps hardware problems from being passed up to CEPH?
>>>
>>> I have a setup where I put one OSD per node on a 2-disk RAID-1
>>> setup. Is that a good option, or would it be better to have 2 OSDs, one on
>>> each disk? If I had one OSD per disk, I would have to increase the
>>> number of replicas to guarantee enough replicas if one node goes down.
>>>
>>> Thanks a lot.
>>> Jose Tavares
>>>
>>
>>
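A hedged sketch of the three-copy setup Robert describes above; the pool name "rbd" and the min_size value are illustrative assumptions, not from this thread:

#!/usr/bin/env python3
# Hypothetical sketch: keep three replicas of every object and let CRUSH
# spread them across hosts. Pool name "rbd" is an assumption.
import subprocess

def ceph(*args: str) -> str:
    return subprocess.run(
        ["ceph", *args], capture_output=True, text=True, check=True
    ).stdout

ceph("osd", "pool", "set", "rbd", "size", "3")      # three copies
ceph("osd", "pool", "set", "rbd", "min_size", "2")  # serve I/O with >= 2 up
print(ceph("osd", "pool", "get", "rbd", "size"))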


_______________________________________________
ceph-users mailing list
ceph-users@xxxxxxxxxxxxxx
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
