Re: NVMe and 2x Replica

Frank Schilder <frans@xxxxxx> · Fri, 5 Feb 2021 15:19:06 +0000

I don't run a secondary site and don't know if short windows of read-only access are terrible. From the data security point of view, min_size 2 is fine. It's min_size 1 that is really dangerous, because it accepts non-redundant writes.

Even if you lose the second site entirely, you can always re-sync from scratch - assuming decent network bandwidth.
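
To make the min_size distinction concrete, here is a minimal sketch in Python. It is a simplified model, not Ceph code: the function name pg_write_status and the returned labels are made up for illustration, and the model only looks at how many replicas of a placement group are currently available.

```python
def pg_write_status(size: int, min_size: int, replicas_up: int) -> str:
    """Simplified model of a replicated pool; not Ceph's actual peering logic."""
    if not (1 <= min_size <= size):
        raise ValueError("need 1 <= min_size <= size")
    if not (0 <= replicas_up <= size):
        raise ValueError("need 0 <= replicas_up <= size")
    if replicas_up < min_size:
        # Fewer copies available than min_size: the pool stops taking writes.
        return "writes refused"
    if replicas_up == 1:
        # Only reachable with min_size=1: new data exists in a single copy.
        return "writes accepted, NOT redundant"
    return "writes accepted, redundant"


# Walk through the pool settings discussed in this thread.
for size, min_size in [(2, 1), (2, 2), (3, 2)]:
    for up in range(size, 0, -1):
        status = pg_write_status(size, min_size, up)
        print(f"size={size} min_size={min_size} replicas_up={up}: {status}")
```

In this model the non-redundant state is never reachable with min_size=2: the pool stops accepting writes first, trading availability for safety.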

Best regards,
=================
Frank Schilder
AIT Risø Campus
Bygning 109, rum S14

________________________________________
From: Adam Boyhan <adamb@xxxxxxxxxx>
Sent: 05 February 2021 13:58:34
To: Frank Schilder
Cc: Jack; ceph-users
Subject: Re:  Re: NVMe and 2x Replica

This turned into a great thread.  Lots of good information and clarification.

I am 100% on board with 3 copies for the primary.

What does everyone think about possibly only doing 2 copies on the secondary? Keep in mind that I would keep min_size=2, which I think is reasonable for a secondary site.

________________________________
From: "Frank Schilder" <frans@xxxxxx>
To: "Jack" <ceph@xxxxxxxxxxxxxx>, "ceph-users" <ceph-users@xxxxxxx>
Sent: Friday, February 5, 2021 7:14:52 AM
Subject:  Re: NVMe and 2x Replica

> Picture this, using size=3, min_size=2:
> - One node is down for maintenance
> - You lose a couple of devices
> - You lose data
>
> Is it likely that an NVMe device dies during a short maintenance window?
> Is it likely that two devices die at the same time?

If you just look at it from this narrow point of view of fundamental laws of nature, then, yes, 2+1 is safe. As safe as nuclear power is when just looking at the laws of physics. So why then did Chernobyl and Fukushima happen? It's because it's operated by humans. If you look around, the No. 1 reason for losing data on Ceph, or entire clusters, is 2+1.

Look at the reasons. It's rarely a broken disk. A system designed without the redundancy that provides a margin for error will suffer from every little admin mistake, undetected race condition, bug in Ceph or bug in firmware. So, if the savings are worth the sweat, downtime and consultancy budget, why not?

Ceph has infinite uptime. During such a long period, low-probability events will happen with probability 1.
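
The arithmetic behind that statement can be sketched as follows. Assume, purely for illustration, that some data-loss event has a fixed and independent probability p in each year of operation. The chance of seeing it at least once in n years is then 1 - (1 - p)^n, which tends to 1 as n grows. The value p = 0.01 and the function name prob_at_least_once are arbitrary assumptions, not measured rates or anything from Ceph.

```python
def prob_at_least_once(p_per_year: float, years: int) -> float:
    """Chance of at least one event in `years` independent years."""
    return 1.0 - (1.0 - p_per_year) ** years


p = 0.01  # assumed per-year probability, illustrative only
for years in (1, 5, 10, 50, 100, 1000):
    print(f"{years:>5} years: {prob_at_least_once(p, years):.4f}")
```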

Best regards,
=================
Frank Schilder
AIT Risø Campus
Bygning 109, rum S14

________________________________________
From: Jack <ceph@xxxxxxxxxxxxxx>
Sent: 05 February 2021 12:48:33
To: ceph-users@xxxxxxx
Subject:  Re: NVMe and 2x Replica

In the end, this is nothing but a matter of probability.

Picture this, using size=3, min_size=2:
- One node is down for maintenance
- You lose a couple of devices
- You lose data

Is it likely that an NVMe device dies during a short maintenance window?
Is it likely that two devices die at the same time?

What are the numbers?
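
One rough way to put numbers on these questions, sketched in Python. Everything here is an assumption for illustration: a constant, independent per-device failure rate derived from an annualized failure rate (AFR), a fixed maintenance window, and a fixed count of devices still in service. It also ignores whether the failed devices actually share placement groups. The names p_device_fails and p_at_least are made up for this example.

```python
import math

HOURS_PER_YEAR = 8760.0


def p_device_fails(afr: float, window_hours: float) -> float:
    """Chance one device fails within the window, assuming a constant failure rate."""
    rate_per_hour = -math.log(1.0 - afr) / HOURS_PER_YEAR
    return 1.0 - math.exp(-rate_per_hour * window_hours)


def p_at_least(k_failures: int, devices: int, p: float) -> float:
    """Binomial tail: chance that at least k_failures of `devices` independent devices fail."""
    return sum(
        math.comb(devices, i) * p**i * (1.0 - p) ** (devices - i)
        for i in range(k_failures, devices + 1)
    )


# All values below are illustrative assumptions, not measurements.
afr = 0.01          # assumed 1% annualized failure rate per device
window_hours = 4.0  # assumed length of the maintenance window
devices = 40        # assumed number of devices still in service

p = p_device_fails(afr, window_hours)
print(f"size=2: one more failure in the window:  {p_at_least(1, devices, p):.3e}")
print(f"size=3: two more failures in the window: {p_at_least(2, devices, p):.3e}")
```

Independence is the weak point of such a model. Correlated failures (shared firmware, power events) and operator mistakes are not captured, so the output is an illustration, not a prediction.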

On 2/5/21 12:26 PM, Wido den Hollander wrote:
>
>
> On 04/02/2021 18:57, Adam Boyhan wrote:
>> All great input and points guys.
>>
>> Helps me lean towards 3 copies a bit more.
>>
>> I mean honestly NVMe cost per TB isn't that much more than SATA SSD
>> now. Somewhat surprised the salesmen aren't pitching 3x replication as
>> it makes them more money.
>
> To add to this, I have seen real cases as a Ceph consultant where size=2
> and min_size=1 on all flash led to data loss.
>
> Picture this:
>
> - One node is down (Maintenance, failure, etc, etc)
> - NVMe device in other node dies
> - You lose data
>
> Although you can bring back the other node which was down but not broken
> you are missing data. The data on the NVMe devices in there is outdated
> and thus the PGs will not become active.
>
> size=2 is only safe with min_size=2, but that doesn't really provide HA.
>
> The same goes with ZFS in mirror, raidz1, etc. If you lose one device
> the chances are real you lose the other device before the array has
> healed itself.
>
> With Ceph it's slightly more complex, but the same principles apply.
>
> No, with NVMe I still would highly advise against using size=2, min_size=1.
>
> The question is not if you will lose data, but the question is: When
> will you lose data? Within one year, 2? 3? 10?
>
> Wido
>
>>
>>
>>
>> From: "Anthony D'Atri" <anthony.datri@xxxxxxxxx>
>> To: "ceph-users" <ceph-users@xxxxxxx>
>> Sent: Thursday, February 4, 2021 12:47:27 PM
>> Subject:  Re: NVMe and 2x Replica
>>
>>> I searched each to find the section where 2x was discussed. What I
>>> found was interesting. First, there are really only 2 positions here:
>>> Micron's and Red Hat's. Supermicro copies Micron's position paragraph
>>> word for word. Not surprising considering that they are advertising a
>>> Supermicro / Micron solution.
>>
>> FWIW, at Cephalocon another vendor made a similar claim during a talk.
>>
>> * Failure rates are averages, not minima. Some drives will always fail
>> sooner
>> * Firmware and other design flaws can result in much higher rates of
>> failure or insidious UREs that can result in partial data
>> unavailability or loss
>> * Latent soft failures may not be detected until a deep scrub
>> succeeds, which could be weeks later
>> * In a distributed system, there are up/down/failure scenarios where
>> the location of even one good / canonical / latest copy of data is
>> unclear, especially when drive or HBA cache is in play.
>> * One of these is a power failure. Sure PDU / PSU redundancy helps,
>> but stuff happens, like a DC underprovisioning amps, so that a spike
>> in user traffic results in the whole row going down :-x Various
>> unpleasant things can happen.
>>
>> I was championing R3 even pre-Ceph when I was using ZFS or HBA RAID.
>> As others have written, as drives get larger the time to fill them
>> with replica data increases, as does the chance of overlapping
>> failures. I’ve experienced R2 overlapping failures more than once,
>> with and before Ceph.
>>
>> My sense has been that not many people run R2 for data they care
>> about, and as has been written recently 2,2 EC is safer with the same
>> raw:usable ratio. I’ve figured that vendors make R2 statements like
>> these as a selling point to assert lower TCO. My first response is
>> often “How much would it cost you directly, and indirectly in terms of
>> user / customer goodwill, to lose data?”.
>>
>>> Personally, this looks like marketing BS to me. SSD shops want to
>>> sell SSDs, but because of the cost difference they have to convince
>>> buyers that their products are competitive.
>>
>> ^this. I’m watching the QLC arena with interest for the potential to
>> narrow the CapEx gap. Durability has been one concern, though I’m
>> seeing newer products claiming that e.g. ZNS improves that. It also
>> seems that there are something like what, *4* separate EDSFF / ruler
>> form factors. I really want to embrace those, e.g. for object clusters,
>> but I’m VERY wary of the longevity of competing standards and any
>> single-source for chassis or drives.
>>
>>> Our products cost twice as much, but LOOK you only need 2/3 as many,
>>> and you get all these other benefits (performance). Plus, if you
>>> replace everything in 2 or 3 years anyway, then you won't have to
>>> worry about them failing.
>>
>> Refresh timelines. You’re funny ;) Every time, every single time, that
>> I’ve worked in an organization that claims a 3-year (or 5, or whatever)
>> hardware refresh cycle, it hasn’t happened. When you start getting
>> close, the capex doesn’t materialize, or the opex cost of DC hands and
>> operational oversight. “How do you know that the drives will start
>> failing or getting slower? Let’s revisit this in 6 months”. Etc.
>>
>> _______________________________________________
>> ceph-users mailing list -- ceph-users@xxxxxxx
>> To unsubscribe send an email to ceph-users-leave@xxxxxxx