Re: risk mitigation in 2 replica clusters

I disagree that replica 2 will ever truly be sane if you care about your data.  The biggest issue with replica 2 has nothing to do with drive failures, restarting OSDs/nodes, power outages, etc.  The biggest issue with replica 2 is min_size.  If you set min_size to 2, then your data becomes unavailable whenever any copy of it is offline.  That's fine, since you were probably going to set min_size to 1 anyway... which you should never do, ever, unless you don't care about your data.
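To make the min_size trade-off concrete, here is a toy model (an illustrative sketch only, not Ceph's actual implementation) of how min_size gates I/O on a PG:

```python
# Toy model of how Ceph's min_size gates writes to a PG.
# This is a simplification for illustration, not Ceph code.

def pg_accepts_writes(replicas_up: int, min_size: int) -> bool:
    """A PG goes inactive (blocks I/O) when fewer than min_size
    replicas of it are currently available."""
    return replicas_up >= min_size

# size=2, min_size=2: losing either copy blocks all I/O to the PG.
assert pg_accepts_writes(replicas_up=2, min_size=2) is True
assert pg_accepts_writes(replicas_up=1, min_size=2) is False

# size=2, min_size=1: writes continue on a single surviving copy --
# the dangerous configuration discussed above.
assert pg_accepts_writes(replicas_up=1, min_size=1) is True
```

So with replica 2 your only choices are "block I/O whenever one copy is down" or "accept writes with no redundancy at all".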

Too many pronouns, so let's say disk 1 and disk 2 are in charge of a PG and are the only two disks with a copy of the data.
The problem with a min_size of 1: suppose disk 1 becomes inaccessible for any reason and a write lands on disk 2.  Before disk 1 is fully backfilled and caught up on all of the writes, disk 2 goes down.  Now your data is inaccessible, but that's not the real issue.  The issue is when disk 1 comes back up first and the client tries to read the data it wrote earlier to disk 2... except the data isn't there.  The client probably just logs an error somewhere and continues, making some writes to disk 1 before disk 2 finishes coming back up.  What can those two disks possibly do to ensure your data is consistent once both of them are back up?
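The timeline above can be sketched as a toy simulation (only tracking which writes each copy holds; "w1"/"w2"/"w3" are hypothetical write names):

```python
# Toy timeline of the split-brain scenario with size=2, min_size=1.
# Each set records which client writes that copy of the PG holds.

disk1: set = set()
disk2: set = set()

# Both copies healthy: write w1 replicates to both disks.
disk1.add("w1"); disk2.add("w1")

# disk1 goes down; min_size=1 lets write w2 land on disk2 alone.
disk2.add("w2")

# disk2 dies before backfill completes; disk1 comes back first and
# serves stale data -- the client cannot see w2, errors, moves on,
# and writes w3 to disk1 alone.
disk1.add("w3")

# disk2 returns: the two copies have diverged and NEITHER is complete.
assert disk1 == {"w1", "w3"}
assert disk2 == {"w1", "w2"}
assert disk1 != disk2  # no reconciliation without discarding a write
```

With only two copies and no checksummed third opinion, there is no quorum to decide which history wins.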

Now of course we reach THE QUESTION: how likely is this to ever happen, and what sort of things could cause it other than disk failures or performing maintenance on your cluster?  The answer is that it's more common than you'd like to think.  Does your environment have enough RAM in your OSD nodes to handle recovery without cycling into an OOM-killer scenario?  Will you ever hit a bug in the code that causes an operation on a PG to segfault an OSD?  Both of those have happened to multiple clusters I've managed, and I've read about them on the ML in the last year.  A min_size of 1 would very likely lead to data loss in either situation, regardless of power failures and disk failures.

Now let's touch back on disk failures.  While backfilling due to adding storage, removing storage, or rebalancing your cluster, you are much more likely to lose drives.  During normal operation I would lose about 6 drives per year across 2000+ OSDs.  During backfilling (especially when adding multiple storage nodes), I would lose closer to 1-3 drives per major backfill operation.
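Some back-of-envelope arithmetic on those figures (assuming failures are roughly uniform across the fleet, and taking a few days as a hypothetical backfill window):

```python
# Rough arithmetic on the failure figures quoted above.
# Assumption: failures spread roughly uniformly over 2000 OSDs.

osds = 2000
normal_failures_per_year = 6
per_drive_annual = normal_failures_per_year / osds  # 0.003, i.e. 0.3%/drive/yr

# Steady-state expectation for a hypothetical 3-day backfill window:
backfill_days = 3
expected_in_window = normal_failures_per_year * backfill_days / 365

print(per_drive_annual)              # 0.003
print(round(expected_in_window, 3))  # 0.049

# Observing 1-3 failures per major backfill vs ~0.05 expected means
# backfill stress elevates the failure rate by well over an order of
# magnitude -- exactly when a 2r pool has the least margin.
```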

People keep asking about 2 replicas.  People keep saying it's going to be viable with bluestore.  I care about my data too much to ever consider it.  If I were running a cluster where data loss was acceptable, then I would absolutely consider it.  If you're thinking about 5 nines of uptime, then 2 replicas will achieve that.  If you're talking about 100% data integrity, then 2 replicas are not AND WILL NEVER BE for you (no matter what the release docs say about bluestore).  If space is your concern, start looking into erasure coding.  You can save more space and increase redundancy for the cost of some performance.
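The space argument for erasure coding is simple arithmetic: replication stores `size` full copies, while a k+m erasure-code profile stores (k+m)/k of the data and survives m simultaneous losses.  A quick comparison (plain arithmetic, no Ceph specifics; k=4, m=2 is just an example profile):

```python
# Raw-space overhead: replication vs erasure coding.

def replica_overhead(size: int) -> float:
    """Replication stores `size` full copies of every object."""
    return float(size)

def ec_overhead(k: int, m: int) -> float:
    """A k+m profile splits data into k chunks plus m parity chunks,
    so raw usage is (k+m)/k times the logical data size."""
    return (k + m) / k

print(replica_overhead(2))  # 2.0 -- survives 1 loss (barely, as above)
print(ec_overhead(4, 2))    # 1.5 -- survives 2 losses on less raw space
```

So an example 4+2 profile tolerates more simultaneous failures than replica 2 while using 25% less raw space, at the cost of extra CPU and recovery traffic.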

On Wed, Jun 21, 2017 at 10:56 AM <ceph@xxxxxxxxxxxxxx> wrote:
2r on filestore == "I do not care about my data"

This is not because of OSD's failure chance

When you have a write error (i.e. data is badly written to the disk,
with no error reported), your data is just corrupted without hope of
redemption

Just as you expect your drives to die, expect your drives to "fail silently"

With replica 3 and beyond, data CAN be repaired using quorum

Replica 2 will become sane in the next release, with bluestore, which
uses data checksums

On 21/06/2017 16:51, Blair Bethwaite wrote:
> Hi all,
>
> I'm doing some work to evaluate the risks involved in running 2r storage
> pools. On the face of it my naive disk failure calculations give me 4-5
> nines for a 2r pool of 100 OSDs (no copyset awareness, i.e., secondary disk
> failure based purely on chance of any 1 of the remaining 99 OSDs failing
> within recovery time). 5 nines is just fine for our purposes, but of course
> multiple disk failures are only part of the story.
>
> The more problematic issue with 2r clusters is that any time you do planned
> maintenance (our clusters spend much more time degraded because of regular
> upkeep than because of real failures) you're suddenly drastically
> increasing the risk of data-loss. So I find myself wondering if there is a
> way to tell Ceph I want an extra replica created for a particular PG or set
> thereof, e.g., something that would enable the functional equivalent of:
> "this OSD/node is going to go offline so please create a 3rd replica in
> every PG it is participating in before we shutdown that/those OSD/s"...?
>
>
>
> _______________________________________________
> ceph-users mailing list
> ceph-users@xxxxxxxxxxxxxx
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>

_______________________________________________
ceph-users mailing list
ceph-users@xxxxxxxxxxxxxx
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
