You have a point; it depends on your needs. Based on recovery time and
usage, I may find it acceptable to lock writes during recovery. Thank you
for that insight.

On 21/06/2017 18:47, David Turner wrote:
> I disagree that replica 2 will ever truly be sane if you care about your
> data. The biggest issue with replica 2 has nothing to do with drive
> failures, restarting OSDs/nodes, power outages, etc. The biggest issue
> with replica 2 is min_size. If you set min_size to 2, then your data is
> locked whenever any copy of it is unavailable. That's fine, since you
> were probably going to set min_size to 1... which you should never do,
> ever, unless you don't care about your data.
>
> Too many pronouns, so let's say disk 1 and disk 2 are in charge of a PG
> and are the only two disks with a copy of the data.
> The problem with a min_size of 1 is this: if for any reason disk 1 is
> inaccessible and a write is made to disk 2, and then disk 2 goes down
> before disk 1 is fully backfilled and caught up on all of the writes,
> well, now your data is inaccessible, but that's not the real issue. The
> issue is when disk 1 comes back up first and the client tries to access
> the data it wrote earlier to disk 2... except the data isn't there. The
> client is probably just showing an error somewhere and continuing. Now
> it makes some writes to disk 1 before disk 2 finishes coming back up.
> What can these two disks possibly do to ensure that your data is
> consistent when both of them are back up?
>
> Now of course we reach THE QUESTION... How likely is this to ever
> happen, and what sort of things could cause it if not disk failures or
> performing maintenance on your cluster? The answer: it's more common
> than you'd like to think. Does your environment have enough RAM in your
> OSD nodes to adequately handle recovery without cycling into an
> OOM-killer scenario? Will you ever hit a bug in the code that causes an
> operation on a PG to segfault an OSD?
> Those are both things that have happened to multiple clusters I've
> managed, and that I've read about on the ML in the last year. A
> min_size of 1 would very likely lead to data loss in either situation,
> regardless of power failures and disk failures.
>
> Now let's touch back on disk failures. While backfilling due to adding
> storage, removing storage, or just rebalancing your cluster, you are
> much more likely to lose drives. During normal operation in a cluster,
> I would lose about 6 drives in a year (2000+ OSDs). During backfilling
> (especially when adding multiple storage nodes), I would lose closer to
> 1-3 drives per major backfilling operation.
>
> People keep asking about 2 replicas. People keep saying it's going to
> be viable with bluestore. I care about my data too much to ever
> consider it. If I were running a cluster where data loss was
> acceptable, then I would absolutely consider it. If you're thinking
> about 5 nines of uptime, then 2 replicas will achieve that. If you're
> talking about 100% data integrity, then 2 replicas is not, AND WILL
> NEVER BE, for you (no matter what the release docs say about
> bluestore). If space is your concern, start looking into erasure
> coding. You can save more space and increase redundancy for the cost of
> some performance.
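To make the knobs discussed above concrete, here is a hedged sketch using
the standard ceph CLI. The pool names and PG counts are placeholders, and
these commands need a running cluster with an admin keyring, so they are
shown for reference only, not as a tested recipe:

```shell
# Replicated pool with 2 copies ("mypool" and the PG counts are placeholders).
ceph osd pool create mypool 128 128 replicated
ceph osd pool set mypool size 2

# min_size 2: I/O to a PG blocks whenever either copy is offline. Safe for
# the data, but any single OSD outage stops I/O for the PGs it holds.
ceph osd pool set mypool min_size 2

# min_size 1: I/O continues on a single surviving copy, which opens the
# divergent-writes window described above. Strongly discouraged.
ceph osd pool set mypool min_size 1

# The erasure-coding alternative mentioned above: k=4 data chunks plus
# m=2 coding chunks tolerates two failures at 1.5x raw space, versus the
# 3x of replica 3.
ceph osd erasure-code-profile set ec-4-2 k=4 m=2
ceph osd pool create ecpool 128 128 erasure ec-4-2
```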
>
> On Wed, Jun 21, 2017 at 10:56 AM <ceph@xxxxxxxxxxxxxx> wrote:
>
>> 2r on filestore == "I do not care about my data"
>>
>> This is not because of the OSDs' failure chance.
>>
>> When you have a write error (i.e. data is badly written to the disk,
>> with no error reported), your data is simply corrupted, without hope
>> of redemption.
>>
>> Just as you expect your drives to die, expect your drives to "fail
>> silently".
>>
>> With replica 3 and beyond, data CAN be repaired using quorum.
>>
>> Replica 2 will become sane in the next release, with bluestore, which
>> uses data checksums.
>>
>> On 21/06/2017 16:51, Blair Bethwaite wrote:
>>> Hi all,
>>>
>>> I'm doing some work to evaluate the risks involved in running 2r
>>> storage pools. On the face of it, my naive disk failure calculations
>>> give me 4-5 nines for a 2r pool of 100 OSDs (no copyset awareness,
>>> i.e., secondary disk failure based purely on the chance of any 1 of
>>> the remaining 99 OSDs failing within the recovery time). 5 nines is
>>> just fine for our purposes, but of course multiple disk failures are
>>> only part of the story.
>>>
>>> The more problematic issue with 2r clusters is that any time you do
>>> planned maintenance (our clusters spend much more time degraded
>>> because of regular upkeep than because of real failures) you're
>>> suddenly drastically increasing the risk of data loss. So I find
>>> myself wondering if there is a way to tell Ceph I want an extra
>>> replica created for a particular PG or set thereof, e.g., something
>>> that would enable the functional equivalent of: "this OSD/node is
>>> going to go offline, so please create a 3rd replica in every PG it is
>>> participating in before we shut down that/those OSD/s"...?
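As far as I know there is no per-PG replica count in Ceph, so the closest
equivalents to the request above act on the whole pool or cluster. A
hedged sketch, with "mypool" as a placeholder name (these need a running
cluster and are shown for reference only):

```shell
# Option 1: raise the whole pool to 3 replicas before maintenance. This
# triggers backfill for every PG in the pool, not just those on the node
# going offline, so it is expensive on large pools.
ceph osd pool set mypool size 3
# ...after maintenance, drop back:
ceph osd pool set mypool size 2

# Option 2: for short maintenance windows, suppress re-replication while
# the node is down instead of adding a replica.
ceph osd set noout
# ...do the maintenance, bring the OSDs back online, then:
ceph osd unset noout
```

Note that `noout` avoids backfill churn but leaves the affected PGs
degraded for the duration, which is exactly the risk window this thread
is worried about.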
>>>
>>> _______________________________________________
>>> ceph-users mailing list
>>> ceph-users@xxxxxxxxxxxxxx
>>> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
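The "naive" estimate Blair describes (chance that any of the remaining 99
OSDs fails within the recovery window) can be sketched numerically. All
inputs below are illustrative assumptions, not numbers from the thread: a
2% annual failure rate per disk and a 24-hour recovery window. It also
ignores copysets, PG placement, and the fraction of data actually lost
per incident, all of which change the final nines figure.

```shell
awk 'BEGIN {
  afr = 0.02                  # assumed annual failure rate per disk
  n = 100                     # OSDs in the pool
  recovery_h = 24             # assumed recovery window, in hours
  hours_per_year = 24 * 365

  # chance one specific surviving OSD fails inside the recovery window
  p_second = 1 - (1 - afr) ^ (recovery_h / hours_per_year)

  # chance any of the other 99 OSDs fails inside that window
  p_any_second = 1 - (1 - p_second) ^ (n - 1)

  # expected first failures per year (n * afr), times the chance each one
  # turns into a co-incident double failure
  p_loss_year = n * afr * p_any_second

  printf "annual chance of a co-incident double failure: %.4f\n", p_loss_year
}'
```

With these placeholder inputs the incident rate comes out around 1% per
year; dividing the loss per incident by the number of PGs (only a slice
of the pool's data lives on any one pair of disks) is what pushes such
estimates toward the 4-5 nines of durability mentioned above.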