You have a point; it depends on your needs. Based on recovery time and
usage, I may find it acceptable to lock writes during recovery. Thank you
for that insight.

On 21/06/2017 18:47, David Turner wrote:
> I disagree that replica 2 will ever truly be sane if you care about your
> data. The biggest issue with replica 2 has nothing to do with drive
> failures, restarting OSDs/nodes, power outages, etc. The biggest issue
> with replica 2 is min_size. If you set min_size to 2, then your data is
> locked whenever any copy of it is unavailable. That's fine, since you
> were probably going to set min_size to 1... which you should never do,
> ever, unless you don't care about your data.
>
> Too many pronouns, so let's say disk 1 and disk 2 are in charge of a PG
> and are the only two disks with a copy of the data.
> The problem with a min_size of 1 is this: if for any reason disk 1 is
> inaccessible and a write is made to disk 2, and then disk 2 goes down
> before disk 1 is fully backfilled and caught up on all of the writes,
> well, now your data is inaccessible, but that's not the real issue. The
> issue is when disk 1 comes back up first and the client tries to access
> the data it wrote earlier to disk 2... except the data isn't there. The
> client is probably just showing an error somewhere and continuing. Now
> it makes some writes to disk 1 before disk 2 finishes coming back up.
> What can these two disks possibly do to ensure that your data is
> consistent when both of them are back up?
>
> Now of course we reach THE QUESTION... How likely is this to ever
> happen, and what sort of things could cause it if not disk failures or
> performing maintenance on your cluster? The answer: it's more common
> than you'd like to think. Does your environment have enough RAM in your
> OSD nodes to adequately handle recovery without cycling into an
> OOM-killer scenario? Will you ever hit a bug in the code that causes an
> operation on a PG to segfault an OSD?
> Those are both things that have happened to multiple clusters I've
> managed, and that I've read about on the ML in the last year. A
> min_size of 1 would very likely lead to data loss in either situation,
> regardless of power failures and disk failures.
>
> Now let's touch back on disk failures. While backfilling due to adding
> storage, removing storage, or just rebalancing your cluster, you are
> much more likely to lose drives. During normal operation in a cluster,
> I would lose about 6 drives in a year (2000+ OSDs). During backfilling
> (especially when adding multiple storage nodes), I would lose closer to
> 1-3 drives per major backfilling operation.
>
> People keep asking about 2 replicas. People keep saying it's going to
> be viable with bluestore. I care about my data too much to ever
> consider it. If I were running a cluster where data loss was
> acceptable, then I would absolutely consider it. If you're thinking
> about 5 nines of uptime, then 2 replicas will achieve that. If you're
> talking about 100% data integrity, then 2 replicas is not, AND WILL
> NEVER BE, for you (no matter what the release docs say about
> bluestore). If space is your concern, start looking into erasure
> coding. You can save more space and increase redundancy for the cost of
> some performance.
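To make the knobs discussed above concrete, here is a hedged sketch using
the standard ceph CLI. The pool names and PG counts are placeholders, and
these commands need a running cluster with an admin keyring, so they are
shown for reference only, not as a tested recipe:

```shell
# Replicated pool with 2 copies ("mypool" and the PG counts are placeholders).
ceph osd pool create mypool 128 128 replicated
ceph osd pool set mypool size 2

# min_size 2: I/O to a PG blocks whenever either copy is offline. Safe for
# the data, but any single OSD outage stops I/O for the PGs it holds.
ceph osd pool set mypool min_size 2

# min_size 1: I/O continues on a single surviving copy, which opens the
# divergent-writes window described above. Strongly discouraged.
ceph osd pool set mypool min_size 1

# The erasure-coding alternative mentioned above: k=4 data chunks plus
# m=2 coding chunks tolerates two failures at 1.5x raw space, versus the
# 3x of replica 3.
ceph osd erasure-code-profile set ec-4-2 k=4 m=2
ceph osd pool create ecpool 128 128 erasure ec-4-2
```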
>
> On Wed, Jun 21, 2017 at 10:56 AM <ceph@xxxxxxxxxxxxxx> wrote:
>
>> 2r on filestore == "I do not care about my data"
>>
>> This is not because of the OSDs' failure chance.
>>
>> When you have a write error (i.e. data is badly written to the disk,
>> with no error reported), your data is simply corrupted, without hope
>> of redemption.
>>
>> Just as you expect your drives to die, expect your drives to "fail
>> silently".
>>
>> With replica 3 and beyond, data CAN be repaired using quorum.
>>
>> Replica 2 will become sane in the next release, with bluestore, which
>> uses data checksums.
>>
>> On 21/06/2017 16:51, Blair Bethwaite wrote:
>>> Hi all,
>>>
>>> I'm doing some work to evaluate the risks involved in running 2r
>>> storage pools. On the face of it, my naive disk failure calculations
>>> give me 4-5 nines for a 2r pool of 100 OSDs (no copyset awareness,
>>> i.e., secondary disk failure based purely on the chance of any 1 of
>>> the remaining 99 OSDs failing within the recovery time). 5 nines is
>>> just fine for our purposes, but of course multiple disk failures are
>>> only part of the story.
>>>
>>> The more problematic issue with 2r clusters is that any time you do
>>> planned maintenance (our clusters spend much more time degraded
>>> because of regular upkeep than because of real failures) you're
>>> suddenly drastically increasing the risk of data loss. So I find
>>> myself wondering if there is a way to tell Ceph I want an extra
>>> replica created for a particular PG or set thereof, e.g., something
>>> that would enable the functional equivalent of: "this OSD/node is
>>> going to go offline, so please create a 3rd replica in every PG it is
>>> participating in before we shut down that/those OSD/s"...?
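As far as I know there is no per-PG replica count in Ceph, so the closest
equivalents to the request above act on the whole pool or cluster. A
hedged sketch, with "mypool" as a placeholder name (these need a running
cluster and are shown for reference only):

```shell
# Option 1: raise the whole pool to 3 replicas before maintenance. This
# triggers backfill for every PG in the pool, not just those on the node
# going offline, so it is expensive on large pools.
ceph osd pool set mypool size 3
# ...after maintenance, drop back:
ceph osd pool set mypool size 2

# Option 2: for short maintenance windows, suppress re-replication while
# the node is down instead of adding a replica.
ceph osd set noout
# ...do the maintenance, bring the OSDs back online, then:
ceph osd unset noout
```

Note that `noout` avoids backfill churn but leaves the affected PGs
degraded for the duration, which is exactly the risk window this thread
is worried about.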
>>>
>>> _______________________________________________
>>> ceph-users mailing list
>>> ceph-users@xxxxxxxxxxxxxx
>>> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
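The "naive" estimate Blair describes (chance that any of the remaining 99
OSDs fails within the recovery window) can be sketched numerically. All
inputs below are illustrative assumptions, not numbers from the thread: a
2% annual failure rate per disk and a 24-hour recovery window. It also
ignores copysets, PG placement, and the fraction of data actually lost
per incident, all of which change the final nines figure.

```shell
awk 'BEGIN {
  afr = 0.02                  # assumed annual failure rate per disk
  n = 100                     # OSDs in the pool
  recovery_h = 24             # assumed recovery window, in hours
  hours_per_year = 24 * 365

  # chance one specific surviving OSD fails inside the recovery window
  p_second = 1 - (1 - afr) ^ (recovery_h / hours_per_year)

  # chance any of the other 99 OSDs fails inside that window
  p_any_second = 1 - (1 - p_second) ^ (n - 1)

  # expected first failures per year (n * afr), times the chance each one
  # turns into a co-incident double failure
  p_loss_year = n * afr * p_any_second

  printf "annual chance of a co-incident double failure: %.4f\n", p_loss_year
}'
```

With these placeholder inputs the incident rate comes out around 1% per
year; dividing the loss per incident by the number of PGs (only a slice
of the pool's data lives on any one pair of disks) is what pushes such
estimates toward the 4-5 nines of durability mentioned above.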