Re: CEPH failure domain - power considerations

Phil, this would be an excellent contribution to the blog or the introductory documentation. I’ve been using Ceph for over a year, and this brought together a lot of concepts that I hadn’t seen related so succinctly before.

One of the things I hadn’t really conceptualized well was “why a size of 3?” I knew that PGs go read-only without a quorum of OSDs to write to, but this is a much simpler way to think about it.
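
To make that intuition concrete, here is a toy Python sketch of the “why 3?” logic, assuming the usual size=3 / min_size=2 defaults for a replicated pool. It’s a simplified model of PG availability, not Ceph’s actual peering code:

    # Toy model: with size=3 and min_size=2, a PG keeps accepting writes
    # through one OSD failure and still has a surviving copy after a
    # second.  Simplified illustration only.
    SIZE, MIN_SIZE = 3, 2

    def pg_state(replicas_up: int) -> str:
        if replicas_up >= SIZE:
            return "active+clean: reads and writes"
        if replicas_up >= MIN_SIZE:
            return "degraded: reads and writes, recovery pending"
        if replicas_up >= 1:
            return "below min_size: writes refused until recovery"
        return "all copies lost"

    for up in (3, 2, 1, 0):
        print(up, "replica(s) up ->", pg_state(up))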

Something I have been experimenting with that might also be interesting to the discussion is “when to use redundancy at all”. Kafka is a good example of “eventually consistent” software that is designed to survive complete node failure and to deliver very high throughput, and it already replicates its data across brokers. If Kafka is backed by a replicated Ceph pool, I’ve come to believe that is suboptimal compared to running three Kafka instances, each on an unreplicated pool in Ceph.

The logical question is “why use Ceph at all, then?” To me, this is about centralized management: if I am building with Ceph in most places, using it everywhere creates operational consistency. (Modifying the CRUSH map is the path to unreplicated storage that is pinned to the specific machine that also runs the Kafka workload.)
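
For what it’s worth, here is a rough Python sketch of that CRUSH pinning, just shelling out to the ceph CLI. The host bucket, rule, and pool names are made up, it assumes the host bucket already exists in the CRUSH map, and recent releases also want mon_allow_pool_size_one=true before they will accept size 1:

    # Pin an unreplicated pool to one host: a CRUSH rule rooted at that
    # host's bucket, a pool bound to the rule, then size dropped to 1.
    import subprocess

    HOST_BUCKET = "kafka-host-a"               # hypothetical host bucket
    RULE, POOL = "kafka-a-only", "kafka-a-logs"

    def ceph(*args: str) -> None:
        subprocess.run(["ceph", *args], check=True)

    ceph("osd", "crush", "rule", "create-replicated", RULE, HOST_BUCKET, "osd")
    ceph("osd", "pool", "create", POOL, "32", "32", "replicated", RULE)
    ceph("osd", "pool", "set", POOL, "size", "1", "--yes-i-really-mean-it")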

At any rate, eventually consistent software packages can provide additional options for meeting top-level failure-domain requirements.

Brian

> On May 29, 2020, at 10:48 AM, DHilsbos@xxxxxxxxxxxxxx wrote:
> 
> Phil;
> 
> I like to refer to basic principles and design assumptions/choices when considering things like this.  I also like to refer to more broadly understood technologies.  Finally, I'm still relatively new to Ceph, so here goes...
> 
> TLDR: Ceph is (likes to be) double-redundant (like RAID-6), while dual power (n+1) is single-redundant.
> 
> Like RAID, Ceph (or more precisely a Ceph pool) can be in, and moves through, the following states:
> 
> Normal --> Partially Failed (degraded) --> Recovering --> Normal.
> 
> When talking about these systems, we often gloss over Recovery, acting as if it takes no time.  Recovery does take time though, and if anything ELSE happens while recovery is ongoing, what can the software do?
> 
> Think RAID-5: what happens if a drive fails in a RAID-5 array, and during recovery an unreadable block is found on another drive?  That's single redundancy.  With RAID-6, the array drops to its second level of redundancy, and the recovery continues.
> 
> As a result of the long recovery times expected of modern large hard drives, Ceph pushes for double redundancy (3x replication, 5+2 EC).  Further, it reduces availability step by step as redundancy is degraded (i.e., when the first layer of redundancy is compromised, writes are still allowed; when the second is lost, writes are disallowed but reads are allowed; only when all three layers are compromised are reads disallowed).
> 
> Dual power feeds (n+1) are only single-redundant, thus the entire system can't achieve better than single redundancy.  Depending on the reliability of the power and your service guarantees, this may be acceptable.
> 
> If you add automatic transfer switches (ATSs), then you need to look at the failure rate (MTBF, or similar) to determine whether your service guarantees are impacted.
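
(An aside on that MTBF point: a back-of-the-envelope Python sketch of how such figures translate into availability, using the standard steady-state formula A = MTBF / (MTBF + MTTR) and assuming the two feeds fail independently. The hour figures below are placeholders, not vendor data.)

    # Availability from MTBF/MTTR, and the effect of a redundant feed.
    def availability(mtbf_h: float, mttr_h: float) -> float:
        return mtbf_h / (mtbf_h + mttr_h)

    single = availability(mtbf_h=50_000, mttr_h=4)   # one feed through an ATS
    dual = 1 - (1 - single) ** 2                     # two independent feeds
    print(f"single feed: {single:.6f}  dual feed: {dual:.9f}")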
> 
> Dominic L. Hilsbos, MBA 
> Director – Information Technology 
> Perform Air International Inc.
> DHilsbos@xxxxxxxxxxxxxx 
> www.PerformAir.com
> 
> 
> -----Original Message-----
> From: Phil Regnauld [mailto:pr@xxxxx] 
> Sent: Friday, May 29, 2020 12:59 AM
> To: Hans van den Bogert
> Cc: ceph-users@xxxxxxx
> Subject:  Re: CEPH failure domain - power considerations
> 
> Hans van den Bogert (hansbogert) writes:
>> I would second that, there's no winning in this case for your 
>> requirements and single PSU nodes. If there were 3 feeds,  then yes; 
>> you could make an extra layer in your crushmap much like you would 
>> incorporate a rack topology in the crushmap.
> 
> 	I'm not fully up on coffee for today, so I haven't yet worked out why
> 	3 feeds would help ? To have a 'tie breaker' of sorts, with hosts spread
> 	across 3 rails ?
_______________________________________________
ceph-users mailing list -- ceph-users@xxxxxxx
To unsubscribe send an email to ceph-users-leave@xxxxxxx



