Thanks for your replies. I think the short version is "guaranteed":
Ceph will always either store 'size' copies of your data or set health
to a WARN and/or ERR state to let you know that it can't. I think
that's probably the most desirable answer.

--
Adam Carheden

On 04/14/2017 09:51 AM, David Turner wrote:
> If you have replica size 3, your failure domain is host, and you have
> 3 servers, you will NEVER have 2 copies of the data on 1 server. If
> you weight your OSDs poorly on one of your servers, then one of the
> drives will fill up to the full ratio in its config and stop
> receiving writes. You should always monitor your OSDs (see the
> command sketch below) so that you can fix the weights before an OSD
> becomes nearfull, and definitely so that an OSD never reaches the
> FULL setting and stops receiving writes. Note that when it stops
> receiving writes, it will block the write requests until it has space
> to fulfill them, and the cluster will be stuck.
>
> Also, to truly answer your question: if you have replica size 3, your
> failure domain is host, and you only have 2 servers in your cluster,
> you will only be storing 2 copies of the data and every single PG in
> your cluster will be degraded. Ceph will never breach the boundary of
> your failure domain.
>
> When dealing with 3-node clusters you want to be careful never to
> fill your cluster past a percentage where you can still lose a drive
> in one of your nodes. For example, if you have 3 nodes with 3x 4TB
> drives each and you lose a drive, the other 2 OSDs in that node need
> to be able to take the data from the dead drive without going over
> 80% (comfortably below the default nearfull ratio of 85%). So in this
> scenario you shouldn't fill the cluster past about 53% unless you
> plan to tell the cluster not to backfill until the dead OSD is
> replaced.
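Working that arithmetic through, since the pattern generalizes to
other layouts. A rough sketch in Python, taking David's 3x 4TB-per-host
example and his 80% post-failure ceiling as the assumptions:

    # 3 OSDs x 4 TB per host, failure domain = host, so a host's share
    # of the data must stay on that host's own OSDs.
    osd_tb = 4.0
    osds_per_host = 3
    ceiling = 0.80  # per-OSD ceiling to respect even after a failure

    # At cluster fill fraction f, each host holds 3 * 4 * f = 12f TB.
    # After one OSD dies, that data must fit on the remaining 2 OSDs:
    #   12f <= 2 * 4 * 0.80 = 6.4 TB  =>  f <= 0.533
    max_fill = (osds_per_host - 1) * osd_tb * ceiling / (osds_per_host * osd_tb)
    print("max safe fill: %.0f%%" % (100 * max_fill))  # -> 53%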
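And for the monitoring David mentions above, these are the usual
commands (correct as of Jewel, but double-check against your release;
osd.3 and the weight 3.5 are placeholder values):

    ceph osd df                        # per-OSD utilization and weights
    ceph health detail                 # lists any nearfull/full OSDs
    ceph osd crush reweight osd.3 3.5  # manually adjust one OSD's weight
    ceph osd reweight-by-utilization   # nudge overfull OSDs toward the mean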
> I would never recommend going into production with fewer failure
> domains than your replica size + 2. So with the default replica size
> of 3, you should go into production with at least 5 servers. That
> gives you enough failure domains to handle drive failures without the
> situation becoming critical.
>
> On Fri, Apr 14, 2017 at 11:25 AM Adam Carheden <carheden@xxxxxxxx> wrote:
>
> Is redundancy across failure domains guaranteed or best effort?
>
> Note: The best answer to the questions below is obviously to avoid
> the situation by properly weighting drives and not approaching the
> full ratio. I'm just curious how Ceph works.
>
> Hypothetical situation:
> Say you have 1 pool of size=3 and 3 servers, each with 2 OSDs. Say
> you weighted the OSDs poorly such that the OSDs on one server filled
> up but the OSDs on the other servers still had space. Ceph could
> still store 3 replicas of your data, but two of them would be on the
> same server. What happens?
>
> (select all that apply)
> a. [ ] Clients can still read data
> b. [ ] Clients can still write data
> c. [ ] health = HEALTH_WARN
> d. [ ] health = HEALTH_OK
> e. [ ] PGs are degraded
> f. [ ] ceph stores only two copies of data
> g. [ ] ceph stores 3 copies of data, two of which are on the same server
> h. [ ] something else?
>
> If the answer is "best effort" (a+b+d+g), how would you detect that
> the scenario is occurring?
>
> If the answer is "guaranteed" (f+e+c+...) and you lose a drive while
> in that scenario, is there any way to tell Ceph to temporarily store
> 2 copies on a single server just in case? I suspect the answer is to
> remove the host bucket from the crushmap, but that's a really bad
> idea because it would trigger a large rebalance, and the extra disk
> activity increases the likelihood of additional drive failures.
> Correct?
>
> --
> Adam Carheden
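For reference, the crushmap change wouldn't need to remove the host
bucket itself; relaxing the rule's failure domain from host to osd has
the same effect. A sketch of the edit (untested; the rule name and
numbers are the defaults from a stock map, so treat them as
placeholders):

    rule replicated_ruleset {
        ruleset 0
        type replicated
        min_size 1
        max_size 10
        step take default
        step chooseleaf firstn 0 type osd   # was: type host
        step emit
    }

The "type host" line in the stock rule is what enforces the boundary
David describes: CRUSH picks each replica from a different host bucket.
Switching it to "type osd" lets replicas share a host, but injecting
the edited map still triggers a large rebalance, so it doesn't avoid
the extra disk activity either.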