Re: Is redundancy across failure domains guaranteed or best effort?

The status will be a WARN in this case.
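
For reference, the usual way to see this on a live cluster is with the
health and utilization commands (the exact output format varies a bit
between releases):

    # overall health, plus which PGs are degraded/remapped and why
    ceph health detail
    ceph -s

    # per-OSD and per-host utilization, to spot a server filling up
    ceph osd df tree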

On Fri, Apr 14, 2017 at 12:01 PM Adam Carheden <carheden@xxxxxxxx> wrote:
Thanks for your replies.

I think the short version is "guaranteed": Ceph will always either
store 'size' copies of your data or set health to a WARN and/or ERR
state to let you know that it can't. I think that's probably the most
desirable answer.

--
Adam Carheden

On 04/14/2017 09:51 AM, David Turner wrote:
> If you have Replica size 3, your failure domain is host, and you have 3
> servers... you will NEVER have 2 copies of the data on 1 server.  If you
> weight your OSDs poorly on one of your servers, then one of the drives
> will fill up to the full ratio in its config and stop receiving writes.
> You should always monitor your OSDs so that you can fix the weights
> before an OSD becomes nearfull and definitely so that the OSD never
> reaches the FULL setting and stops receiving writes.  Note that when it
> stops receiving writes, it will block the write requests until it has
> space to fulfill them, and the cluster will be stuck.
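>
> A minimal sketch of the knobs involved, assuming a pre-Luminous release
> (option names and defaults can differ between Ceph versions, and the
> OSD id and weight below are made up for illustration):
>
>     # ceph.conf -- ratios at which the cluster warns and then refuses writes
>     [global]
>     mon osd nearfull ratio = 0.85
>     mon osd full ratio     = 0.95
>
>     # fix a lopsided OSD by adjusting its CRUSH weight
>     # (weight is conventionally the disk size in TB)
>     ceph osd crush reweight osd.3 3.64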
>
> Also, to truly answer your question: if you have Replica size 3, your
> failure domain is host, and you only have 2 servers in your cluster...
> You will only be storing 2 copies of data and every single PG in your
> cluster will be degraded.  Ceph will never breach the boundary of your
> failure domain.
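>
> The piece of CRUSH that enforces this is the "chooseleaf ... type host"
> step in the pool's rule. A rough sketch of what the stock replicated
> rule looks like when you decompile the CRUSH map with crushtool (names
> and ids will differ per cluster):
>
>     rule replicated_ruleset {
>         ruleset 0
>         type replicated
>         min_size 1
>         max_size 10
>         step take default
>         # each replica comes from a different host bucket
>         step chooseleaf firstn 0 type host
>         step emit
>     }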
>
> When dealing with 3 node clusters you want to be careful never to fill
> your cluster past the point where it can still absorb losing a drive in
> one of your nodes.  For example, if you have 3 nodes with 3x 4TB drives
> in each and you lose a drive... the other 2 OSDs in that node need to
> be able to take the data from the dead drive without going over the
> nearfull setting (80% in this example).  So in this scenario you
> shouldn't fill the cluster to more than about 53% unless you're
> planning to tell the cluster not to backfill until the dead OSD is
> replaced.
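>
> Spelling that arithmetic out for the 3-node, 3x 4TB example (assuming
> data is spread evenly and using the 80% figure above):
>
>     raw capacity per node:                 3 x 4 TB = 12 TB
>     capacity left after losing one OSD:    2 x 4 TB =  8 TB
>     usable before those OSDs hit nearfull: 8 TB x 0.80 = 6.4 TB
>     safe fill level before the failure:    6.4 / 12 = ~53%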
>
> I would never recommend going into production with fewer failure
> domains than your replica size plus two.  So if you have
> the default Replica size of 3, then you should go into production with
> at least 5 servers.  This gives you enough failure domains to be able to
> handle drive failures without the situation being critical.
>
> On Fri, Apr 14, 2017 at 11:25 AM Adam Carheden <carheden@xxxxxxxx> wrote:
>
>     Is redundancy across failure domains guaranteed or best effort?
>
>     Note: The best answer to the questions below is obviously to avoid the
>     situation by properly weighting drives and not approaching the full ratio.
>     I'm just curious how Ceph works.
>
>     Hypothetical situation:
>     Say you have 1 pool of size=3 and 3 servers, each with 2 OSDs. Say you
>     weighted the OSDs poorly such that the OSDs on one server filled up but
>     the OSDs on the others still had space. Ceph could still store 3
>     replicas of your data, but two of them would be on the same server. What
>     happens?
>
>     (select all that apply)
>     a.[ ] Clients can still read data
>     b.[ ] Clients can still write data
>     c.[ ] health = HEALTH_WARN
>     d.[ ] health = HEALTH_OK
>     e.[ ] PGs are degraded
>     f.[ ] ceph stores only two copies of data
>     g.[ ] ceph stores 3 copies of data, two of which are on the same server
>     h.[ ] something else?
>
>     If the answer is "best effort" (a+b+d+g), how would you detect if that
>     scenario is occurring?
>
>     If the answer is "guaranteed" (f+e+c+...) and you loose a drive while in
>     that scenario, is there any way to tell CEPH to store temporarily store
>     2 copies on a single server just in case? I suspect the answer is to
>     remove host bucket from the crushmap but that that's a really bad idea
>     because it would trigger a rebuild and the extra disk activity increases
>     the likelihood of additional drive failures, correct?
>
>     --
>     Adam Carheden
_______________________________________________
ceph-users mailing list
ceph-users@xxxxxxxxxxxxxx
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
