Re: Is redundancy across failure domains guaranteed or best effort?

If you have Replica size 3, your failure domain is host, and you have 3 servers, you will NEVER have 2 copies of the data on 1 server.  If you weight your OSDs poorly on one of your servers, then one of the drives will fill up to the full ratio in its config and stop receiving writes.  You should always monitor your OSDs so that you can fix the weights before an OSD becomes nearfull, and definitely so that an OSD never reaches the full ratio and stops receiving writes.  Note that when an OSD stops receiving writes, it blocks the write requests until it has space to fulfill them, and the cluster will be stuck.
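If it helps to see that threshold logic spelled out, here is a rough Python sketch of it.  The 0.85/0.95 values are just stand-ins for whatever mon_osd_nearfull_ratio and mon_osd_full_ratio are actually set to on your cluster, so treat them as placeholders rather than gospel.

# Rough sketch of the per-OSD threshold behaviour described above.
# The default ratios below are placeholders for mon_osd_nearfull_ratio /
# mon_osd_full_ratio; use the values from your own cluster config.

def osd_state(used_bytes, total_bytes, nearfull=0.85, full=0.95):
    usage = used_bytes / total_bytes
    if usage >= full:
        return "full"      # writes to PGs on this OSD block here
    if usage >= nearfull:
        return "nearfull"  # HEALTH_WARN territory; fix the weights now
    return "ok"

print(osd_state(3.5e12, 4.0e12))  # 87.5% used -> 'nearfull'

In practice you would just watch ceph osd df and ceph health detail rather than compute this yourself.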

Also, to truly answer your question: if you have Replica size 3, your failure domain is host, and you only have 2 servers in your cluster, you will only be storing 2 copies of the data and every single PG in your cluster will be degraded.  Ceph will never breach the boundary of your failure domain.

When dealing with 3-node clusters you want to be careful to never fill your cluster past the point where you can still afford to lose a drive in one of your nodes.  For example, if you have 3 nodes with 3x 4TB drives in each and you lose a drive, the other 2 OSDs in that node need to be able to take the data from the dead drive without going over 80% (staying safely under the nearfull setting).  So in this scenario you shouldn't fill the cluster to more than about 53% unless you're planning to tell the cluster not to backfill until the dead OSD is replaced.
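Just to spell out where that ~53% comes from, here is the arithmetic from the example above, using the same 3x 4TB layout and the 80% target:

# Where the ~53% figure comes from, using the example numbers above.
drives_per_node = 3
drive_size_tb = 4.0
target_ratio = 0.80   # the "don't go past this" threshold used above

node_capacity = drives_per_node * drive_size_tb      # 12 TB per node
surviving = (drives_per_node - 1) * drive_size_tb    # 8 TB after one drive dies

# The data on the node must fit on the surviving drives at <= target_ratio:
#   node_capacity * fill <= surviving * target_ratio
max_fill = surviving * target_ratio / node_capacity
print(f"max safe fill: {max_fill:.0%}")              # -> 53%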

I will never recommend that anyone go into production with fewer failure domains than their replica size + 2.  So if you have the default Replica size of 3, then you should go into production with at least 5 servers.  This gives you enough failure domains to be able to handle drive failures without the situation becoming critical.
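Or, as a trivial rule-of-thumb calculation (the +2 margin is just my recommendation above, not something Ceph enforces):

# Rule of thumb from above: failure domains >= replica size + 2.
def min_hosts(replica_size, spare=2):
    return replica_size + spare

print(min_hosts(3))  # -> 5 hosts for the default size=3 pool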

On Fri, Apr 14, 2017 at 11:25 AM Adam Carheden <carheden@xxxxxxxx> wrote:
Is redundancy across failure domains guaranteed or best effort?

Note: The best answer to the questions below is obviously to avoid the
situation by properly weighting drives and not approaching the full ratio.
I'm just curious how CEPH works.

Hypothetical situation:
Say you have 1 pool of size=3 and 3 servers, each with 2 OSDs. Say you
weighted the OSDs poorly such that the OSDs on one server filled up but
the OSDs on the others still had space. CEPH could still store 3
replicas of your data, but two of them would be on the same server. What
happens?

(select all that apply)
a.[ ] Clients can still read data
b.[ ] Clients can still write data
c.[ ] health = HEALTH_WARN
d.[ ] health = HEALTH_OK
e.[ ] PGs are degraded
f.[ ] ceph stores only two copies of data
g.[ ] ceph stores 3 copies of data, two of which are on the same server
h.[ ] something else?

If the answer is "best effort" (a+b+d+g), how would you detect if that
scenario is occurring?

If the answer is "guaranteed" (f+e+c+...) and you lose a drive while in
that scenario, is there any way to tell CEPH to temporarily store 2
copies on a single server just in case? I suspect the answer is to
remove the host bucket from the crushmap, but that's a really bad idea
because it would trigger a rebuild and the extra disk activity increases
the likelihood of additional drive failures, correct?

--
Adam Carheden
_______________________________________________
ceph-users mailing list
ceph-users@xxxxxxxxxxxxxx
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
