Thanks for your replies. I think the short version is "guaranteed":
Ceph will always either store 'size' copies of your data or set health
to a WARN and/or ERR state to let you know that it can't. I think
that's probably the most desirable answer.

--
Adam Carheden

On 04/14/2017 09:51 AM, David Turner wrote:
> If you have replica size 3, your failure domain is host, and you have
> 3 servers, you will NEVER have 2 copies of the data on 1 server. If
> you weight your OSDs poorly on one of your servers, then one of the
> drives will fill up to the full ratio in its config and stop
> receiving writes. You should always monitor your OSDs (see the
> command sketch below) so that you can fix the weights before an OSD
> becomes nearfull, and definitely so that an OSD never reaches the
> FULL setting and stops receiving writes. Note that when it stops
> receiving writes, it will block the write requests until it has space
> to fulfill them, and the cluster will be stuck.
>
> Also, to truly answer your question: if you have replica size 3, your
> failure domain is host, and you only have 2 servers in your cluster,
> you will only be storing 2 copies of the data and every single PG in
> your cluster will be degraded. Ceph will never breach the boundary of
> your failure domain.
>
> When dealing with 3-node clusters you want to be careful never to
> fill your cluster past a percentage where you can still lose a drive
> in one of your nodes. For example, if you have 3 nodes with 3x 4TB
> drives each and you lose a drive, the other 2 OSDs in that node need
> to be able to take the data from the dead drive without going over
> 80% (comfortably below the default nearfull ratio of 85%). So in this
> scenario you shouldn't fill the cluster past about 53% unless you
> plan to tell the cluster not to backfill until the dead OSD is
> replaced.
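Working that arithmetic through, since the pattern generalizes to
other layouts. A rough sketch in Python, taking David's 3x 4TB-per-host
example and his 80% post-failure ceiling as the assumptions:

    # 3 OSDs x 4 TB per host, failure domain = host, so a host's share
    # of the data must stay on that host's own OSDs.
    osd_tb = 4.0
    osds_per_host = 3
    ceiling = 0.80  # per-OSD ceiling to respect even after a failure

    # At cluster fill fraction f, each host holds 3 * 4 * f = 12f TB.
    # After one OSD dies, that data must fit on the remaining 2 OSDs:
    #   12f <= 2 * 4 * 0.80 = 6.4 TB  =>  f <= 0.533
    max_fill = (osds_per_host - 1) * osd_tb * ceiling / (osds_per_host * osd_tb)
    print("max safe fill: %.0f%%" % (100 * max_fill))  # -> 53%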
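And for the monitoring David mentions above, these are the usual
commands (correct as of Jewel, but double-check against your release;
osd.3 and the weight 3.5 are placeholder values):

    ceph osd df                        # per-OSD utilization and weights
    ceph health detail                 # lists any nearfull/full OSDs
    ceph osd crush reweight osd.3 3.5  # manually adjust one OSD's weight
    ceph osd reweight-by-utilization   # nudge overfull OSDs toward the mean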
> I would never recommend going into production with fewer failure
> domains than your replica size + 2. So with the default replica size
> of 3, you should go into production with at least 5 servers. That
> gives you enough failure domains to handle drive failures without the
> situation becoming critical.
>
> On Fri, Apr 14, 2017 at 11:25 AM Adam Carheden <carheden@xxxxxxxx> wrote:
>
> Is redundancy across failure domains guaranteed or best effort?
>
> Note: The best answer to the questions below is obviously to avoid
> the situation by properly weighting drives and not approaching the
> full ratio. I'm just curious how Ceph works.
>
> Hypothetical situation:
> Say you have 1 pool of size=3 and 3 servers, each with 2 OSDs. Say
> you weighted the OSDs poorly such that the OSDs on one server filled
> up but the OSDs on the other servers still had space. Ceph could
> still store 3 replicas of your data, but two of them would be on the
> same server. What happens?
>
> (select all that apply)
> a. [ ] Clients can still read data
> b. [ ] Clients can still write data
> c. [ ] health = HEALTH_WARN
> d. [ ] health = HEALTH_OK
> e. [ ] PGs are degraded
> f. [ ] ceph stores only two copies of data
> g. [ ] ceph stores 3 copies of data, two of which are on the same server
> h. [ ] something else?
>
> If the answer is "best effort" (a+b+d+g), how would you detect that
> the scenario is occurring?
>
> If the answer is "guaranteed" (f+e+c+...) and you lose a drive while
> in that scenario, is there any way to tell Ceph to temporarily store
> 2 copies on a single server just in case? I suspect the answer is to
> remove the host bucket from the crushmap, but that's a really bad
> idea because it would trigger a large rebalance, and the extra disk
> activity increases the likelihood of additional drive failures.
> Correct?
>
> --
> Adam Carheden
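For reference, the crushmap change wouldn't need to remove the host
bucket itself; relaxing the rule's failure domain from host to osd has
the same effect. A sketch of the edit (untested; the rule name and
numbers are the defaults from a stock map, so treat them as
placeholders):

    rule replicated_ruleset {
        ruleset 0
        type replicated
        min_size 1
        max_size 10
        step take default
        step chooseleaf firstn 0 type osd   # was: type host
        step emit
    }

The "type host" line in the stock rule is what enforces the boundary
David describes: CRUSH picks each replica from a different host bucket.
Switching it to "type osd" lets replicas share a host, but injecting
the edited map still triggers a large rebalance, so it doesn't avoid
the extra disk activity either.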