All very true and worth considering, but I feel compelled to mention the strategy of setting mon_osd_down_out_subtree_limit carefully to prevent automatic rebalancing (a quick sketch of the setting is at the bottom of this message). *If* the loss of a failure domain is temporary, i.e. something you can fix fairly quickly, it can be preferable not to start that avalanche of recovery, both to avoid contention with client workloads and to avoid the fill-level problem that David describes. If the loss of the failure domain can't be corrected quickly, then one is still in a quandary as to whether to shift the capacity onto the surviving failure domains or accept the risk of reduced redundancy while the problem is worked.

That said, I've seen situations where the OSDs in a failure domain weren't reported down in close enough temporal proximity, so the subtree limit didn't kick in.

In my current situation we're already planning to exploit the half-rack strategy you describe for EC clusters; it improves the failure-domain situation without monopolizing as many DC racks.

— aad

> The problem with having 3 failure domains with replica 3 is that if you
> lose a complete failure domain, then you have nowhere for the 3rd replica
> to go. If you have 4 failure domains with replica 3 and you lose an entire
> failure domain, then you over fill the remaining 3 failure domains and can
> only really use 55% of your cluster capacity. If you have 5 failure
> domains, then you start normalizing and losing a failure domain doesn't
> impact as severely. The more failure domains you get to, the less it
> affects you when you lose one.
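For anyone who wants to try this, here is a minimal sketch of the knob discussed above, not verbatim from any particular deployment. The option name, accepted values, and injectargs syntax can vary by release, so check the docs for your version; "rack" below is just an assumed stand-in for whatever bucket type your CRUSH rule uses as its failure domain.

    # In ceph.conf on the monitors, persist the limit at the failure-domain
    # bucket type so the mons won't auto-mark out an entire rack's OSDs:
    [mon]
        mon_osd_down_out_subtree_limit = rack

    # Or push it into running mons without a restart:
    ceph tell mon.* injectargs '--mon_osd_down_out_subtree_limit=rack'

    # Caveat, as noted above: if the OSDs in the subtree aren't all reported
    # down close together in time, the limit may not take effect.

    # During a known-temporary outage you can also freeze auto-out entirely:
    ceph osd set noout
    # ... repair the failure domain, wait for things to settle ...
    ceph osd unset noout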