CRUSH question - failing to rebalance after failure test

Hi all,

I think I have a subtle problem either with my understanding of CRUSH or
with the actual implementation of my CRUSH map.

Consider the following CRUSH map: http://paste.debian.net/hidden/085b3f20/

I have 3 chassis with 7 nodes each (6 of them OSD nodes). Size is 3 and
min_size is 2 on all pools.
If I remove one chassis from the cluster (by pulling its network plugs,
in this case), my naive first thought was that the cluster would recover
fully. But I think that cannot happen, since CRUSH will never find a
placement that satisfies the rule's condition "three replicas on
different chassis" when only two chassis are in operation.
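
For reference, the placement rule in my map is roughly shaped like the
following (paraphrased from the paste above, so names and numbers may
differ slightly):

    rule replicated_ruleset {
            ruleset 0
            type replicated
            min_size 1
            max_size 10
            step take default
            # pick one leaf (OSD) per distinct chassis
            step chooseleaf firstn 0 type chassis
            step emit
    }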

However, after setting "size" to 2 on all pools, the cluster recovered
from 33.3% degraded to 20.5% degraded, and it has been sitting there
ever since, making no further progress.
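
For what it's worth, I believe the placement can also be simulated
offline with crushtool (a sketch; I'm assuming the replicated rule is
rule 0 in the map):

    # extract the compiled CRUSH map from the cluster
    ceph osd getcrushmap -o crushmap
    # simulate placement of sample PGs with 2 replicas and print mappings
    crushtool -i crushmap --test --rule 0 --num-rep 2 --show-mappings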

This is a lab cluster; I'm really only trying to understand what's
happening. Can someone clear that up? I think I'm blind...

Regards,

--ck
_______________________________________________
ceph-users mailing list
ceph-users@xxxxxxxxxxxxxx
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


