Hi,

Apologies that this is a fairly long post, but hopefully the questions are all related (or may even turn out to be invalid!).

Does Ceph allow writes to proceed if it's not possible to satisfy the rules for replica placement across failure domains, as specified in the CRUSH map? For example, if my CRUSH map says to place one replica on each of 2 hosts, and no devices are up on one of the hosts, what will happen?

From tests on a small cluster, my finding is that Ceph will allow writes to proceed in this case, even when there are fewer OSDs "in" the cluster than the replication size. If this is the expected behaviour of Ceph, then it seems to prefer write-availability over read-availability (in this case my data is stored on only 1 OSD, so that OSD is a single point of failure). Is there any way to change this trade-off, e.g. as you can in Cassandra with its write quorums?

From reading the paper which details the CRUSH replica placement algorithm, I understand the following:

- The CRUSH algorithm loops over the n replicas, for each one descending from the items in the current set (buckets) until it finds a device.
- "CRUSH may reject and reselect items using a modified input for three different reasons: if an item has already been selected in the current set (a collision ...), if a device is failed, or if a device is overloaded."

I'm not clear on what will happen if these constraints cannot be met, say in the case mentioned above. Does Ceph store the object once only, not meeting the replica size, or does it somehow store the object twice on the same OSD? The latter would violate the second point above, unless the word "reselect" allows for exactly that.

Now for the details, in case they are needed (no need to read on if my questions are already answered) ...

I'm running some tests with 2 hosts, each running only 1 OSD. There is one monitor running on one of the hosts. I have the default CRUSH map and am performing writes via RGW. My understanding of the following default 'data' CRUSH rule is that Ceph will place 1 copy on each of the 2 OSDs:

device 0 osd.0
device 1 osd.1

type 0 osd
type 1 host

host squeezeceph1 {
	id -2		# do not change unnecessarily
	# weight 1.000
	alg straw
	hash 0	# rjenkins1
	item osd.0 weight 1.000
}
host squeezeceph2 {
	id -4		# do not change unnecessarily
	# weight 1.000
	alg straw
	hash 0	# rjenkins1
	item osd.1 weight 1.000
}

rule data {
	ruleset 0
	type replicated
	min_size 1
	max_size 10
	step take default
	step chooseleaf firstn 0 type host
	step emit
}

I'm using RGW, whose pools have a replication size of 2:

root@squeezeceph1:~/crushmaps# ceph osd dump | grep 'rep size'
...
pool 3 '.rgw' rep size 2 crush_ruleset 0 object_hash rjenkins pg_num 8 pgp_num 8 last_change 9 owner 18446744073709551615
...

All seems well when the cluster is healthy - any particular PG in my RGW pool resolves to the two OSD IDs:

root@squeezeceph1:~/crushmaps# ceph osd map .rgw someobjectid
osdmap e52 pool '.rgw' (3) object 'someobjectid' -> pg 3.f31055a1 (3.1) -> up [0,1] acting [0,1]

Now I bring down osd.1, such that PGs are reported as degraded:

root@squeezeceph1:~/crushmaps# ceph health
HEALTH_WARN 632 pgs degraded; 604 pgs stuck unclean; recovery 100/200 degraded (50.000%); 1/2 in osds are down

Now the same PG is mapped as follows:

root@squeezeceph1:~/crushmaps# ceph osd map .rgw someobjectid
osdmap e54 pool '.rgw' (3) object 'someobjectid' -> pg 3.f31055a1 (3.1) -> up [0] acting [0]

This shows only osd.0 occupying the PG. At this point osd.1 is down but still 'in' (as reported by "ceph osd tree").
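(As a sanity check, I assume I could also query that PG directly; this is just a sketch of what I'd run rather than captured output, and the exact fields no doubt vary by version:

root@squeezeceph1:~/crushmaps# ceph pg 3.1 query

My expectation is that it would report the PG as active+degraded, with an acting set containing only osd.0.)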
At this point writes via RGW succeed, which is expected given the statement in the documentation: "If an OSD goes down, Ceph marks each placement group assigned to the OSD as degraded ... a client can still write a new object to a degraded placement group if it is active."

After the 5 minutes of "mon osd down out interval" have passed, osd.1 is marked out of the cluster (as reported by "ceph osd dump"); I could close this window as much as possible by reducing that interval from the default 300 seconds to 0. The command "ceph osd map .rgw someobjectid" from above returns the same result (the PG is mapped to only 1 OSD, which is fair enough). At this point writes still succeed; for my use-case they'd ideally fail (see the P.S. below for a sketch of the behaviour I mean). Is the placement group still considered 'active' at this point, even with osd.1 out?

Many thanks!

Ben
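P.S. In case it clarifies the behaviour I'm after, this is roughly the knob I was hoping existed (just a sketch of my assumption; I believe "min_size" is a per-pool setting controlling how many replicas must be up before a PG will accept I/O, but please correct me if that's not what it does):

root@squeezeceph1:~/crushmaps# ceph osd pool set .rgw min_size 2

i.e. with "rep size" 2 and min_size 2, writes to the .rgw pool would block or fail while only one of the two OSDs is available, rather than being acknowledged with a single copy. Presumably "osd pool default min size" in ceph.conf would do the same for newly created pools.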