On 08/14/2014 02:35 AM, Christian Balzer wrote:
>
> The default (firefly, but previous ones are functionally identical) crush
> map has:
> ---
> # rules
> rule replicated_ruleset {
>         ruleset 0
>         type replicated
>         min_size 1
>         max_size 10
>         step take default
>         step chooseleaf firstn 0 type host
>         step emit
> }
> ---
>
> The type host states that there will be not more than one replica per host
> (node), so with size=3 you will need at least 3 hosts to choose from.
> If you were to change this to type OSD, all 3 replicas could wind up on
> the same host, not really a good idea.

Ah, this is a great clue.  (On my cluster, the default rule contains
'step choose firstn 0 type osd', and thus has the problem you hint at
here.)

So I played with a new rule set using the bucket types 'root', 'rack',
'host', 'bank' and 'osd', of which 'rack' and 'host' go unused by the
rule.

The 'bank' bucket:  the OSD nodes each contain two 'banks' of disks,
each bank with its own disk controller channel, power supply cable and
SSD.  Thus, 'bank' actually does represent a real failure domain.  More
importantly, it provides a bucket level below 'host' (and above 'osd')
with enough failure domains for 3-4 replicas.

Here's the rule:

rule by_bank {
        ruleset 3
        type replicated
        min_size 3
        max_size 4
        step take default
        step choose firstn 0 type bank
        step choose firstn 0 type osd
        step emit
}

If the OP (sorry, Craig, you do have a name ;) wants to play with CRUSH
map rules, here's the quick and dirty of what I did:

# get the current 'orig' CRUSH map, decompile and edit; see:
# http://ceph.com/docs/master/rados/operations/crush-map/#editing-a-crush-map
ceph osd getcrushmap -o /tmp/crush-orig.bin
crushtool -d /tmp/crush-orig.bin -o /tmp/crush.txt
$EDITOR /tmp/crush.txt

# edit the crush map with your fave editor; see:
# http://ceph.com/docs/master/rados/operations/crush-map
#
# in my case, I added the bank type:
type 0 osd
type 1 bank
type 2 host
type 3 rack
type 4 root

# the banks (repeat as applicable):
bank bank0 {
        id -6
        alg straw
        hash 0
        item osd.0 weight 1.000
        item osd.1 weight 1.000
}
bank bank1 {
        id -7
        alg straw
        hash 0
        item osd.2 weight 1.000
        item osd.3 weight 1.000
}

# updated the hosts (repeat as applicable):
host host0 {
        id -4           # do not change unnecessarily
        # weight 3.000
        alg straw
        hash 0  # rjenkins1
        item bank0 weight 2.000
        item bank1 weight 2.000
}

# and added the rule:
rule by_bank {
        ruleset 3
        type replicated
        min_size 3
        max_size 4
        step take default
        step choose firstn 0 type bank
        step choose firstn 0 type osd
        step emit
}

# compile the crush map:
crushtool -c /tmp/crush.txt -o /tmp/crush-new.bin

# and run some tests; the replica sizes tested come from
# 'min_size' and 'max_size' in the above rule; see:
# http://ceph.com/docs/master/man/8/crushtool/#running-tests-with-test
#
# show sample PG->OSD maps:
crushtool -i /tmp/crush-new.bin --test --show-statistics
# show bad mappings; if the CRUSH map is correct,
# this should be empty:
crushtool -i /tmp/crush-new.bin --test --show-bad-mappings
# show per-OSD pg utilization:
crushtool -i /tmp/crush-new.bin --test --show-utilization

> You might finagle something like that (again the rule splits on hosts) by
> having multiple "hosts" on one physical machine, but therein lies madness.
Well, the bucket names can be changed, as above, and Sage hints at
doing something like this here:

http://wiki.ceph.com/Planning/Blueprints/Dumpling/extend_crush_rule_language

(And IIUC he also proposes something to implement my original
intention:  distribute four replicas, two on each of two racks, and
don't put two replicas on the same host within a rack; this is more
easily generalized than the above funky configuration.)

        John
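P.S.  For what it's worth, I *think* that particular layout (two racks,
two replicas per rack, no two replicas on the same host) may already be
expressible with the current rule language.  An untested sketch,
assuming the map actually has populated 'rack' and 'host' buckets under
'default' (unlike mine), and that the ruleset id 4 and the name
'two_per_rack' are free to use:

rule two_per_rack {
        ruleset 4                           # assumed unused; adjust
        type replicated
        min_size 4
        max_size 4
        step take default
        step choose firstn 2 type rack      # pick two racks
        step chooseleaf firstn 2 type host  # two hosts per rack, one osd each
        step emit
}

The same crushtool --test invocations above would show whether the
mappings actually come out that way.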