On 08/18/2014 02:20 PM, John Morris wrote:
>
> On 08/18/2014 01:49 PM, Sage Weil wrote:
>> On Mon, 18 Aug 2014, John Morris wrote:
>>>   rule by_bank {
>>>           ruleset 3
>>>           type replicated
>>>           min_size 3
>>>           max_size 4
>>>           step take default
>>>           step choose firstn 0 type bank
>>>           step choose firstn 0 type osd
>>>           step emit
>>>   }
>>
>> You probably want:
>>
>>           step choose firstn 0 type bank
>>           step choose firstn 1 type osd
>>
>> I.e., 3 (or 4) banks, and 1 osd in each.. not 3 banks with 3 osds in
>> each or 4 banks with 4 osds in each (for a total of 9 or 16 OSDs).
>
> Yes, thanks.  Funny, testing still works with the incorrect version,
> and the --show-utilization test results look similar.
>
> In re. to my last email about tunables, those can also be expressed in
> the human-readable map as such:
>
>   tunable choose_local_tries 0
>   tunable choose_local_fallback_tries 0
>   tunable choose_total_tries 50

Wrapping up this exercise:

This little script helps to see exactly where things go, and to show
what goes wrong with my original, incorrect map.
  #!/bin/bash

  echo "compiling crush map"
  crushtool -c /tmp/crush.txt -o /tmp/crush-new.bin \
      --enable-unsafe-tunables

  bad="$(crushtool -i /tmp/crush-new.bin --test \
             --show-bad-mappings 2>&1 | \
             wc -l)"
  echo "number of bad mappings:  $bad"

  # note:  asort() requires GNU awk
  distribution() {
      crushtool -i /tmp/crush-new.bin --test --show-statistics \
          --num-rep $1 2>&1 | \
          awk '/\[.*\]/ {
                   gsub("[][]","",$6);
                   split($6,a,",");
                   asort(a,d);
                   print d[1], d[2], d[3], d[4];
               }' | \
          sort | uniq -c
  }

  for i in 3 4; do
      echo "distribution of size=${i} replicas:"
      distribution $i
  done

(In an earlier version of the script, the compile step wrote
/tmp/crush-new.bin but the test steps read /tmp/crush2-new.bin; the
filenames are made consistent above.)

For --num-rep=4, the result looks like the following; it's easily seen
that two pairs of OSDs in the same bank are always picked, exactly what
we do NOT want (note OSDs 0+1 in bank0, 2+3 in bank1, etc.):

    173 0 1 2 3
    176 0 1 4 5
    184 0 1 6 7
    171 2 3 4 5
    156 2 3 6 7
    164 4 5 6 7

After Sage's correction, the result looks like the following, with one
OSD from each bank:

     70 0 2 4 6
     74 0 2 4 7
     65 0 2 5 6
     58 0 2 5 7
     60 0 3 4 6
     72 0 3 4 7
     80 0 3 5 6
     64 0 3 5 7
     48 1 2 4 6
     66 1 2 4 7
     72 1 2 5 6
     46 1 2 5 7
     73 1 3 4 6
     70 1 3 4 7
     51 1 3 5 6
     55 1 3 5 7

When replicas=3, the result is also correct.

So this is a bit of a hack, but it does seem to work to evenly
distribute 3-4 replicas across a bucket level with only two nodes.

Late into this exploration, it appears that if the 'bank' layer is
undesirable, this also works to distribute evenly across hosts:

          step choose firstn 0 type host
          step choose firstn 2 type osd

In conclusion, this example doesn't seem so far-fetched, since it's
easy to imagine wanting to distribute OSDs across two racks, or PDUs,
or data centers, where it's not so unreasonable to say a third is out
of the budget.

        John
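P.S.  For easy copy-paste, here is the by_bank rule with Sage's fix
folded in (same ruleset and size bounds as the original above):

  rule by_bank {
          ruleset 3
          type replicated
          min_size 3
          max_size 4
          step take default
          step choose firstn 0 type bank
          step choose firstn 1 type osd
          step emit
  }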
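P.P.S.  The asort() call in the script's awk program is a GNU awk
extension.  Below is a sketch of a portable equivalent of that
set-normalization step; the sample line is fabricated for illustration,
but field 6 is the bracketed OSD list that the script's /\[.*\]/
pattern keys on:

```shell
#!/bin/sh
# Fabricated sample of a crushtool --test --show-statistics mapping line:
sample='CRUSH rule 3 x 1023 [4,0,6,2]'

# Strip the brackets from field 6, then sort the OSD set numerically
# without relying on gawk's asort():
echo "$sample" |
    awk '/\[.*\]/ { gsub("[][]","",$6); print $6 }' |  # -> 4,0,6,2
    tr ',' '\n' | sort -n | paste -sd' ' -             # -> 0 2 4 6
```

Piping many such lines through this and then into `sort | uniq -c`
reproduces the distribution counts shown above.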