Fixed all active+remapped PGs stuck forever (but I have no clue why)

On 08/18/2014 02:20 PM, John Morris wrote:
>
>
> On 08/18/2014 01:49 PM, Sage Weil wrote:
>> On Mon, 18 Aug 2014, John Morris wrote:
>>> rule by_bank {
>>>          ruleset 3
>>>          type replicated
>>>          min_size 3
>>>          max_size 4
>>>          step take default
>>>          step choose firstn 0 type bank
>>>          step choose firstn 0 type osd
>>>          step emit
>>> }
>>
>> You probably want:
>>
>>           step choose firstn 0 type bank
>>           step choose firstn 1 type osd
>>
>> I.e., 3 (or 4) banks, and 1 osd in each.. not 3 banks with 3 osds in each
>> or 4 banks with 4 osds in each (for a total of 9 or 16 OSDs).
>
> Yes, thanks.  Funny, testing still works with the incorrect version, and
> the --show-utilization test results look similar.
>
> In reference to my last email about tunables, those can also be
> expressed in the human-readable map like so:
>
> tunable choose_local_tries 0
> tunable choose_local_fallback_tries 0
> tunable choose_total_tries 50
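
For reference, here is the full by_bank rule with Sage's fix folded in 
(a sketch, re-typed rather than pasted from the live map, so treat it as 
illustrative):

rule by_bank {
         ruleset 3
         type replicated
         min_size 3
         max_size 4
         step take default
         step choose firstn 0 type bank
         step choose firstn 1 type osd
         step emit
}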

Wrapping up this exercise:

This little script helps show exactly where replicas end up, and what 
goes wrong with my original, incorrect map.

#!/bin/bash
# Compile the edited text map, count bad mappings, and show how the
# resulting OSD sets are distributed for 3 and 4 replicas.
echo "compiling crush map"
crushtool -c /tmp/crush.txt -o /tmp/crush-new.bin \
     --enable-unsafe-tunables
bad="$(crushtool -i /tmp/crush-new.bin --test \
         --show-bad-mappings 2>&1 | \
     wc -l)"
echo "number of bad mappings:  $bad"

distribution() {
     # Extract the [osd,...] set from each test mapping, sort the OSD ids
     # (asort() needs gawk), and count how often each set occurs.
     crushtool -i /tmp/crush-new.bin --test --show-statistics \
         --num-rep $1 2>&1 | \
         awk '/\[.*\]/ {
             gsub("[][]","",$6);
             split($6,a,",");
             asort(a,d);
             print d[1], d[2], d[3], d[4]; }' | \
         sort | uniq -c
}
for i in 3 4; do
     echo "distribution of size=${i} replicas:"
     distribution $i
done
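
For completeness, the surrounding workflow of pulling the map out of the 
cluster and pushing the corrected one back looked roughly like this 
(reconstructed from memory, so double-check before running it against a 
live cluster):

# extract and decompile the current CRUSH map for editing
ceph osd getcrushmap -o /tmp/crush.bin
crushtool -d /tmp/crush.bin -o /tmp/crush.txt
# ... edit /tmp/crush.txt, recompile and test as above ...
# inject the corrected, compiled map back into the cluster
ceph osd setcrushmap -i /tmp/crush-new.bin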


For --num-rep=4, the result looks like the following; it's easy to see 
that two pairs of OSDs from the same bank are always picked, exactly what 
we do NOT want (note OSDs 0+1 in bank0, 2+3 in bank1, etc.):

     173 0 1 2 3
     176 0 1 4 5
     184 0 1 6 7
     171 2 3 4 5
     156 2 3 6 7
     164 4 5 6 7

After Sage's correction, the result looks like the following, with one 
OSD from each bank:

      70 0 2 4 6
      74 0 2 4 7
      65 0 2 5 6
      58 0 2 5 7
      60 0 3 4 6
      72 0 3 4 7
      80 0 3 5 6
      64 0 3 5 7
      48 1 2 4 6
      66 1 2 4 7
      72 1 2 5 6
      46 1 2 5 7
      73 1 3 4 6
      70 1 3 4 7
      51 1 3 5 6
      55 1 3 5 7

When replicas=3, the result is also correct.

So this is a bit of a hack, but it does seem to distribute 3-4 replicas 
evenly across a bucket level with only two nodes.  Late in this 
exploration, it also became apparent that if the 'bank' layer is 
undesirable, this distributes evenly across hosts as well:

         step choose firstn 0 type host
         step choose firstn 2 type osd
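
i.e., as a complete rule, something along these lines (just a sketch; 
the rule name and ruleset number here are made up):

rule by_host {
         ruleset 4
         type replicated
         min_size 3
         max_size 4
         step take default
         step choose firstn 0 type host
         step choose firstn 2 type osd
         step emit
}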

In conclusion, this example doesn't seem so far-fetched: it's easy to 
imagine wanting to distribute replicas across two racks, PDUs, or data 
centers, where it's not unreasonable to say a third is out of the budget.

	John

