Fixed all active+remapped PGs stuck forever (but I have no clue why)

On 08/18/2014 12:13 PM, John Morris wrote:
>
> On 08/14/2014 02:35 AM, Christian Balzer wrote:
>>
>> The default (firefly, but previous ones are functionally identical) crush
>> map has:
>> ---
>> # rules
>> rule replicated_ruleset {
>>          ruleset 0
>>          type replicated
>>          min_size 1
>>          max_size 10
>>          step take default
>>          step chooseleaf firstn 0 type host
>>          step emit
>> }
>> ---
>>
>> The type host states that there will be no more than one replica per
>> host (node), so with size=3 you will need at least 3 hosts to choose
>> from. If you were to change this to type osd, all 3 replicas could
>> wind up on the same host, not really a good idea.
>
> Ah, this is a great clue.  (On my cluster, the default rule contains
> 'step choose firstn 0 type osd', and thus has the problem you hint at
> here.)
>
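For reference, the default rule quoted above with that per-osd step
swapped in would look roughly like this (an untested sketch):

rule replicated_ruleset {
         ruleset 0
         type replicated
         min_size 1
         max_size 10
         step take default
         step choose firstn 0 type osd
         step emit
}

i.e. nothing in the rule keeps two replicas off the same host.
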
> So I played with a new rule set with the bucket types 'root', 'rack',
> 'host', 'bank' and 'osd', of which 'rack' and 'host' are unused by the
> rule.  The 'bank' bucket type reflects how the OSD nodes are built:
> each node contains two 'banks' of disks, each with a separate disk
> controller channel, a separate power supply cable, and a separate SSD.
> Thus, 'bank' actually does represent a real failure domain.  More
> importantly, it provides a bucket level below 'host' with enough
> failure domains for 3-4 replicas.  Here's the rule:
>
> rule by_bank {
>          ruleset 3
>          type replicated
>          min_size 3
>          max_size 4
>          step take default
>          step choose firstn 0 type bank
>          step choose firstn 0 type osd
>          step emit
> }

Ah, with the 'legacy' tunables, using a single 'chooseleaf' step in
place of the two 'choose' steps in the above rule generates bad
mappings.  But after injecting the tunables recommended at the link
below into the map, the rule can be shortened to the following:

rule by_bank {
         ruleset 3
         type replicated
         min_size 3
         max_size 4
         step take default
         step chooseleaf firstn 0 type bank
         step emit
}

See this link:

http://ceph.com/docs/master/rados/operations/crush-map/#tuning-crush-the-hard-way

In the quoted steps below, after compiling the new CRUSH map but before
running the tests, inject the tunables into the binary map, and then
run the tests against /tmp/crush-new-tuned.bin instead:

crushtool --enable-unsafe-tunables \
   --set-choose-local-tries 0 \
   --set-choose-local-fallback-tries 0 \
   --set-choose-total-tries 50 \
   -i /tmp/crush-new.bin -o /tmp/crush-new-tuned.bin
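
For completeness: once the tests against the tuned map come back clean,
the new map still has to be loaded into the cluster.  Something like
the following should do it (only a sketch, not part of the steps quoted
below; expect data movement once the new map takes effect):

# re-check the tuned map, then activate it
crushtool -i /tmp/crush-new-tuned.bin --test --show-bad-mappings
ceph osd setcrushmap -i /tmp/crush-new-tuned.bin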

>
> If the OP (sorry, Craig, you do have a name ;) wants to play with CRUSH
> map rules, here's the quick and dirty of what I did:
>
> # get the current 'orig' CRUSH map, decompile and edit; see:
> #
> http://ceph.com/docs/master/rados/operations/crush-map/#editing-a-crush-map
>
> ceph osd getcrushmap -o /tmp/crush-orig.bin
> crushtool -d /tmp/crush-orig.bin -o /tmp/crush.txt
> $EDITOR /tmp/crush.txt
>
> # edit the crush map with your fave editor; see:
> # http://ceph.com/docs/master/rados/operations/crush-map
> #
> # in my case, I added the bank type:
>
> type 0 osd
> type 1 bank
> type 2 host
> type 3 rack
> type 4 root
>
> # the banks (repeat as applicable):
>
> bank bank0 {
>          id -6
>          alg straw
>          hash 0
>          item osd.0 weight 1.000
>          item osd.1 weight 1.000
> }
>
> bank bank1 {
>          id -7
>          alg straw
>          hash 0
>          item osd.2 weight 1.000
>          item osd.3 weight 1.000
> }
>
> # updated the hosts (repeat as applicable):
>
> host host0 {
>          id -4           # do not change unnecessarily
>          # weight 3.000
>          alg straw
>          hash 0  # rjenkins1
>          item bank0 weight 2.000
>          item bank1 weight 2.000
> }
>
> # and added the rule:
>
> rule by_bank {
>          ruleset 3
>          type replicated
>          min_size 3
>          max_size 4
>          step take default
>          step choose firstn 0 type bank
>          step choose firstn 0 type osd
>          step emit
> }
>
> # compile the crush map:
>
> crushtool -c /tmp/crush.txt -o /tmp/crush-new.bin
>
> # and run some tests; the replica sizes tested come from
> # 'min_size' and 'max_size' in the above rule; see:
> # http://ceph.com/docs/master/man/8/crushtool/#running-tests-with-test
> #
> # show sample PG->OSD maps:
>
> crushtool -i /tmp/crush-new.bin --test --show-statistics
>
> # show bad mappings; if the CRUSH map is correct,
> # this should be empty:
>
> crushtool -i /tmp/crush-new.bin --test --show-bad-mappings
>
> # show per-OSD pg utilization:
>
> crushtool -i /tmp/crush-new.bin --test --show-utilization
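
To exercise just the new rule at a specific replica count, crushtool
also accepts --rule and --num-rep, and --show-mappings prints the raw
PG->OSD maps.  Something along these lines (untested here) should show
the actual placements for the by_bank rule:

crushtool -i /tmp/crush-new.bin --test --rule 3 --num-rep 4 --show-mappings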
>
>
>> You might finagle something like that (again the rule splits on
>> hosts) by having multiple "hosts" on one physical machine, but
>> therein lies madness.
>
> Well, the bucket names can be changed, as above, and Sage hints at doing
> something like this here:
>
> http://wiki.ceph.com/Planning/Blueprints/Dumpling/extend_crush_rule_language
>
>
> (And IIUC he also proposes something to implement my original
> intentions:  distribute four replicas, two on each of two racks, and
> don't put two replicas on the same host within a rack; this is more
> easily generalized than the above funky configuration.)
>
>      John
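
Re the four-replica, two-racks-with-distinct-hosts idea above: if I
read the existing rule language right, the basic case looks expressible
already.  The following is only an untested sketch (rule name and
ruleset number invented); Sage's blueprint is about generalizing beyond
what a fixed rule like this can express:

rule two_per_rack {
         ruleset 4
         type replicated
         min_size 4
         max_size 4
         step take default
         step choose firstn 2 type rack
         step chooseleaf firstn 2 type host
         step emit
}

That should pick two racks, then two distinct hosts within each rack,
for four replicas in total.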

