On 08/14/2014 02:35 AM, Christian Balzer wrote:
>
> The default (firefly, but previous ones are functionally identical) crush
> map has:
> ---
> # rules
> rule replicated_ruleset {
>         ruleset 0
>         type replicated
>         min_size 1
>         max_size 10
>         step take default
>         step chooseleaf firstn 0 type host
>         step emit
> }
> ---
>
> The type host states that there will be not more than one replica per host
> (node), so with size=3 you will need at least 3 hosts to choose from.
> If you were to change this to type OSD, all 3 replicas could wind up on
> the same host, not really a good idea.

Ah, this is a great clue.  (On my cluster, the default rule contains
'step choose firstn 0 type osd', and thus has the problem you hint at
here.)

So I played with a new rule set using the bucket types 'root', 'rack',
'host', 'bank' and 'osd', of which 'rack' and 'host' go unused by the
rule.

The 'bank' bucket:  the OSD nodes each contain two 'banks' of disks,
each bank with its own disk controller channel, power supply cable and
SSD.  Thus, 'bank' actually does represent a real failure domain.  More
importantly, it provides a bucket level below 'host' (and above 'osd')
with enough failure domains for 3-4 replicas.

Here's the rule:

rule by_bank {
        ruleset 3
        type replicated
        min_size 3
        max_size 4
        step take default
        step choose firstn 0 type bank
        step choose firstn 0 type osd
        step emit
}

If the OP (sorry, Craig, you do have a name ;) wants to play with CRUSH
map rules, here's the quick and dirty of what I did:

# get the current 'orig' CRUSH map, decompile and edit; see:
# http://ceph.com/docs/master/rados/operations/crush-map/#editing-a-crush-map
ceph osd getcrushmap -o /tmp/crush-orig.bin
crushtool -d /tmp/crush-orig.bin -o /tmp/crush.txt
$EDITOR /tmp/crush.txt

# edit the crush map with your fave editor; see:
# http://ceph.com/docs/master/rados/operations/crush-map
#
# in my case, I added the bank type:
type 0 osd
type 1 bank
type 2 host
type 3 rack
type 4 root

# the banks (repeat as applicable):
bank bank0 {
        id -6
        alg straw
        hash 0
        item osd.0 weight 1.000
        item osd.1 weight 1.000
}
bank bank1 {
        id -7
        alg straw
        hash 0
        item osd.2 weight 1.000
        item osd.3 weight 1.000
}

# updated the hosts (repeat as applicable):
host host0 {
        id -4           # do not change unnecessarily
        # weight 3.000
        alg straw
        hash 0  # rjenkins1
        item bank0 weight 2.000
        item bank1 weight 2.000
}

# and added the rule:
rule by_bank {
        ruleset 3
        type replicated
        min_size 3
        max_size 4
        step take default
        step choose firstn 0 type bank
        step choose firstn 0 type osd
        step emit
}

# compile the crush map:
crushtool -c /tmp/crush.txt -o /tmp/crush-new.bin

# and run some tests; the replica sizes tested come from
# 'min_size' and 'max_size' in the above rule; see:
# http://ceph.com/docs/master/man/8/crushtool/#running-tests-with-test
#
# show sample PG->OSD maps:
crushtool -i /tmp/crush-new.bin --test --show-statistics
# show bad mappings; if the CRUSH map is correct,
# this should be empty:
crushtool -i /tmp/crush-new.bin --test --show-bad-mappings
# show per-OSD pg utilization:
crushtool -i /tmp/crush-new.bin --test --show-utilization

> You might finagle something like that (again the rule splits on hosts) by
> having multiple "hosts" on one physical machine, but therein lies madness.
Well, the bucket names can be changed, as above, and Sage hints at
doing something like this here:

http://wiki.ceph.com/Planning/Blueprints/Dumpling/extend_crush_rule_language

(And IIUC he also proposes something to implement my original
intention:  distribute four replicas, two on each of two racks, and
don't put two replicas on the same host within a rack; this is more
easily generalized than the above funky configuration.)

        John
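P.S.  For what it's worth, I *think* that particular layout (two racks,
two replicas per rack, no two replicas on the same host) may already be
expressible with the current rule language.  An untested sketch,
assuming the map actually has populated 'rack' and 'host' buckets under
'default' (unlike mine), and that the ruleset id 4 and the name
'two_per_rack' are free to use:

rule two_per_rack {
        ruleset 4                           # assumed unused; adjust
        type replicated
        min_size 4
        max_size 4
        step take default
        step choose firstn 2 type rack      # pick two racks
        step chooseleaf firstn 2 type host  # two hosts per rack, one osd each
        step emit
}

The same crushtool --test invocations above would show whether the
mappings actually come out that way.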