On Wed, 28 Nov 2018, ningt0509@xxxxxxxxx wrote:
> I configured two environments.
>
> 1. First environment:
> Four hosts, one EC storage pool with k=4, m=2. The CRUSH rule is as
> follows:
>
> rule ec_4_2 {
>         id 1
>         type erasure
>         min_size 3
>         max_size 6
>         step set_chooseleaf_tries 5
>         step set_choose_tries 400
>         step take default
>         step choose indep 0 type host
>         step chooseleaf indep 2 type osd
>         step emit
> }
>
> When I shut down one of the hosts and waited for the OSDs on that host
> to be marked out, the PGs could not return to the active+clean state.

This is a design limitation of the way CRUSH rules are currently
implemented.  The first 'step choose indep 0 type host' step is done
blindly, without considering/noticing that all of the OSDs on host0 are
down and host0 must be avoided.  (That is the primary difference between
choose and chooseleaf.)

Currently, with a k=4,m=2 erasure code, you'll need 7 hosts to tolerate a
host failure.

sage

> ID  CLASS WEIGHT   TYPE NAME      STATUS REWEIGHT PRI-AFF
> -1        12.00000 root default
> -5         3.00000     host host0
>  0    ssd  1.00000         osd.0    down        0 1.00000
>  1    ssd  1.00000         osd.1    down        0 1.00000
>  2    ssd  1.00000         osd.2    down        0 1.00000
> -7         3.00000     host host1
>  3    ssd  1.00000         osd.3      up  1.00000 1.00000
>  4    ssd  1.00000         osd.4      up  1.00000 1.00000
>  5    ssd  1.00000         osd.5      up  1.00000 1.00000
> -9         3.00000     host host2
>  6    ssd  1.00000         osd.6      up  1.00000 1.00000
>  7    ssd  1.00000         osd.7      up  1.00000 1.00000
>  8    ssd  1.00000         osd.8      up  1.00000 1.00000
> -11        3.00000     host host3
>  9    ssd  1.00000         osd.9      up  1.00000 1.00000
> 10    ssd  1.00000         osd.10     up  1.00000 1.00000
> 11    ssd  1.00000         osd.11     up  1.00000 1.00000
>
>   cluster:
>     id:     5e527773-9873-4100-bcce-19a1eaf6e496
>     health: HEALTH_OK
>
>   services:
>     mon: 1 daemons, quorum a
>     mgr: x(active)
>     osd: 12 osds: 9 up, 9 in
>
>   data:
>     pools:   1 pools, 32 pgs
>     objects: 0 objects, 0 bytes
>     usage:   9238 MB used, 82921 MB / 92160 MB avail
>     pgs:     26 active+undersized
>              6 active+clean
>
> 2. Second environment:
> Eight hosts, one EC storage pool with k=4, m=2. The CRUSH rule is as
> follows:
>
> rule ec_4_2 {
>         id 1
>         type erasure
>         min_size 3
>         max_size 6
>         step set_chooseleaf_tries 5
>         step set_choose_tries 400
>         step take default
>         step chooseleaf indep 0 type host
>         step emit
> }
>
> After I shut down one host and waited for the OSDs on that host to be
> marked out, the PGs could return to active+clean.
>
> If I change the CRUSH rule to something like this:
>
> rule ec_4_2 {
>         id 1
>         type erasure
>         min_size 3
>         max_size 6
>         step set_chooseleaf_tries 5
>         step set_choose_tries 400
>         step take default
>         step choose indep 0 type host
>         step chooseleaf indep 1 type osd
>         step emit
> }
>
> the PGs could not recover to active+clean after one of the hosts went
> down.
>
> Analyzing the code for the first configuration: after the OSDs under one
> of the hosts are marked out, that host is still passed to
> crush_choose_indep() as input, while with the second configuration it is
> not.  Is there a good way to handle such a scenario?
>
> crush_do_rule()
> {
>         ...
>         /* cap this step's output at the remaining result slots */
>         out_size = ((numrep < (result_max-osize)) ?
>                     numrep : (result_max-osize));
>         crush_choose_indep(
>                 map,
>                 cw,
>                 map->buckets[bno],
>                 weight, weight_max,
>                 x, out_size, numrep,
>                 curstep->arg2,
>                 o+osize, j,
>                 choose_tries,
>                 choose_leaf_tries ? choose_leaf_tries : 1,
>                 recurse_to_leaf,
>                 c+osize,
>                 0,
>                 choose_args);
>         osize += out_size;
>         ...
> }
>
> --------------
> ningt0509@xxxxxxxxx
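
The behaviour above can be checked offline with crushtool's test mode
before a rule is deployed.  A minimal sketch, assuming the compiled CRUSH
map has been saved to a file named "crushmap" (the filename is arbitrary,
and osd.0-osd.2 stand in for the OSDs on the failed host):

    # extract the cluster's compiled CRUSH map, then replay rule 1 with
    # 6 replicas while simulating osd.0-osd.2 being marked out
    $ ceph osd getcrushmap -o crushmap
    $ crushtool -i crushmap --test --rule 1 --num-rep 6 \
          --weight 0 0 --weight 1 0 --weight 2 0 \
          --show-bad-mappings

crushtool's --weight option sets the per-OSD reweight vector, which is
what marking an OSD out changes, so weighting the three OSDs to 0 mimics
the failure scenario described above.  With the 'choose indep 0 type
host' rule this should flag mappings that fail to fill all six slots;
with the plain 'chooseleaf indep 0 type host' rule and at least seven
hosts it should report none.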
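
As for handling the four-host case: marking an OSD out only zeroes its
weight in the OSD map, which the blind 'choose indep 0 type host' step
never consults; the host bucket keeps its CRUSH weight, so it keeps being
selected.  A possible workaround (a sketch, not something verified in
this thread) is to zero the CRUSH weights of the dead OSDs instead:

    # drop the CRUSH weight (not just the osdmap weight) of each dead OSD
    $ ceph osd crush reweight osd.0 0
    $ ceph osd crush reweight osd.1 0
    $ ceph osd crush reweight osd.2 0

That drops host0's bucket weight to 0, so the choose step should skip it
and the remaining three hosts can each supply two OSDs.  Removing the
dead OSDs from the CRUSH map ('ceph osd crush remove osd.0', etc.) has
the same effect.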