On Wed, 28 Nov 2018, ningt0509@xxxxxxxxx wrote:
> I configured two environments.
>
> 1. First environment:
> Four hosts, one EC storage pool with k=4, m=2. The CRUSH rule is as
> follows:
>
> rule ec_4_2 {
>         id 1
>         type erasure
>         min_size 3
>         max_size 6
>         step set_chooseleaf_tries 5
>         step set_choose_tries 400
>         step take default
>         step choose indep 0 type host
>         step chooseleaf indep 2 type osd
>         step emit
> }
>
> When I shut down one of the hosts and waited for the OSDs on that host
> to be marked out, the PGs could not return to the active+clean state.

This is a design limitation of the way CRUSH rules are currently
implemented.  The first 'step choose indep 0 type host' step is done
blindly, without considering/noticing that all of the OSDs on host0 are
down and host0 must be avoided.  (That is the primary difference between
choose and chooseleaf.)

Currently, with a k=4,m=2 erasure code, you'll need 7 hosts to tolerate a
host failure.

sage

> ID  CLASS WEIGHT   TYPE NAME      STATUS REWEIGHT PRI-AFF
> -1        12.00000 root default
> -5         3.00000     host host0
>  0    ssd  1.00000         osd.0    down        0 1.00000
>  1    ssd  1.00000         osd.1    down        0 1.00000
>  2    ssd  1.00000         osd.2    down        0 1.00000
> -7         3.00000     host host1
>  3    ssd  1.00000         osd.3      up  1.00000 1.00000
>  4    ssd  1.00000         osd.4      up  1.00000 1.00000
>  5    ssd  1.00000         osd.5      up  1.00000 1.00000
> -9         3.00000     host host2
>  6    ssd  1.00000         osd.6      up  1.00000 1.00000
>  7    ssd  1.00000         osd.7      up  1.00000 1.00000
>  8    ssd  1.00000         osd.8      up  1.00000 1.00000
> -11        3.00000     host host3
>  9    ssd  1.00000         osd.9      up  1.00000 1.00000
> 10    ssd  1.00000         osd.10     up  1.00000 1.00000
> 11    ssd  1.00000         osd.11     up  1.00000 1.00000
>
>   cluster:
>     id:     5e527773-9873-4100-bcce-19a1eaf6e496
>     health: HEALTH_OK
>
>   services:
>     mon: 1 daemons, quorum a
>     mgr: x(active)
>     osd: 12 osds: 9 up, 9 in
>
>   data:
>     pools:   1 pools, 32 pgs
>     objects: 0 objects, 0 bytes
>     usage:   9238 MB used, 82921 MB / 92160 MB avail
>     pgs:     26 active+undersized
>              6 active+clean
>
> 2. Second environment:
> Eight hosts, one EC storage pool with k=4, m=2. The CRUSH rule is as
> follows:
>
> rule ec_4_2 {
>         id 1
>         type erasure
>         min_size 3
>         max_size 6
>         step set_chooseleaf_tries 5
>         step set_choose_tries 400
>         step take default
>         step chooseleaf indep 0 type host
>         step emit
> }
>
> After I shut down one host and waited for the OSDs on that host to be
> marked out, the PGs could return to active+clean.
>
> If I change the CRUSH rule to something like this:
>
> rule ec_4_2 {
>         id 1
>         type erasure
>         min_size 3
>         max_size 6
>         step set_chooseleaf_tries 5
>         step set_choose_tries 400
>         step take default
>         step choose indep 0 type host
>         step chooseleaf indep 1 type osd
>         step emit
> }
>
> the PGs could not recover to active+clean after one of the hosts went
> down.
>
> Analyzing the code for the first configuration: after the OSDs under one
> of the hosts are marked out, that host is still passed to
> crush_choose_indep() as input, while with the second configuration it is
> not.  Is there a good way to handle such a scenario?
>
> crush_do_rule()
> {
>         ...
>         /* cap this step's output at the remaining result slots */
>         out_size = ((numrep < (result_max-osize)) ?
>                     numrep : (result_max-osize));
>         crush_choose_indep(
>                 map,
>                 cw,
>                 map->buckets[bno],
>                 weight, weight_max,
>                 x, out_size, numrep,
>                 curstep->arg2,
>                 o+osize, j,
>                 choose_tries,
>                 choose_leaf_tries ? choose_leaf_tries : 1,
>                 recurse_to_leaf,
>                 c+osize,
>                 0,
>                 choose_args);
>         osize += out_size;
>         ...
> }
>
> --------------
> ningt0509@xxxxxxxxx
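
The behaviour above can be checked offline with crushtool's test mode
before a rule is deployed.  A minimal sketch, assuming the compiled CRUSH
map has been saved to a file named "crushmap" (the filename is arbitrary,
and osd.0-osd.2 stand in for the OSDs on the failed host):

    # extract the cluster's compiled CRUSH map, then replay rule 1 with
    # 6 replicas while simulating osd.0-osd.2 being marked out
    $ ceph osd getcrushmap -o crushmap
    $ crushtool -i crushmap --test --rule 1 --num-rep 6 \
          --weight 0 0 --weight 1 0 --weight 2 0 \
          --show-bad-mappings

crushtool's --weight option sets the per-OSD reweight vector, which is
what marking an OSD out changes, so weighting the three OSDs to 0 mimics
the failure scenario described above.  With the 'choose indep 0 type
host' rule this should flag mappings that fail to fill all six slots;
with the plain 'chooseleaf indep 0 type host' rule and at least seven
hosts it should report none.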
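
As for handling the four-host case: marking an OSD out only zeroes its
weight in the OSD map, which the blind 'choose indep 0 type host' step
never consults; the host bucket keeps its CRUSH weight, so it keeps being
selected.  A possible workaround (a sketch, not something verified in
this thread) is to zero the CRUSH weights of the dead OSDs instead:

    # drop the CRUSH weight (not just the osdmap weight) of each dead OSD
    $ ceph osd crush reweight osd.0 0
    $ ceph osd crush reweight osd.1 0
    $ ceph osd crush reweight osd.2 0

That drops host0's bucket weight to 0, so the choose step should skip it
and the remaining three hosts can each supply two OSDs.  Removing the
dead OSDs from the CRUSH map ('ceph osd crush remove osd.0', etc.) has
the same effect.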