On 24/11/18 09:04, ningt0509@xxxxxxxxx wrote:
There are four hosts in the environment, the storage pool uses EC 4+2, and the CRUSH rule is configured to select two OSDs from each host. When I shut down one host, all of its OSDs are marked down and out, but the PGs cannot return to active+clean. Why can't the PGs be mapped to OSDs on the other hosts? Is there a problem with this setup?
ID  CLASS WEIGHT   TYPE NAME      STATUS REWEIGHT PRI-AFF
 -1       30.00000 root default
 -5        7.00000     host host0
  0   ssd  1.00000         osd.0    down        0 1.00000
  1   ssd  1.00000         osd.1    down        0 1.00000
  2   ssd  1.00000         osd.2    down        0 1.00000
  3   ssd  1.00000         osd.3    down        0 1.00000
  4   ssd  1.00000         osd.4    down        0 1.00000
  5   ssd  1.00000         osd.5    down        0 1.00000
  6   ssd  1.00000         osd.6    down        0 1.00000
 -7        7.00000     host host1
  7   ssd  1.00000         osd.7      up  1.00000 1.00000
  8   ssd  1.00000         osd.8      up  1.00000 1.00000
  9   ssd  1.00000         osd.9      up  1.00000 1.00000
 10   ssd  1.00000         osd.10     up  1.00000 1.00000
 11   ssd  1.00000         osd.11     up  1.00000 1.00000
 12   ssd  1.00000         osd.12     up  1.00000 1.00000
 13   ssd  1.00000         osd.13     up  1.00000 1.00000
 -9        8.00000     host host2
 14   ssd  1.00000         osd.14     up  1.00000 1.00000
 15   ssd  1.00000         osd.15     up  1.00000 1.00000
 16   ssd  1.00000         osd.16     up  1.00000 1.00000
 17   ssd  1.00000         osd.17     up  1.00000 1.00000
 18   ssd  1.00000         osd.18     up  1.00000 1.00000
 19   ssd  1.00000         osd.19     up  1.00000 1.00000
 20   ssd  1.00000         osd.20     up  1.00000 1.00000
 21   ssd  1.00000         osd.21     up  1.00000 1.00000
-11        8.00000     host host3
 29        1.00000         osd.29     up  1.00000 1.00000
 22   ssd  1.00000         osd.22     up  1.00000 1.00000
 23   ssd  1.00000         osd.23     up  1.00000 1.00000
 24   ssd  1.00000         osd.24     up  1.00000 1.00000
 25   ssd  1.00000         osd.25     up  1.00000 1.00000
 26   ssd  1.00000         osd.26     up  1.00000 1.00000
 27   ssd  1.00000         osd.27     up  1.00000 1.00000
 28   ssd  1.00000         osd.28     up  1.00000 1.00000
  cluster:
    id:     d24174ae-a1bf-43f9-a8f3-a10246988ab7
    health: HEALTH_WARN
            Reduced data availability: 413 pgs inactive
            Degraded data redundancy: 414 pgs undersized

  services:
    mon: 1 daemons, quorum a
    mgr: x(active)
    osd: 30 osds: 23 up, 23 in; 3 remapped pgs

  data:
    pools:   1 pools, 512 pgs
    objects: 0 objects, 0 bytes
    usage:   24026 MB used, 206 GB / 230 GB avail
    pgs:     80.664% pgs not active
             413 undersized+peered
             96  active+clean
             2   active+clean+remapped
             1   active+undersized+remapped
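For reference, a couple of commands that can show why these PGs are stuck (the PG id below is only an example; pick one from the dump_stuck output):

    ceph pg dump_stuck inactive    # list PGs stuck in a non-active state
    ceph pg 1.0 query              # show up/acting sets and peering state for one PG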
The Ceph environment configuration is as follows:
CRUSH rule:

rule ec_4_2 {
        id 1
        type erasure
        min_size 3
        max_size 6
        step set_chooseleaf_tries 5
        step set_choose_tries 400
        step take default
        step choose indep 0 type host
        step chooseleaf indep 2 type osd
        step emit
}
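One way to check what this rule actually maps to, including after the failed host's OSDs are weighted out, is to test the compiled CRUSH map offline. The file name below is just an example; rule id 1 matches the rule above:

    ceph osd getcrushmap -o crushmap.bin
    # show the OSDs chosen for a range of sample inputs with 6 chunks per PG
    crushtool -i crushmap.bin --test --rule 1 --num-rep 6 --show-mappings
    # simulate the host0 outage by zero-weighting its OSDs for the test run,
    # and report any mappings that come back with fewer than 6 OSDs
    crushtool -i crushmap.bin --test --rule 1 --num-rep 6 --show-bad-mappings \
        --weight 0 0 --weight 1 0 --weight 2 0 --weight 3 0 \
        --weight 4 0 --weight 5 0 --weight 6 0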
Pool:
pool 1 'ec_4_2' erasure size 6 min_size 5 origin_min_size 0 crush_rule 1 object_hash rjenkins pg_num 512 pgp_num 512 last_change 94 flags hashpspool stripe_width 16384
--------------
ningt0509@xxxxxxxxx
Try temporarily setting your pool min_size to 4 rather than 5 to kick
start the recovery.
I believe this is a feature/bug where EC pools require min_size chunks
to be available before recovery starts, rather than just k chunks.
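In practice that would be something like the following, using the pool
name from the dump above (and setting it back once the cluster is
healthy again):

    ceph osd pool set ec_4_2 min_size 4   # allow I/O and recovery with only k=4 chunks available
    ceph osd pool set ec_4_2 min_size 5   # restore the safer k+1 value after recovery completes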
Maged