So, I did it now, and also removed another one. This is what the cluster looks like now:
# ceph health detail
HEALTH_WARN 1 pgs down; 6 pgs incomplete; 6 pgs stuck inactive; 6 pgs stuck unclean; 3 requests are blocked > 32 sec; 2 osds have slow requests
pg 0.3 is stuck inactive for 249715.738300, current state incomplete, last acting [1,4,6]
pg 0.38 is stuck inactive for 249883.557050, current state incomplete, last acting [1,4,2]
pg 0.43 is stuck inactive for 249715.738321, current state incomplete, last acting [2,1,4]
pg 0.78 is stuck inactive since forever, current state incomplete, last acting [6,4,3]
pg 0.27 is stuck inactive for 249732.530015, current state incomplete, last acting [2,4,7]
pg 0.67 is stuck inactive for 249732.530021, current state down+incomplete, last acting [2,1,3]
pg 0.3 is stuck unclean for 249715.738363, current state incomplete, last acting [1,4,6]
pg 0.38 is stuck unclean for 249883.557106, current state incomplete, last acting [1,4,2]
pg 0.43 is stuck unclean for 249715.738376, current state incomplete, last acting [2,1,4]
pg 0.78 is stuck unclean since forever, current state incomplete, last acting [6,4,3]
pg 0.27 is stuck unclean for 250117.492547, current state incomplete, last acting [2,4,7]
pg 0.67 is stuck unclean for 250117.492554, current state down+incomplete, last acting [2,1,3]
pg 0.27 is incomplete, acting [2,4,7]
pg 0.3 is incomplete, acting [1,4,6]
pg 0.78 is incomplete, acting [6,4,3]
pg 0.67 is down+incomplete, acting [2,1,3]
pg 0.43 is incomplete, acting [2,1,4]
pg 0.38 is incomplete, acting [1,4,2]
3 ops are blocked > 67108.9 sec
2 ops are blocked > 67108.9 sec on osd.1
1 ops are blocked > 67108.9 sec on osd.2
2 osds have slow requests
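(For reference, a decompiled CRUSH map like the one below can be obtained with the usual getcrushmap/crushtool sequence; the file names here are just examples:

ceph osd getcrushmap -o crushmap.bin        # grab the compiled map from the monitors
crushtool -d crushmap.bin -o crushmap.txt   # decompile it into the text form shown below
)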
######## CRUSH MAP
# begin crush map
tunable choose_local_tries 0
tunable choose_local_fallback_tries 0
tunable choose_total_tries 50
tunable chooseleaf_descend_once 1
tunable straw_calc_version 1
# devices
device 0 device0
device 1 osd.1
device 2 osd.2
device 3 osd.3
device 4 osd.4
device 5 device5
device 6 osd.6
device 7 osd.7
# types
type 0 osd
type 1 host
type 2 chassis
type 3 rack
type 4 row
type 5 pdu
type 6 pod
type 7 room
type 8 datacenter
type 9 region
type 10 root
# buckets
host pxm00node01 {
id -2 # do not change unnecessarily
# weight 0.000
alg straw
hash 0 # rjenkins1
}
host pmx00node03 {
id -3 # do not change unnecessarily
# weight 0.540
alg straw
hash 0 # rjenkins1
item osd.1 weight 0.540
}
host pmx00node04 {
id -5 # do not change unnecessarily
# weight 0.530
alg straw
hash 0 # rjenkins1
item osd.2 weight 0.530
}
host pmx00node01 {
id -6 # do not change unnecessarily
# weight 1.080
alg straw
hash 0 # rjenkins1
item osd.6 weight 0.540
item osd.7 weight 0.540
}
host pmx00node02 {
id -7 # do not change unnecessarily
# weight 0.530
alg straw
hash 0 # rjenkins1
item osd.3 weight 0.530
}
host pmx00node05 {
id -8 # do not change unnecessarily
# weight 0.530
alg straw
hash 0 # rjenkins1
item osd.4 weight 0.530
}
root default {
id -1 # do not change unnecessarily
# weight 3.210
alg straw
hash 0 # rjenkins1
item pxm00node01 weight 0.000
item pmx00node03 weight 0.540
item pmx00node04 weight 0.530
item pmx00node01 weight 1.080
item pmx00node02 weight 0.530
item pmx00node05 weight 0.530
}
# rules
rule replicated_ruleset {
ruleset 0
type replicated
min_size 1
max_size 10
step take default
step chooseleaf firstn 0 type host
step emit
}
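(To check which OSD sets this rule would actually pick with the current map, crushtool can simulate it offline; crushmap.bin here stands for the compiled map dumped above:

crushtool -i crushmap.bin --test --rule 0 --num-rep 3 --show-mappings       # list the chosen OSDs per PG
crushtool -i crushmap.bin --test --rule 0 --num-rep 3 --show-bad-mappings   # show mappings with fewer than 3 OSDs
)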
# ceph -w
cluster 338bc0a5-c2f7-4c0a-9b35-25c7afee50c6
health HEALTH_WARN
1 pgs down
6 pgs incomplete
6 pgs stuck inactive
6 pgs stuck unclean
3 requests are blocked > 32 sec
monmap e5: 5 mons at {0=172.20.20.10:6789/0,1=172.20.20.12:6789/0,2=172.20.20.11:6789/0,3=172.20.20.13:6789/0,4=172.20.20.14:6789/0}
election epoch 63616, quorum 0,1,2,3,4 0,2,1,3,4
osdmap e2619: 6 osds: 6 up, 6 in
pgmap v6114202: 128 pgs, 1 pools, 748 GB data, 188 kobjects
2217 GB used, 1072 GB / 3290 GB avail
122 active+clean
5 incomplete
1 down+incomplete
client io 1360 B/s wr, 0 op/s
2016-11-19 07:59:14.109468 mon.0 [INF] pgmap v6114201: 128 pgs: 1 down+incomplete, 122 active+clean, 5 incomplete; 748 GB data, 2217 GB used, 1072 GB / 3290 GB avail; 1228 B/s wr, 0 op/s
2016-11-19 07:59:19.103794 mon.0 [INF] pgmap v6114202: 128 pgs: 1 down+incomplete, 122 active+clean, 5 incomplete; 748 GB data, 2217 GB used, 1072 GB / 3290 GB avail; 1360 B/s wr, 0 op/s
2016-11-19 07:59:24.067300 mon.0 [INF] pgmap v6114203: 128 pgs: 1 down+incomplete, 122 active+clean, 5 incomplete; 748 GB data, 2217 GB used, 1072 GB / 3290 GB avail; 818 B/s wr, 0 op/s
2016-11-19 07:59:25.293723 mon.0 [INF] pgmap v6114204: 128 pgs: 1 down+incomplete, 122 active+clean, 5 incomplete; 748 GB data, 2217 GB used, 1072 GB / 3290 GB avail; 674 B/s wr, 0 op/s
2016-11-19 07:59:29.131633 mon.0 [INF] pgmap v6114205: 128 pgs: 1 down+incomplete, 122 active+clean, 5 incomplete; 748 GB data, 2217 GB used, 1072 GB / 3290 GB avail; 818 B/s wr, 0 op/s
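(A common next step to see why a PG stays incomplete, i.e. which OSDs it is still waiting on, is to query it directly, e.g. for the down+incomplete one:

ceph pg 0.67 query            # shows peering state and the OSDs it is probing/waiting for
ceph pg dump_stuck inactive   # lists all PGs stuck inactive
)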
On Sat, Nov 19, 2016 at 06:28, Goncalo Borges <goncalo.borges@xxxxxxxxxxxxx> wrote:
Your osd.0 was removed from the cluster but not from the crush map. The DNE state means 'does not exist'. I would start by cleaning it up from the crush map (if you are sure it is no longer active) and then debug on the basis of a cleaner map.
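(For reference, the usual sequence to fully drop a dead OSD such as osd.0, and likewise any other DNE entry, is roughly:

ceph osd crush remove osd.0   # remove it from the CRUSH map
ceph auth del osd.0           # delete its cephx key
ceph osd rm 0                 # remove it from the osdmap
ceph osd tree                 # confirm no DNE entries remain
)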