So, I did it now, and also removed another one. This is what the cluster looks like now:
# ceph health detail
HEALTH_WARN 1 pgs down; 6 pgs incomplete; 6 pgs stuck inactive; 6 pgs stuck unclean; 3 requests are blocked > 32 sec; 2 osds have slow requests
pg 0.3 is stuck inactive for 249715.738300, current state incomplete, last acting [1,4,6]
pg 0.38 is stuck inactive for 249883.557050, current state incomplete, last acting [1,4,2]
pg 0.43 is stuck inactive for 249715.738321, current state incomplete, last acting [2,1,4]
pg 0.78 is stuck inactive since forever, current state incomplete, last acting [6,4,3]
pg 0.27 is stuck inactive for 249732.530015, current state incomplete, last acting [2,4,7]
pg 0.67 is stuck inactive for 249732.530021, current state down+incomplete, last acting [2,1,3]
pg 0.3 is stuck unclean for 249715.738363, current state incomplete, last acting [1,4,6]
pg 0.38 is stuck unclean for 249883.557106, current state incomplete, last acting [1,4,2]
pg 0.43 is stuck unclean for 249715.738376, current state incomplete, last acting [2,1,4]
pg 0.78 is stuck unclean since forever, current state incomplete, last acting [6,4,3]
pg 0.27 is stuck unclean for 250117.492547, current state incomplete, last acting [2,4,7]
pg 0.67 is stuck unclean for 250117.492554, current state down+incomplete, last acting [2,1,3]
pg 0.27 is incomplete, acting [2,4,7]
pg 0.3 is incomplete, acting [1,4,6]
pg 0.78 is incomplete, acting [6,4,3]
pg 0.67 is down+incomplete, acting [2,1,3]
pg 0.43 is incomplete, acting [2,1,4]
pg 0.38 is incomplete, acting [1,4,2]
3 ops are blocked > 67108.9 sec
2 ops are blocked > 67108.9 sec on osd.1
1 ops are blocked > 67108.9 sec on osd.2
2 osds have slow requests
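(For reference, a decompiled CRUSH map like the one below can be obtained with the usual getcrushmap/crushtool sequence; the file names here are just examples:

ceph osd getcrushmap -o crushmap.bin        # grab the compiled map from the monitors
crushtool -d crushmap.bin -o crushmap.txt   # decompile it into the text form shown below
)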
######## CRUSH MAP
# begin crush map
tunable choose_local_tries 0
tunable choose_local_fallback_tries 0
tunable choose_total_tries 50
tunable chooseleaf_descend_once 1
tunable straw_calc_version 1
# devices
device 0 device0
device 1 osd.1
device 2 osd.2
device 3 osd.3
device 4 osd.4
device 5 device5
device 6 osd.6
device 7 osd.7
# types
type 0 osd
type 1 host
type 2 chassis
type 3 rack
type 4 row
type 5 pdu
type 6 pod
type 7 room
type 8 datacenter
type 9 region
type 10 root
# buckets
host pxm00node01 {
id -2 # do not change unnecessarily
# weight 0.000
alg straw
hash 0 # rjenkins1
}
host pmx00node03 {
id -3 # do not change unnecessarily
# weight 0.540
alg straw
hash 0 # rjenkins1
item osd.1 weight 0.540
}
host pmx00node04 {
id -5 # do not change unnecessarily
# weight 0.530
alg straw
hash 0 # rjenkins1
item osd.2 weight 0.530
}
host pmx00node01 {
id -6 # do not change unnecessarily
# weight 1.080
alg straw
hash 0 # rjenkins1
item osd.6 weight 0.540
item osd.7 weight 0.540
}
host pmx00node02 {
id -7 # do not change unnecessarily
# weight 0.530
alg straw
hash 0 # rjenkins1
item osd.3 weight 0.530
}
host pmx00node05 {
id -8 # do not change unnecessarily
# weight 0.530
alg straw
hash 0 # rjenkins1
item osd.4 weight 0.530
}
root default {
id -1 # do not change unnecessarily
# weight 3.210
alg straw
hash 0 # rjenkins1
item pxm00node01 weight 0.000
item pmx00node03 weight 0.540
item pmx00node04 weight 0.530
item pmx00node01 weight 1.080
item pmx00node02 weight 0.530
item pmx00node05 weight 0.530
}
# rules
rule replicated_ruleset {
ruleset 0
type replicated
min_size 1
max_size 10
step take default
step chooseleaf firstn 0 type host
step emit
}
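(To check which OSD sets this rule would actually pick with the current map, crushtool can simulate it offline; crushmap.bin here stands for the compiled map dumped above:

crushtool -i crushmap.bin --test --rule 0 --num-rep 3 --show-mappings       # list the chosen OSDs per PG
crushtool -i crushmap.bin --test --rule 0 --num-rep 3 --show-bad-mappings   # show mappings with fewer than 3 OSDs
)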
# ceph -w
cluster 338bc0a5-c2f7-4c0a-9b35-25c7afee50c6
health HEALTH_WARN
1 pgs down
6 pgs incomplete
6 pgs stuck inactive
6 pgs stuck unclean
3 requests are blocked > 32 sec
monmap e5: 5 mons at {0=172.20.20.10:6789/0,1=172.20.20.12:6789/0,2=172.20.20.11:6789/0,3=172.20.20.13:6789/0,4=172.20.20.14:6789/0}
election epoch 63616, quorum 0,1,2,3,4 0,2,1,3,4
osdmap e2619: 6 osds: 6 up, 6 in
pgmap v6114202: 128 pgs, 1 pools, 748 GB data, 188 kobjects
2217 GB used, 1072 GB / 3290 GB avail
122 active+clean
5 incomplete
1 down+incomplete
client io 1360 B/s wr, 0 op/s
2016-11-19 07:59:14.109468 mon.0 [INF] pgmap v6114201: 128 pgs: 1 down+incomplete, 122 active+clean, 5 incomplete; 748 GB data, 2217 GB used, 1072 GB / 3290 GB avail; 1228 B/s wr, 0 op/s
2016-11-19 07:59:19.103794 mon.0 [INF] pgmap v6114202: 128 pgs: 1 down+incomplete, 122 active+clean, 5 incomplete; 748 GB data, 2217 GB used, 1072 GB / 3290 GB avail; 1360 B/s wr, 0 op/s
2016-11-19 07:59:24.067300 mon.0 [INF] pgmap v6114203: 128 pgs: 1 down+incomplete, 122 active+clean, 5 incomplete; 748 GB data, 2217 GB used, 1072 GB / 3290 GB avail; 818 B/s wr, 0 op/s
2016-11-19 07:59:25.293723 mon.0 [INF] pgmap v6114204: 128 pgs: 1 down+incomplete, 122 active+clean, 5 incomplete; 748 GB data, 2217 GB used, 1072 GB / 3290 GB avail; 674 B/s wr, 0 op/s
2016-11-19 07:59:29.131633 mon.0 [INF] pgmap v6114205: 128 pgs: 1 down+incomplete, 122 active+clean, 5 incomplete; 748 GB data, 2217 GB used, 1072 GB / 3290 GB avail; 818 B/s wr, 0 op/s
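(A common next step to see why a PG stays incomplete, i.e. which OSDs it is still waiting on, is to query it directly, e.g. for the down+incomplete one:

ceph pg 0.67 query            # shows peering state and the OSDs it is probing/waiting for
ceph pg dump_stuck inactive   # lists all PGs stuck inactive
)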
On Sat, Nov 19, 2016 at 06:28, Goncalo Borges <goncalo.borges@xxxxxxxxxxxxx> wrote:
Your osd.0 was removed from the cluster but not from the crush map. The DNE state means 'does not exist'. I would start by cleaning it up from the crush map (if you are sure it is no longer active) and then debug on the basis of a cleaner map.
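(For reference, the usual sequence to fully drop a dead OSD such as osd.0, and likewise any other DNE entry, is roughly:

ceph osd crush remove osd.0   # remove it from the CRUSH map
ceph auth del osd.0           # delete its cephx key
ceph osd rm 0                 # remove it from the osdmap
ceph osd tree                 # confirm no DNE entries remain
)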