Health_Warn recovery stuck / crushmap problem?

All OSDs and monitors are up, as far as I can see.
I read through the PG troubleshooting section in the Ceph documentation and came to the conclusion that nothing there would help me, so I didn't try anything except restarting / rebooting the OSDs and monitors.

How do I recover from this? It looks to me like the data itself should be safe for now, but why is it not recovering?
My guess is that the problem is the crushmap.
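
(In case it matters: the two decompiled maps further down were pulled with roughly the following commands; the file names are just what I used locally.)

# ceph osd getcrushmap -o crushmap.bin
# crushtool -d crushmap.bin -o crushmap.txt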

Here are some outputs:

#ceph health detail

HEALTH_WARN 475 pgs degraded; 640 pgs stale; 475 pgs stuck degraded; 640 pgs stuck stale; 640 pgs stuck unclean; 475 pgs stuck undersized; 475 pgs undersized; recovery 104812/279550 objects degraded (37.493%); recovery 69926/279550 objects misplaced (25.014%)
pg 3.ec is stuck unclean for 3326815.935321, current state stale+active+remapped, last acting [7,6]
pg 3.ed is stuck unclean for 3288818.682456, current state stale+active+remapped, last acting [6,7]
pg 3.ee is stuck unclean for 409973.052061, current state stale+active+undersized+degraded, last acting [7]
pg 3.ef is stuck unclean for 3357894.554762, current state stale+active+undersized+degraded, last acting [7]
pg 3.e8 is stuck unclean for 384815.518837, current state stale+active+undersized+degraded, last acting [6]
pg 3.e9 is stuck unclean for 3274554.591000, current state stale+active+remapped, last acting [6,7]
......
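
I am not sure which ruleset each pool is actually using at the moment; I assume something along these lines would show it, plus which hosts/roots the OSDs currently sit under (pool 3 is the one the stuck PGs above belong to):

# ceph osd dump | grep pool
# ceph osd tree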

################################################################################

This is the crushmap I created, intended to use, and thought I had been using for the past 2 months (a crushtool sanity check follows right after the map):
- pvestorage1-ssd and pvestorage1-platter are actually the same host; it seems that is not possible, but I never noticed
- the same goes for pvestorage2-ssd and pvestorage2-platter

# begin crush map
tunable choose_local_tries 0
tunable choose_local_fallback_tries 0
tunable choose_total_tries 50
tunable chooseleaf_descend_once 1
tunable straw_calc_version 1

# devices
device 0 osd.0
device 1 osd.1
device 2 osd.2
device 3 osd.3
device 4 osd.4
device 5 osd.5
device 6 osd.6
device 7 osd.7

# types
type 0 osd
type 1 host
type 2 chassis
type 3 rack
type 4 row
type 5 pdu
type 6 pod
type 7 room
type 8 datacenter
type 9 region
type 10 root

# buckets
host pvestorage1-ssd {
        id -2   # do not change unnecessarily
        # weight 1.740
        alg straw
        hash 0  # rjenkins1
        item osd.0 weight 0.870
        item osd.1 weight 0.870
}
host pvestorage2-ssd {
        id -3   # do not change unnecessarily
        # weight 1.740
        alg straw
        hash 0  # rjenkins1
        item osd.2 weight 0.870
        item osd.3 weight 0.870
}
host pvestorage1-platter {
        id -4           # do not change unnecessarily
        # weight 4
        alg straw
        hash 0  # rjenkins1
        item osd.4 weight 2.000
        item osd.5 weight 2.000
}
host pvestorage2-platter {
        id -5           # do not change unnecessarily
        # weight 4
        alg straw
        hash 0  # rjenkins1
        item osd.6 weight 2.000
        item osd.7 weight 2.000
}

root ssd {
        id -1   # do not change unnecessarily
        # weight 3.480
        alg straw
        hash 0  # rjenkins1
        item pvestorage1-ssd weight 1.740
        item pvestorage2-ssd weight 1.740
}

root platter {
        id -6           # do not change unnecessarily
        # weight 8
        alg straw
        hash 0  # rjenkins1
        item pvestorage1-platter weight 4.000
        item pvestorage2-platter weight 4.000
}

# rules
rule ssd {
        ruleset 0
        type replicated
        min_size 1
        max_size 10
        step take ssd
        step chooseleaf firstn 0 type host
        step emit
}

rule platter {
        ruleset 1
        type replicated
        min_size 1
        max_size 10
        step take platter
        step chooseleaf firstn 0 type host
        step emit
}
# end crush map
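
(A compile / mapping sanity check that I assume should work for the map above; the rule numbers and the replica count of 2 are just taken from the rules and the pool size as I understand them.)

# crushtool -c crushmap-intended.txt -o crushmap-intended.bin
# crushtool -i crushmap-intended.bin --test --rule 0 --num-rep 2 --show-mappings | head
# crushtool -i crushmap-intended.bin --test --rule 1 --num-rep 2 --show-mappings | head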
################################################################################

This is what Ceph made of that crushmap, and it is the one that is actually in use right now; I never looked -_- :

# begin crush map
tunable choose_local_tries 0
tunable choose_local_fallback_tries 0
tunable choose_total_tries 50
tunable chooseleaf_descend_once 1
tunable straw_calc_version 1

# devices
device 0 osd.0
device 1 osd.1
device 2 osd.2
device 3 osd.3
device 4 osd.4
device 5 osd.5
device 6 osd.6
device 7 osd.7

# types
type 0 osd
type 1 host
type 2 chassis
type 3 rack
type 4 row
type 5 pdu
type 6 pod
type 7 room
type 8 datacenter
type 9 region
type 10 root

# buckets
host pvestorage1-ssd {
        id -2   # do not change unnecessarily
        # weight 0.000
        alg straw
        hash 0  # rjenkins1
}
host pvestorage2-ssd {
        id -3   # do not change unnecessarily
        # weight 0.000
        alg straw
        hash 0  # rjenkins1
}
root ssd {
        id -1   # do not change unnecessarily
        # weight 0.000
        alg straw
        hash 0  # rjenkins1
        item pvestorage1-ssd weight 0.000
        item pvestorage2-ssd weight 0.000
}
host pvestorage1-platter {
        id -4   # do not change unnecessarily
        # weight 0.000
        alg straw
        hash 0  # rjenkins1
}
host pvestorage2-platter {
        id -5   # do not change unnecessarily
        # weight 0.000
        alg straw
        hash 0  # rjenkins1
}
root platter {
        id -6   # do not change unnecessarily
        # weight 0.000
        alg straw
        hash 0  # rjenkins1
        item pvestorage1-platter weight 0.000
        item pvestorage2-platter weight 0.000
}
host pvestorage1 {
        id -7   # do not change unnecessarily
        # weight 5.740
        alg straw
        hash 0  # rjenkins1
        item osd.5 weight 2.000
        item osd.4 weight 2.000
        item osd.1 weight 0.870
        item osd.0 weight 0.870
}
host pvestorage2 {
        id -9   # do not change unnecessarily
        # weight 5.740
        alg straw
        hash 0  # rjenkins1
        item osd.3 weight 0.870
        item osd.2 weight 0.870
        item osd.6 weight 2.000
        item osd.7 weight 2.000
}
root default {
        id -8   # do not change unnecessarily
        # weight 11.480
        alg straw
        hash 0  # rjenkins1
        item pvestorage1 weight 5.740
        item pvestorage2 weight 5.740
}

# rules
rule ssd {
        ruleset 0
        type replicated
        min_size 1
        max_size 10
        step take ssd
        step chooseleaf firstn 0 type host
        step emit
}
rule platter {
        ruleset 1
        type replicated
        min_size 1
        max_size 10
        step take platter
        step chooseleaf firstn 0 type host
        step emit
}

# end crush map
################################################################################

How do I recover from this?
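
My own guess at the recovery would be to decompile the map that is currently in use, put the OSDs back under the -ssd / -platter hosts and roots with the weights from the intended map, and inject it again, roughly like this (taken from the documentation as I understand it, so please correct me if this would make things worse):

# ceph osd getcrushmap -o crush-current.bin
# crushtool -d crush-current.bin -o crush-current.txt
  (edit crush-current.txt so the OSDs are back under the intended hosts and roots)
# crushtool -c crush-current.txt -o crush-new.bin
# ceph osd setcrushmap -i crush-new.bin

Is that the right direction, or is there a safer way?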

Best Regards
Jonas