All pgs stuck peering

Hi,

ceph 0.94.5

After restarting one of our three osd hosts to increase the RAM and change
from linux 3.18.21 to 4.1, the cluster is stuck with all pgs peering:

# ceph -s
    cluster c6618970-0ce0-4cb2-bc9a-dd5f29b62e24
     health HEALTH_WARN
            3072 pgs peering
            3072 pgs stuck inactive
            3072 pgs stuck unclean
            1450 requests are blocked > 32 sec
            noout flag(s) set
     monmap e9: 3 mons at {b2=10.200.63.130:6789/0,b4=10.200.63.132:6789/0,b5=10.200.63.133:6789/0}
            election epoch 74462, quorum 0,1,2 b2,b4,b5
     osdmap e356963: 59 osds: 59 up, 59 in
            flags noout
      pgmap v69385733: 3072 pgs, 3 pools, 11973 GB data, 3340 kobjects
            31768 GB used, 102 TB / 133 TB avail
                3072 peering

What can I do to diagnose (or better yet, fix!) this?
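
(By "diagnose" I mean roughly this sort of thing, i.e. listing the stuck
pgs and querying one of them; the pg id below is just a placeholder, not
one from our cluster:)

# ceph health detail
# ceph pg dump_stuck inactive
# ceph pg 0.0 query
# ceph osd tree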

Downgrading back to 3.18.21 hasn't helped.

Each host (now) has 192G RAM. One has 17 osds, the other two have 21 osds
each.

I can see there's traffic going between the osd ports on the various osd
hosts, but it's all small packets (122 or 131 bytes).
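
(For reference, I'm watching the traffic with something like the following;
the interface name is just an example, and 6800-7300 is the default osd
port range as far as I know:)

# tcpdump -nn -i eth0 -c 200 'tcp portrange 6800-7300'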

Just prior to upgrading this osd host, another one had also been upgraded
(RAM + linux). The cluster had no trouble at that point and was healthy
within a few minutes of that server starting up.

The cluster has been working fine for years up to now, having had rolling
upgrades since dumpling.

Cheers,

Chris
_______________________________________________
ceph-users mailing list
ceph-users@xxxxxxxxxxxxxx
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


