Hi,

I have a sad Ceph cluster. All my OSDs complain about failed replies to heartbeats, like so:

  osd.10 635 heartbeat_check: no reply from 192.168.160.237:6810 osd.42
  ever on either front or back, first ping sent 2019-01-16 22:26:07.724336
  (cutoff 2019-01-16 22:26:08.225353)

I've checked the network sanity as best I can: all Ceph ports are open between the nodes on both the public network and the cluster network, and I have no problem sending traffic back and forth between them. I've also tried tcpdump'ing; traffic is passing in both directions between the nodes, but unfortunately I don't natively speak the Ceph wire protocol, so I can't figure out what's going wrong in the heartbeat conversation.
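To be concrete about what "checked all I can" means, the checks were along these lines (a sketch, not a transcript; the IP and port are taken from the heartbeat error above, and the ping payload sizes assume 9000- and 1500-byte MTUs -- adjust to whatever the NICs actually use):

  # is the heartbeat port from the error reachable over TCP?
  nc -zv 192.168.160.237 6810

  # do full-size frames survive the path? a do-not-fragment ping at
  # near-MTU size catches MTU mismatches that small test packets miss
  ping -M do -s 8972 -c 3 192.168.160.237   # payload sized for a 9000-byte MTU
  ping -M do -s 1472 -c 3 192.168.160.237   # payload sized for a 1500-byte MTU

  # watch the heartbeat conversation itself on the wire
  tcpdump -nn -i any host 192.168.160.237 and port 6810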
Still:

  # ceph health detail
  HEALTH_WARN nodown,noout flag(s) set; Reduced data availability: 1072 pgs inactive, 1072 pgs peering
  OSDMAP_FLAGS nodown,noout flag(s) set
  PG_AVAILABILITY Reduced data availability: 1072 pgs inactive, 1072 pgs peering
      pg 7.3cd is stuck inactive for 245901.560813, current state creating+peering, last acting [13,41,1]
      pg 7.3ce is stuck peering for 245901.560813, current state creating+peering, last acting [1,40,7]
      pg 7.3cf is stuck peering for 245901.560813, current state creating+peering, last acting [0,42,9]
      pg 7.3d0 is stuck peering for 245901.560813, current state creating+peering, last acting [20,8,38]
      pg 7.3d1 is stuck peering for 245901.560813, current state creating+peering, last acting [10,20,42]
  (....)

I've set the "noout" and "nodown" flags to keep the OSDs from being marked down and removed from the cluster (exact commands sketched below, after the tree). They are all running and marked "up":

  # ceph osd tree
  ID  CLASS WEIGHT    TYPE NAME                    STATUS REWEIGHT PRI-AFF
   -1       249.73434 root default
  -25       166.48956     datacenter m1
  -24        83.24478         pod kube1
  -35        41.62239             rack 10
  -34        41.62239                 host ceph-sto-p102
   40   hdd   7.27689                     osd.40       up  1.00000 1.00000
   41   hdd   7.27689                     osd.41       up  1.00000 1.00000
   42   hdd   7.27689                     osd.42       up  1.00000 1.00000
  (....)
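For completeness, the flags were set with the standard CLI, i.e.:

  # keep OSDs in the map while debugging
  ceph osd set noout
  ceph osd set nodown

  # verify
  ceph osd dump | grep flags

  # to be undone once the cluster is healthy again
  ceph osd unset noout
  ceph osd unset nodown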
I'm at the point where I don't know which options to tweak or which logs to check any more. Any debug hint would be very much appreciated.
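The only further knob I know of is log verbosity; unless someone has a better idea, my next step will be something along these lines (a sketch -- osd.10 is picked arbitrarily, and the debug levels are examples, not recommendations):

  # raise messenger and OSD logging on one complaining OSD
  ceph tell osd.10 injectargs '--debug_ms 1 --debug_osd 10'

  # wait for the next heartbeat_check complaint, then read the OSD log
  # (default location) on the host that carries osd.10:
  #   /var/log/ceph/ceph-osd.10.log

  # put the levels back toward the defaults afterwards
  ceph tell osd.10 injectargs '--debug_ms 0/5 --debug_osd 1/5'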
Btw, I have no important data in the cluster (yet), so if the solution is to drop all the OSDs and recreate them, that's OK for now. But I'd really like to know how the cluster ended up in this state.

/Johan

_______________________________________________
ceph-users mailing list
ceph-users@xxxxxxxxxxxxxx
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com