Hi,

I have a sad Ceph cluster. All my OSDs complain about failed replies to heartbeats, like so:

  osd.10 635 heartbeat_check: no reply from 192.168.160.237:6810 osd.42
  ever on either front or back, first ping sent 2019-01-16 22:26:07.724336
  (cutoff 2019-01-16 22:26:08.225353)

I've checked the network sanity as best I can: all Ceph ports are open between the nodes on both the public network and the cluster network, and I have no problem sending traffic back and forth between them. I've also tried tcpdump'ing; traffic is passing in both directions between the nodes, but unfortunately I don't natively speak the Ceph wire protocol, so I can't figure out what's going wrong in the heartbeat conversation.
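To be concrete about what "checked all I can" means, the checks were along these lines (a sketch, not a transcript; the IP and port are taken from the heartbeat error above, and the ping payload sizes assume 9000- and 1500-byte MTUs -- adjust to whatever the NICs actually use):

  # is the heartbeat port from the error reachable over TCP?
  nc -zv 192.168.160.237 6810

  # do full-size frames survive the path? a do-not-fragment ping at
  # near-MTU size catches MTU mismatches that small test packets miss
  ping -M do -s 8972 -c 3 192.168.160.237   # payload sized for a 9000-byte MTU
  ping -M do -s 1472 -c 3 192.168.160.237   # payload sized for a 1500-byte MTU

  # watch the heartbeat conversation itself on the wire
  tcpdump -nn -i any host 192.168.160.237 and port 6810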
Still:

  # ceph health detail
  HEALTH_WARN nodown,noout flag(s) set; Reduced data availability: 1072 pgs inactive, 1072 pgs peering
  OSDMAP_FLAGS nodown,noout flag(s) set
  PG_AVAILABILITY Reduced data availability: 1072 pgs inactive, 1072 pgs peering
      pg 7.3cd is stuck inactive for 245901.560813, current state creating+peering, last acting [13,41,1]
      pg 7.3ce is stuck peering for 245901.560813, current state creating+peering, last acting [1,40,7]
      pg 7.3cf is stuck peering for 245901.560813, current state creating+peering, last acting [0,42,9]
      pg 7.3d0 is stuck peering for 245901.560813, current state creating+peering, last acting [20,8,38]
      pg 7.3d1 is stuck peering for 245901.560813, current state creating+peering, last acting [10,20,42]
  (....)

I've set the "noout" and "nodown" flags to keep the OSDs from being marked down and removed from the cluster (exact commands sketched below, after the tree). They are all running and marked "up":

  # ceph osd tree
  ID  CLASS WEIGHT    TYPE NAME                    STATUS REWEIGHT PRI-AFF
   -1       249.73434 root default
  -25       166.48956     datacenter m1
  -24        83.24478         pod kube1
  -35        41.62239             rack 10
  -34        41.62239                 host ceph-sto-p102
   40   hdd   7.27689                     osd.40       up  1.00000 1.00000
   41   hdd   7.27689                     osd.41       up  1.00000 1.00000
   42   hdd   7.27689                     osd.42       up  1.00000 1.00000
  (....)
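For completeness, the flags were set with the standard CLI, i.e.:

  # keep OSDs in the map while debugging
  ceph osd set noout
  ceph osd set nodown

  # verify
  ceph osd dump | grep flags

  # to be undone once the cluster is healthy again
  ceph osd unset noout
  ceph osd unset nodown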
I'm at the point where I don't know which options to tweak or which logs to check any more. Any debug hint would be very much appreciated.
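The only further knob I know of is log verbosity; unless someone has a better idea, my next step will be something along these lines (a sketch -- osd.10 is picked arbitrarily, and the debug levels are examples, not recommendations):

  # raise messenger and OSD logging on one complaining OSD
  ceph tell osd.10 injectargs '--debug_ms 1 --debug_osd 10'

  # wait for the next heartbeat_check complaint, then read the OSD log
  # (default location) on the host that carries osd.10:
  #   /var/log/ceph/ceph-osd.10.log

  # put the levels back toward the defaults afterwards
  ceph tell osd.10 injectargs '--debug_ms 0/5 --debug_osd 1/5'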
Btw, I have no important data in the cluster (yet), so if the solution is to drop all the OSDs and recreate them, that's OK for now. But I'd really like to know how the cluster ended up in this state.

/Johan

_______________________________________________
ceph-users mailing list
ceph-users@xxxxxxxxxxxxxx
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com