Are you sure no service like firewalld is running? Did you check that all
machines have the same MTU, and that jumbo frames are enabled if needed?
I had this problem when I first started with Ceph and forgot to disable
firewalld: replication worked perfectly fine, but the OSD was kicked out
every few seconds. A rough sketch of the checks I mean is at the bottom
of this mail.

Kevin

On Thu, 17 Jan 2019 at 11:57, Johan Thomsen <write@xxxxxxxxxx> wrote:
>
> Hi,
>
> I have a sad Ceph cluster.
> All my OSDs complain about failed replies on heartbeat, like so:
>
> osd.10 635 heartbeat_check: no reply from 192.168.160.237:6810 osd.42
> ever on either front or back, first ping sent 2019-01-16
> 22:26:07.724336 (cutoff 2019-01-16 22:26:08.225353)
>
> I've checked network sanity all I can: all Ceph ports are open between
> nodes on both the public network and the cluster network, and I have no
> problems sending traffic back and forth between nodes.
> I've tried tcpdump'ing, and traffic is passing in both directions
> between the nodes, but unfortunately I don't natively speak the Ceph
> protocol, so I can't figure out what's going wrong in the heartbeat
> conversation.
>
> Still:
>
> # ceph health detail
>
> HEALTH_WARN nodown,noout flag(s) set; Reduced data availability: 1072
> pgs inactive, 1072 pgs peering
> OSDMAP_FLAGS nodown,noout flag(s) set
> PG_AVAILABILITY Reduced data availability: 1072 pgs inactive, 1072 pgs peering
>     pg 7.3cd is stuck inactive for 245901.560813, current state
> creating+peering, last acting [13,41,1]
>     pg 7.3ce is stuck peering for 245901.560813, current state
> creating+peering, last acting [1,40,7]
>     pg 7.3cf is stuck peering for 245901.560813, current state
> creating+peering, last acting [0,42,9]
>     pg 7.3d0 is stuck peering for 245901.560813, current state
> creating+peering, last acting [20,8,38]
>     pg 7.3d1 is stuck peering for 245901.560813, current state
> creating+peering, last acting [10,20,42]
> (....)
>
> I've set "noout" and "nodown" to prevent all OSDs from being removed
> from the cluster. They are all running and marked as "up".
>
> # ceph osd tree
>
> ID  CLASS WEIGHT    TYPE NAME                        STATUS REWEIGHT PRI-AFF
>  -1       249.73434 root default
> -25       166.48956     datacenter m1
> -24        83.24478         pod kube1
> -35        41.62239             rack 10
> -34        41.62239                 host ceph-sto-p102
>  40   hdd   7.27689                     osd.40           up  1.00000 1.00000
>  41   hdd   7.27689                     osd.41           up  1.00000 1.00000
>  42   hdd   7.27689                     osd.42           up  1.00000 1.00000
> (....)
>
> I'm at a point where I don't know which options to tweak or which logs
> to check anymore.
>
> Any debug hint would be very much appreciated.
>
> Btw. I have no important data in the cluster (yet), so if the solution
> is to drop all OSDs and recreate them, that's OK for now. But I'd really
> like to know how the cluster ended up in this state.
>
> /Johan
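
Here is a rough sketch of the checks I mean. These are just generic
examples, not a definitive procedure: they assume systemd with firewalld
and iproute2, the interface name ens192 is a placeholder for your public
and cluster interfaces, and the address/port come from your heartbeat
log line. Adapt to your distribution.

Firewall on each OSD node (should be inactive, or have the Ceph ports opened):

  # systemctl status firewalld
  # firewall-cmd --state

MTU on the relevant interfaces of every node, compared across the cluster:

  # ip link show ens192 | grep mtu
  # cat /sys/class/net/ens192/mtu

If you run jumbo frames (MTU 9000), verify the path actually carries them
end to end; 8972 is 9000 minus the IP/ICMP headers, and -M do forbids
fragmentation:

  # ping -M do -s 8972 192.168.160.237

Reachability of the heartbeat port from the log, tested from another node
(openbsd netcat syntax):

  # nc -zv 192.168.160.237 6810

And to get more detail out of the OSDs themselves, the messenger debug
level can be raised temporarily:

  # ceph tell osd.* injectargs '--debug_ms 1/5'

Once the heartbeats come back, remember you still have the flags set and
will want to clear them with "ceph osd unset nodown" and "ceph osd unset
noout".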