Are you sure no service like firewalld is running? Did you check that all
machines have the same MTU, and that jumbo frames are enabled if needed?
I had this problem when I first started with Ceph and forgot to disable
firewalld: replication worked perfectly fine, but the OSD was kicked out
every few seconds. A rough sketch of the checks I mean is at the bottom
of this mail.

Kevin

On Thu, 17 Jan 2019 at 11:57, Johan Thomsen <write@xxxxxxxxxx> wrote:
>
> Hi,
>
> I have a sad Ceph cluster.
> All my OSDs complain about failed replies on heartbeat, like so:
>
> osd.10 635 heartbeat_check: no reply from 192.168.160.237:6810 osd.42
> ever on either front or back, first ping sent 2019-01-16
> 22:26:07.724336 (cutoff 2019-01-16 22:26:08.225353)
>
> I've checked network sanity all I can: all Ceph ports are open between
> nodes on both the public network and the cluster network, and I have no
> problems sending traffic back and forth between nodes.
> I've tried tcpdump'ing, and traffic is passing in both directions
> between the nodes, but unfortunately I don't natively speak the Ceph
> protocol, so I can't figure out what's going wrong in the heartbeat
> conversation.
>
> Still:
>
> # ceph health detail
>
> HEALTH_WARN nodown,noout flag(s) set; Reduced data availability: 1072
> pgs inactive, 1072 pgs peering
> OSDMAP_FLAGS nodown,noout flag(s) set
> PG_AVAILABILITY Reduced data availability: 1072 pgs inactive, 1072 pgs peering
>     pg 7.3cd is stuck inactive for 245901.560813, current state
> creating+peering, last acting [13,41,1]
>     pg 7.3ce is stuck peering for 245901.560813, current state
> creating+peering, last acting [1,40,7]
>     pg 7.3cf is stuck peering for 245901.560813, current state
> creating+peering, last acting [0,42,9]
>     pg 7.3d0 is stuck peering for 245901.560813, current state
> creating+peering, last acting [20,8,38]
>     pg 7.3d1 is stuck peering for 245901.560813, current state
> creating+peering, last acting [10,20,42]
> (....)
>
> I've set "noout" and "nodown" to prevent all OSDs from being removed
> from the cluster. They are all running and marked as "up".
>
> # ceph osd tree
>
> ID  CLASS WEIGHT    TYPE NAME                        STATUS REWEIGHT PRI-AFF
>  -1       249.73434 root default
> -25       166.48956     datacenter m1
> -24        83.24478         pod kube1
> -35        41.62239             rack 10
> -34        41.62239                 host ceph-sto-p102
>  40   hdd   7.27689                     osd.40           up  1.00000 1.00000
>  41   hdd   7.27689                     osd.41           up  1.00000 1.00000
>  42   hdd   7.27689                     osd.42           up  1.00000 1.00000
> (....)
>
> I'm at a point where I don't know which options to tweak or which logs
> to check anymore.
>
> Any debug hint would be very much appreciated.
>
> Btw. I have no important data in the cluster (yet), so if the solution
> is to drop all OSDs and recreate them, that's OK for now. But I'd really
> like to know how the cluster ended up in this state.
>
> /Johan
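
Here is a rough sketch of the checks I mean. These are just generic
examples, not a definitive procedure: they assume systemd with firewalld
and iproute2, the interface name ens192 is a placeholder for your public
and cluster interfaces, and the address/port come from your heartbeat
log line. Adapt to your distribution.

Firewall on each OSD node (should be inactive, or have the Ceph ports opened):

  # systemctl status firewalld
  # firewall-cmd --state

MTU on the relevant interfaces of every node, compared across the cluster:

  # ip link show ens192 | grep mtu
  # cat /sys/class/net/ens192/mtu

If you run jumbo frames (MTU 9000), verify the path actually carries them
end to end; 8972 is 9000 minus the IP/ICMP headers, and -M do forbids
fragmentation:

  # ping -M do -s 8972 192.168.160.237

Reachability of the heartbeat port from the log, tested from another node
(openbsd netcat syntax):

  # nc -zv 192.168.160.237 6810

And to get more detail out of the OSDs themselves, the messenger debug
level can be raised temporarily:

  # ceph tell osd.* injectargs '--debug_ms 1/5'

Once the heartbeats come back, remember you still have the flags set and
will want to clear them with "ceph osd unset nodown" and "ceph osd unset
noout".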