Running Ceph 12.2.2 on CentOS 7.4. The cluster was healthy until a command caused all of the monitors to crash. I applied a private build that fixes the issue (thanks!): https://tracker.ceph.com/issues/22847

The monitors have all started and all of the OSDs are reported as up in "ceph -s", but the OSDs themselves report their state as "booting", so none of the PGs have recovered. (Please see the attached OSD debug log; it seems to be looping through STATE_ACCEPTING_WAIT_BANNER_ADDR.)

ceph -s

  cluster:
    id:     021a1428-fea5-4697-bcd0-a45c1c2ca80b
    health: HEALTH_WARN
            Reduced data availability: 10240 pgs inactive, 3 pgs down, 4195 pgs peering

  services:
    mon: 5 daemons, quorum dl1-kaf101,dl1-kaf201,dl1-kaf301,dl1-kaf302,dl1-kaf401
    mgr: dl1-kaf101(active)
    osd: 64 osds: 64 up, 64 in; 100 remapped pgs

  data:
    pools:   3 pools, 10240 pgs
    objects: 94810 objects, 366 GB
    usage:   2376 GB used, 515 TB / 518 TB avail
    pgs:     59.004% pgs unknown
             40.996% pgs not active
             6042 unknown
             4195 peering
             3    down

OSD output:

  ceph --admin-daemon /var/run/ceph/dl1approd-osd.3.asok status
  {
      "cluster_fsid": "021a1428-fea5-4697-bcd0-a45c1c2ca80b",
      "osd_fsid": "63d816a2-beb3-4b94-8f34-62fa1ffc32ce",
      "whoami": 3,
      "state": "booting",
      "oldest_map": 133275,
      "newest_map": 133997,
      "num_pgs": 439
  }

Config:

  [global]
  debug ms = 5/5
  debug heartbeatmap = 5/5
  mon osd down out interval = 30
  mon osd min down reports = 2
  osd heartbeat grace = 35
  osd mon heartbeat interval = 20
  osd mon report interval max = 30
  osd mon ack timeout = 15
  fsid = 021a1428-fea5-4697-bcd0-a45c1c2ca80b
  auth cluster required = cephx
  auth service required = cephx
  auth client required = cephx
  mon osd allow primary affinity = true
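In case it helps, this is roughly how the per-OSD state can be checked across a host (a minimal sketch; it assumes the admin sockets follow the dl1approd-osd.N.asok naming shown above and that jq is available to pull out the "state" field):

  for sock in /var/run/ceph/dl1approd-osd.*.asok; do
      # query each local OSD daemon over its admin socket and keep only "state"
      state=$(ceph --admin-daemon "$sock" status | jq -r '.state')
      echo "$sock: $state"
  done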
--
Efficiency is Intelligent Laziness

Attachment: osd.log