OSDs stuck in booting state after redeploying

Hello,
in the process of redeploying some OSDs in our cluster, after
destroying one of them (down, out, remove from the CRUSH map) and
trying to redeploy it (crush add, start), we reach a state where the
OSD gets stuck in the booting state:
root@staging-rd0-02:~# ceph daemon osd.12 status
{ "cluster_fsid": "XXXXXXXXXXX",
  "osd_fsid": "XXXXXXXXXXXXXX",
  "whoami": 12,
  "state": "booting",
  "oldest_map": 150201,
  "newest_map": 150779,
  "num_pgs": 0}

No flags that could prevent the OSD from coming up are in place. The
OSD never gets marked as up in 'ceph osd tree' and never gets in. If I
try to mark it in manually, it gets marked out again after a while
(the exact commands are sketched below, after the log excerpt). The
cluster's OSD map keeps moving forward, but the OSD of course cannot
catch up. I started the OSD with these debugging options:
debug osd = 20
debug filestore = 20
debug journal = 20
debug monc = 20
debug ms = 1
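
For what it's worth, the same debug levels can also be injected at
runtime instead of via ceph.conf, roughly:

ceph tell osd.12 injectargs '--debug-osd 20 --debug-filestore 20 --debug-journal 20 --debug-monc 20 --debug-ms 1'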

What I see is a continuous stream of OSD log entries of this kind:
2016-06-15 16:39:33.876339 7f0256b61700 10 osd.12 150798 do_waiters -- start
2016-06-15 16:39:33.876343 7f0256b61700 10 osd.12 150798 do_waiters -- finish
2016-06-15 16:39:34.390560 7f022e2ee700 20 osd.12 150798 update_osd_stat osd_stat(59384 kB used, 558 GB avail, 558 GB total, peers []/[] op hist [])
2016-06-15 16:39:34.390622 7f022e2ee700  5 osd.12 150798 heartbeat: osd_stat(59384 kB used, 558 GB avail, 558 GB total, peers []/[] op hist [])
2016-06-15 16:39:34.876526 7f0256b61700  5 osd.12 150798 tick
2016-06-15 16:39:34.876561 7f0256b61700 10 osd.12 150798 do_waiters -- start
2016-06-15 16:39:34.876565 7f0256b61700 10 osd.12 150798 do_waiters -- finish
2016-06-15 16:39:35.876729 7f0256b61700  5 osd.12 150798 tick
2016-06-15 16:39:35.876762 7f0256b61700 10 osd.12 150798 do_waiters -- start
2016-06-15 16:39:35.876766 7f0256b61700 10 osd.12 150798 do_waiters -- finish
2016-06-15 16:39:36.646355 7f025535e700 20 filestore(/rados/staging-rd0-02-12) sync_entry woke after 30.000161
2016-06-15 16:39:36.646421 7f025535e700 20 filestore(/rados/staging-rd0-02-12) sync_entry waiting for max_interval 30.000000
2016-06-15 16:39:36.876917 7f0256b61700  5 osd.12 150798 tick
2016-06-15 16:39:36.876949 7f0256b61700 10 osd.12 150798 do_waiters -- start
2016-06-15 16:39:36.876953 7f0256b61700 10 osd.12 150798 do_waiters -- finish
2016-06-15 16:39:37.877112 7f0256b61700  5 osd.12 150798 tick
2016-06-15 16:39:37.877142 7f0256b61700 10 osd.12 150798 do_waiters -- start
2016-06-15 16:39:37.877147 7f0256b61700 10 osd.12 150798 do_waiters -- finish
2016-06-15 16:39:38.877298 7f0256b61700  5 osd.12 150798 tick
2016-06-15 16:39:38.877327 7f0256b61700 10 osd.12 150798 do_waiters -- start
2016-06-15 16:39:38.877331 7f0256b61700 10 osd.12 150798 do_waiters -- finish
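
For completeness, these are roughly the commands used to check for
blocking flags and to try getting the OSD back in manually:

ceph osd dump | grep flags        # no noup/noin/noout/nodown flags set
ceph osd tree | grep osd.12       # stays 'down', never 'up'
ceph osd in osd.12                # gets marked in, but drops out again after a while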

Is there a solution to this problem? Is it a known bug? We are on
firefly (0.80.11) and wanted to do some maintenance before upgrading
to hammer, but now we are somewhat stuck.

Best regards,
Kostis
_______________________________________________
ceph-users mailing list
ceph-users@xxxxxxxxxxxxxx
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


