OSDs stuck in booting state after redeploying

Hello,
in the process of redeploying some OSDs in our cluster, after
destroying one of them (down, out, remove from the CRUSH map) and
trying to redeploy it (crush add, start), we reach a state where the
OSD gets stuck in the booting state:
root@staging-rd0-02:~# ceph daemon osd.12 status
{ "cluster_fsid": "XXXXXXXXXXX",
  "osd_fsid": "XXXXXXXXXXXXXX",
  "whoami": 12,
  "state": "booting",
  "oldest_map": 150201,
  "newest_map": 150779,
  "num_pgs": 0}

No flags that could prevent the OSD from coming up are in place. The
OSD never gets marked as up in 'ceph osd tree' and never gets in. If I
try to mark it in manually, it gets marked out again after a while
(the exact commands are sketched below, after the log excerpt). The
cluster's OSD map keeps moving forward, but the OSD of course cannot
catch up. I started the OSD with these debugging options:
debug osd = 20
debug filestore = 20
debug journal = 20
debug monc = 20
debug ms = 1
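
For what it's worth, the same debug levels can also be injected at
runtime instead of via ceph.conf, roughly:

ceph tell osd.12 injectargs '--debug-osd 20 --debug-filestore 20 --debug-journal 20 --debug-monc 20 --debug-ms 1'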

What I see is a continuous stream of OSD log entries of this kind:
2016-06-15 16:39:33.876339 7f0256b61700 10 osd.12 150798 do_waiters -- start
2016-06-15 16:39:33.876343 7f0256b61700 10 osd.12 150798 do_waiters -- finish
2016-06-15 16:39:34.390560 7f022e2ee700 20 osd.12 150798 update_osd_stat osd_stat(59384 kB used, 558 GB avail, 558 GB total, peers []/[] op hist [])
2016-06-15 16:39:34.390622 7f022e2ee700  5 osd.12 150798 heartbeat: osd_stat(59384 kB used, 558 GB avail, 558 GB total, peers []/[] op hist [])
2016-06-15 16:39:34.876526 7f0256b61700  5 osd.12 150798 tick
2016-06-15 16:39:34.876561 7f0256b61700 10 osd.12 150798 do_waiters -- start
2016-06-15 16:39:34.876565 7f0256b61700 10 osd.12 150798 do_waiters -- finish
2016-06-15 16:39:35.876729 7f0256b61700  5 osd.12 150798 tick
2016-06-15 16:39:35.876762 7f0256b61700 10 osd.12 150798 do_waiters -- start
2016-06-15 16:39:35.876766 7f0256b61700 10 osd.12 150798 do_waiters -- finish
2016-06-15 16:39:36.646355 7f025535e700 20 filestore(/rados/staging-rd0-02-12) sync_entry woke after 30.000161
2016-06-15 16:39:36.646421 7f025535e700 20 filestore(/rados/staging-rd0-02-12) sync_entry waiting for max_interval 30.000000
2016-06-15 16:39:36.876917 7f0256b61700  5 osd.12 150798 tick
2016-06-15 16:39:36.876949 7f0256b61700 10 osd.12 150798 do_waiters -- start
2016-06-15 16:39:36.876953 7f0256b61700 10 osd.12 150798 do_waiters -- finish
2016-06-15 16:39:37.877112 7f0256b61700  5 osd.12 150798 tick
2016-06-15 16:39:37.877142 7f0256b61700 10 osd.12 150798 do_waiters -- start
2016-06-15 16:39:37.877147 7f0256b61700 10 osd.12 150798 do_waiters -- finish
2016-06-15 16:39:38.877298 7f0256b61700  5 osd.12 150798 tick
2016-06-15 16:39:38.877327 7f0256b61700 10 osd.12 150798 do_waiters -- start
2016-06-15 16:39:38.877331 7f0256b61700 10 osd.12 150798 do_waiters -- finish
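
For completeness, these are roughly the commands used to check for
blocking flags and to try getting the OSD back in manually:

ceph osd dump | grep flags        # no noup/noin/noout/nodown flags set
ceph osd tree | grep osd.12       # stays 'down', never 'up'
ceph osd in osd.12                # gets marked in, but drops out again after a while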

Is there a solution to this problem? Is it a known bug? We are on
firefly (0.80.11) and wanted to do some maintenance before upgrading
to hammer, but now we are somewhat stuck.

Best regards,
Kostis
_______________________________________________
ceph-users mailing list
ceph-users@xxxxxxxxxxxxxx
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


