Answering myself, for whomever may be interested: after some strace and a closer look at the logs, I realized that the cluster still knew the old fsids for my redeployed OSDs, because I had not run 'ceph osd rm' on them before re-adding them to the cluster. So the fact is that ceph does not update the fsid of a redeployed OSD, even after the old entry has been removed from the crushmap; you need to 'rm' it as well. A sketch of the removal/redeploy sequence that worked for us follows below the quoted message.

Regards,
Kostis

On 15 June 2016 at 17:14, Kostis Fardelas <dante1234@xxxxxxxxx> wrote:
> Hello,
> in the process of redeploying some OSDs in our cluster, after destroying
> one of them (down, out, remove from crushmap) and trying to redeploy it
> (crush add, start), we reach a state where the OSD gets stuck in the
> booting state:
>
> root@staging-rd0-02:~# ceph daemon osd.12 status
> { "cluster_fsid": "XXXXXXXXXXX",
>   "osd_fsid": "XXXXXXXXXXXXXX",
>   "whoami": 12,
>   "state": "booting",
>   "oldest_map": 150201,
>   "newest_map": 150779,
>   "num_pgs": 0}
>
> No flags that could prevent the OSD from coming up are in place. The OSD
> never gets marked up in 'ceph osd tree' and never gets in. If I mark it
> in manually, it gets marked out again after a while. The cluster OSD map
> keeps moving forward, but the OSD never catches up, of course. I started
> the OSD with debugging options:
>
> debug osd = 20
> debug filestore = 20
> debug journal = 20
> debug monc = 20
> debug ms = 1
>
> and what I see is a continuing stream of OSD log lines of this kind:
>
> 2016-06-15 16:39:33.876339 7f0256b61700 10 osd.12 150798 do_waiters -- start
> 2016-06-15 16:39:33.876343 7f0256b61700 10 osd.12 150798 do_waiters -- finish
> 2016-06-15 16:39:34.390560 7f022e2ee700 20 osd.12 150798 update_osd_stat osd_stat(59384 kB used, 558 GB avail, 558 GB total, peers []/[] op hist [])
> 2016-06-15 16:39:34.390622 7f022e2ee700 5 osd.12 150798 heartbeat: osd_stat(59384 kB used, 558 GB avail, 558 GB total, peers []/[] op hist [])
> 2016-06-15 16:39:34.876526 7f0256b61700 5 osd.12 150798 tick
> 2016-06-15 16:39:34.876561 7f0256b61700 10 osd.12 150798 do_waiters -- start
> 2016-06-15 16:39:34.876565 7f0256b61700 10 osd.12 150798 do_waiters -- finish
> 2016-06-15 16:39:35.876729 7f0256b61700 5 osd.12 150798 tick
> 2016-06-15 16:39:35.876762 7f0256b61700 10 osd.12 150798 do_waiters -- start
> 2016-06-15 16:39:35.876766 7f0256b61700 10 osd.12 150798 do_waiters -- finish
> 2016-06-15 16:39:36.646355 7f025535e700 20 filestore(/rados/staging-rd0-02-12) sync_entry woke after 30.000161
> 2016-06-15 16:39:36.646421 7f025535e700 20 filestore(/rados/staging-rd0-02-12) sync_entry waiting for max_interval 30.000000
> 2016-06-15 16:39:36.876917 7f0256b61700 5 osd.12 150798 tick
> 2016-06-15 16:39:36.876949 7f0256b61700 10 osd.12 150798 do_waiters -- start
> 2016-06-15 16:39:36.876953 7f0256b61700 10 osd.12 150798 do_waiters -- finish
> 2016-06-15 16:39:37.877112 7f0256b61700 5 osd.12 150798 tick
> 2016-06-15 16:39:37.877142 7f0256b61700 10 osd.12 150798 do_waiters -- start
> 2016-06-15 16:39:37.877147 7f0256b61700 10 osd.12 150798 do_waiters -- finish
> 2016-06-15 16:39:38.877298 7f0256b61700 5 osd.12 150798 tick
> 2016-06-15 16:39:38.877327 7f0256b61700 10 osd.12 150798 do_waiters -- start
> 2016-06-15 16:39:38.877331 7f0256b61700 10 osd.12 150798 do_waiters -- finish
>
> Is there a solution for this problem? Known bug? We are on firefly
> (0.80.11) and wanted to do some maintenance before going to hammer,
> but now we are somewhat stuck.
>
> Best regards,
> Kostis
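
For reference, a minimal sketch of the destroy/redeploy sequence that avoids the stale-fsid state described above. It assumes osd.12 on host staging-rd0-02 with the default data directory under /var/lib/ceph/osd/ (our cluster actually mounts it under /rados/) and sysvinit service scripts; adjust ids, paths, crush weight and the service commands to your setup.

# take the OSD out of the data distribution and stop the daemon
ceph osd out 12
service ceph stop osd.12                # or your init system's equivalent

# remove every trace of the old OSD, including the 'osd rm' step we had skipped
ceph osd crush remove osd.12
ceph auth del osd.12
ceph osd rm 12                          # without this, the cluster keeps the old osd fsid

# recreate and re-add it
ceph osd create                         # returns a free osd id (12 again if it was freed)
ceph-osd -i 12 --mkfs --mkkey
ceph auth add osd.12 osd 'allow *' mon 'allow rwx' -i /var/lib/ceph/osd/ceph-12/keyring
ceph osd crush add osd.12 0.55 host=staging-rd0-02
service ceph start osd.12

The 0.55 crush weight is just a placeholder; use whatever weight the disk had before it was removed.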