Re: Some OSDs are down after Server reboot

We're running journals on NVMe as well, on SLES.
 
Before rebooting, try deleting the links here:
 /etc/systemd/system/ceph-osd.target.wants/

If we delete them first, the node boots OK.
If we don't, the disks sometimes don't come up and we have to run `ceph-disk activate-all`.
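For reference, a rough sketch of what that looks like on our nodes (a minimal example; the glob on the unit links and the assumption that they get recreated at activation may not match your layout, adjust as needed):

 # remove the per-OSD unit links before the reboot (assumption: they get
 # recreated once the OSDs are activated again)
 rm /etc/systemd/system/ceph-osd.target.wants/ceph-osd@*.service
 systemctl daemon-reload
 reboot
 # after boot, if any OSDs still didn't come up:
 ceph-disk activate-all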

HTH
 
Thanks Joe

>>> David Turner <drakonstein@xxxxxxxxx> 9/15/2017 9:54 AM >>>
I have this issue with my NVMe OSDs, but not my HDD OSDs. I have 15 HDDs and 2 NVMes in each host. We put most of the journals on one of the NVMes and a few on the second, but added a small OSD partition to the second NVMe for RGW metadata pools.

When restarting a server manually for testing, the NVMe OSD comes back up normally. We're tracking a problem with the OSD nodes freezing and having to force reboot them. After this, the NVMe OSD doesn't come back on its own until I run `ceph-disk activate-all`. This seems to track with your theory that a non-clean FS is a part of the equation.

Are there any ideas on how to resolve this yet? So far, being able to run `ceph-disk activate-all` is good enough, but it's a bit of a nuisance.
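
For anyone hitting the same thing, the manual recovery after one of these forced reboots is roughly the following (just a sketch of what we run by hand, nothing beyond `ceph osd tree` and `ceph-disk activate-all`):

 # see which OSDs are still marked down after the reboot
 ceph osd tree | grep down
 # mount and start the OSD partitions that weren't activated at boot
 ceph-disk activate-all
 # confirm they rejoined
 ceph osd tree | grep down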

On Fri, Sep 15, 2017 at 11:48 AM Matthew Vernon <mv3@xxxxxxxxxxxx> wrote:
Hi,

On 14/09/17 16:26, Götz Reinicke wrote:

> maybe someone has a hint: I do have a Ceph cluster (6 nodes, 144
> OSDs), CentOS 7.3, ceph 10.2.7.
>
> I did a kernel update to the most recent CentOS 7.3 one on a node and did a
> reboot.
>
> After that, 10 OSDs did not come up like the others. The disks did not get
> mounted and the OSD processes did nothing … even after a couple of
> minutes, no more disks/OSDs showed up.
>
> So I did a ceph-disk activate-all.
>
> And all the missing OSDs came back online.
>
> Question: any hints on debugging why the disks did not come online after
> the reboot?

We've been seeing this on our Ubuntu / Jewel cluster, after we upgraded
from ceph 10.2.3 / kernel 4.4.0-62 to ceph 10.2.7 / kernel 4.4.0-93.

I'm still digging, but AFAICT it's a race condition in startup - in our
case, we're only seeing it if some of the filesystems aren't clean. This
may be related to the thread "Very slow start of osds after reboot" from
August, but I don't think any conclusion was reached there.
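
In case it helps anyone else reproduce it, these are roughly the checks I've
been using to correlate the failures with unclean filesystems (a sketch only;
it assumes the stock ceph-disk@/ceph-osd@ systemd units and XFS-backed OSDs):

 # activation logs from the current boot for the ceph-disk / ceph-osd units
 journalctl -b -u 'ceph-disk@*' -u 'ceph-osd@*'
 # XFS log recovery messages mean the filesystem wasn't cleanly unmounted
 dmesg | grep -i 'xfs.*recovery'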

Regards,

Matthew


--
The Wellcome Trust Sanger Institute is operated by Genome Research
Limited, a charity registered in England with number 1021457 and a
company registered in England with number 2742969, whose registered
office is 215 Euston Road, London, NW1 2BE.
_______________________________________________
ceph-users mailing list
ceph-users@xxxxxxxxxxxxxx
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
