Re: Some OSDs are down after Server reboot


 



Hi,

On 14/09/17 16:26, Götz Reinicke wrote:

> maybe someone has a hint: I have a Ceph cluster (6 nodes, 144
> OSDs), CentOS 7.3, Ceph 10.2.7.
> 
> I did a kernel update to the most recent CentOS 7.3 one on a node and
> rebooted.
> 
> After that, 10 OSDs did not come up like the others. The disks did not
> get mounted and the OSD processes did nothing … even after a couple of
> minutes, no more disks/OSDs showed up.
> 
> So I did a ceph-disk activate-all.
> 
> And all missing OSDs got back online.
> 
> Questions: Any hints on debugging why the disks did not come online
> after the reboot?

We've been seeing this on our Ubuntu / Jewel cluster, after we upgraded
from ceph 10.2.3 / kernel 4.4.0-62 to ceph 10.2.7 / kernel 4.4.0-93.

I'm still digging, but AFAICT it's a race condition in startup - in our
case, we're only seeing it if some of the filesystems aren't clean. This
may be related to the thread "Very slow start of osds after reboot" from
August, but I don't think any conclusion was reached there.
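For anyone trying to narrow this down, a few commands that may show where startup stalled on a systemd-based Jewel node (a sketch only; unit and device names will differ per cluster, and output obviously depends on your setup):

```shell
# Which ceph-osd units failed to start, and why:
systemctl --failed | grep ceph
journalctl -b -u 'ceph-osd@*' --no-pager | tail -n 50

# What ceph-disk thinks each data partition is (prepared / active / unmounted):
ceph-disk list

# Whether the udev-triggered activation ran at boot for the affected disks:
journalctl -b | grep -i ceph-disk

# Manually activate anything udev missed (what worked in the report above):
ceph-disk activate-all
```

If the journal shows activation racing with device probing or fsck on unclean filesystems, that would fit the race-condition theory above.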

Regards,

Matthew


-- 
 The Wellcome Trust Sanger Institute is operated by Genome Research 
 Limited, a charity registered in England with number 1021457 and a 
 company registered in England with number 2742969, whose registered 
 office is 215 Euston Road, London, NW1 2BE. 
_______________________________________________
ceph-users mailing list
ceph-users@xxxxxxxxxxxxxx
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com



