Hi,
TL;DR - the timeout setting in ceph-disk@.service is (far) too small -
it needs increasing and/or removing entirely. Should I copy this to
ceph-devel?
On 15/09/17 16:48, Matthew Vernon wrote:
On 14/09/17 16:26, Götz Reinicke wrote:
After that, 10 OSDs did not come up like the others. The disks did not get
mounted and the OSD processes did nothing … even after a couple of
minutes no more disks/OSDs showed up.
I'm still digging, but AFAICT it's a race condition in startup - in our
case, we're only seeing it if some of the filesystems aren't clean. This
may be related to the thread "Very slow start of osds after reboot" from
August, but I don't think any conclusion was reached there.
This annoyed me enough that I went off to find the problem :-)
On systemd-enabled machines[0] ceph disks are activated by systemd's
ceph-disk@.service, which calls:
/bin/sh -c 'timeout 120 flock /var/lock/ceph-disk-$(basename %f)
/usr/sbin/ceph-disk --verbose --log-stdout trigger --sync %f'
ceph-disk trigger --sync calls ceph-disk activate, which (among other
things) mounts the osd fs (first in a temporary location, then in
/var/lib/ceph/osd/ once it's extracted the osd number from the fs). If
the fs is unclean, XFS auto-recovers before mounting, which takes time -
in the range 2-25s for our 6TB disks. Importantly, there is a single
global lock file[1], so only one ceph-disk activate can be doing this at
once.
So, each fs is auto-recovering one at a time (rather than in parallel),
and once the elapsed time gets past 120s, timeout kills the flock,
systemd kills the cgroup, and no more OSDs start up - we typically find
a few filesystems mounted in /var/lib/ceph/tmp/mnt.XXXX. systemd keeps
trying to start the remaining osds (via ceph-osd@.service), but their
filesystems aren't in the correct place, so this never works.
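To make the failure mode concrete, here's a toy sketch (mine, not
ceph-disk code) of the arithmetic: all the ceph-disk@ units start at
roughly the same time, the global lock serializes them, and any
activation that hasn't finished by 120s of wall-clock time gets killed.
The 17.5s per-fs recovery figure is an illustrative assumption (17m35s
for 60 disks works out to roughly that):

```python
def activations_within_timeout(recovery_times, timeout=120):
    """Return how many osd activations complete before the shared
    wall-clock timeout. Activations run strictly one at a time
    (single global lock), and the per-unit timeouts all start
    together at boot, so they share one clock."""
    elapsed, done = 0.0, 0
    for t in recovery_times:
        elapsed += t
        if elapsed > timeout:
            break  # timeout kills the flock; nothing further activates
        done += 1
    return done

# 60 dirty filesystems at ~17.5s of XFS recovery each (assumed figure):
print(activations_within_timeout([17.5] * 60))  # only 6 osds make it
```

With clean filesystems (mounts taking well under a second each) all 60
fit inside 120s, which is why the bug only bites after an unclean
shutdown.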
The fix/workaround is to adjust the timeout value: either edit the
service file directly, or (for style points) write a drop-in override
under /etc/systemd/system, remembering that you need a blank ExecStart=
line to clear the original command before your revised one.
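For example, an override drop-in might look like this (the 3600s value
is an illustrative assumption, not a recommendation - size it to your
worst-case recovery time, and run systemctl daemon-reload afterwards):

```ini
# /etc/systemd/system/ceph-disk@.service.d/override.conf
[Service]
# The blank ExecStart= clears the unit's original command first.
ExecStart=
ExecStart=/bin/sh -c 'timeout 3600 flock /var/lock/ceph-disk-$(basename %f) /usr/sbin/ceph-disk --verbose --log-stdout trigger --sync %f'
```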
Experimenting, one of our storage nodes with 60 6TB disks took 17m35s to
start all its osds when booted with all filesystems dirty. So the current
120s is far too small (it's only just adequate when all the osd
filesystems are clean).
I think, though, that having the timeout at all is a bug - if something
genuinely needs to time out under some circumstances, shouldn't that
happen at a lower layer?
A couple of final points/asides, if I may:
ceph-disk trigger uses subprocess.communicate (via the command()
function), which means it swallows the log output from ceph-disk
activate and only emits it after that process finishes. As well as
producing confusing timestamps, this means that when systemd kills the
cgroup, all the output from the ceph-disk activate command vanishes into
the void. That made debugging needlessly hard. Would it be better to let
called processes like that output immediately?
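The difference is easy to demonstrate. Below is a hypothetical sketch
(not ceph-disk's actual command() helper, just my assumption of its
shape): communicate() buffers all child output and returns it only on
exit, so if the child is killed mid-run the buffered output is simply
lost; reading line-by-line forwards each line as it is produced, so the
log survives up to the moment of death.

```python
import subprocess
import sys

def run_buffered(cmd):
    # Roughly how a communicate()-based helper behaves (assumption):
    # nothing reaches the log until the child has fully exited.
    p = subprocess.Popen(cmd, stdout=subprocess.PIPE,
                         stderr=subprocess.STDOUT)
    out, _ = p.communicate()
    sys.stdout.write(out.decode())
    return p.returncode

def run_streaming(cmd):
    # Forward each line as the child produces it, so output is not
    # lost if the child (or our whole cgroup) is killed mid-run.
    p = subprocess.Popen(cmd, stdout=subprocess.PIPE,
                         stderr=subprocess.STDOUT)
    for line in iter(p.stdout.readline, b''):
        sys.stdout.write(line.decode())
        sys.stdout.flush()
    return p.wait()
```

The streaming variant also gives timestamps that reflect when each step
actually happened, rather than when the whole child finished.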
Does each fs really need mounting twice? Could the osd number be encoded
in the partition label or similar instead?
Is a single global activation lock necessary? It slows startup down
quite a bit; I see no reason why (at least in the one-osd-per-disk case)
you couldn't activate all the osds at once...
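A toy sketch of the alternative (my assumption, not ceph-disk code, and
the lock path is illustrative): lock per device rather than globally, so
activations of independent osds - and their XFS recoveries - can proceed
in parallel, while two racers for the same device still serialize.

```python
import fcntl
import os

def activate_locked(dev, activate, lockdir='/tmp'):
    """Run activate(dev) under a per-device lock. Only activations of
    the *same* device block each other; different devices proceed in
    parallel. lockdir='/tmp' is illustrative."""
    lockfile = os.path.join(lockdir,
                            'ceph-disk-%s.lock' % os.path.basename(dev))
    with open(lockfile, 'w') as f:
        fcntl.flock(f, fcntl.LOCK_EX)  # held until the file is closed
        return activate(dev)
```

Note that the outer ceph-disk@.service flock is already per-device
(/var/lock/ceph-disk-$(basename %f)); it's only the inner
activation lock that is global.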
Regards,
Matthew
[0] I note, for instance, that /etc/init/ceph-disk.conf doesn't have the
timeout, so presumably upstart systems aren't affected
[1] /var/lib/ceph/tmp/ceph-disk.activate.lock at least on Ubuntu
--
The Wellcome Trust Sanger Institute is operated by Genome Research
Limited, a charity registered in England with number 1021457 and a
company registered in England with number 2742969, whose registered
office is 215 Euston Road, London, NW1 2BE.
_______________________________________________
ceph-users mailing list
ceph-users@xxxxxxxxxxxxxx
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com