Hi,
TL;DR - the timeout setting in ceph-disk@.service is (far) too small -
it needs increasing and/or removing entirely. Should I copy this to
ceph-devel?
On 15/09/17 16:48, Matthew Vernon wrote:
On 14/09/17 16:26, Götz Reinicke wrote:
After that, 10 OSDs did not come up like the others. The disks did not get
mounted and the OSD processes did nothing … even after a couple of
minutes no more disks/OSDs showed up.
I'm still digging, but AFAICT it's a race condition in startup - in our
case, we're only seeing it if some of the filesystems aren't clean. This
may be related to the thread "Very slow start of osds after reboot" from
August, but I don't think any conclusion was reached there.
This annoyed me enough that I went off to find the problem :-)
On systemd-enabled machines[0] ceph disks are activated by systemd's
ceph-disk@.service, which calls:
/bin/sh -c 'timeout 120 flock /var/lock/ceph-disk-$(basename %f)
/usr/sbin/ceph-disk --verbose --log-stdout trigger --sync %f'
ceph-disk trigger --sync calls ceph-disk activate, which (among other
things) mounts the osd fs (first in a temporary location, then in
/var/lib/ceph/osd/ once it's extracted the osd number from the fs). If
the fs is unclean, XFS auto-recovers before mounting, which takes time -
in the range 2-25s for our 6TB disks. Importantly, there is a single
global lock file[1], so only one ceph-disk activate can be doing this at
once.
So, each fs is auto-recovering one at a time (rather than in parallel),
and once the elapsed time gets past 120s, timeout kills the flock,
systemd kills the cgroup, and no more OSDs start up - we typically find
a few filesystems mounted in /var/lib/ceph/tmp/mnt.XXXX. systemd keeps
trying to start the remaining osds (via ceph-osd@.service), but their
filesystems aren't in the correct place, so this never works.
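To make the failure mode concrete, here's a toy sketch (mine, not
ceph-disk code) of the arithmetic: all the ceph-disk@ units start at
roughly the same time, the global lock serializes them, and any
activation that hasn't finished by 120s of wall-clock time gets killed.
The 17.5s per-fs recovery figure is an illustrative assumption (17m35s
for 60 disks works out to roughly that):

```python
def activations_within_timeout(recovery_times, timeout=120):
    """Return how many osd activations complete before the shared
    wall-clock timeout. Activations run strictly one at a time
    (single global lock), and the per-unit timeouts all start
    together at boot, so they share one clock."""
    elapsed, done = 0.0, 0
    for t in recovery_times:
        elapsed += t
        if elapsed > timeout:
            break  # timeout kills the flock; nothing further activates
        done += 1
    return done

# 60 dirty filesystems at ~17.5s of XFS recovery each (assumed figure):
print(activations_within_timeout([17.5] * 60))  # only 6 osds make it
```

With clean filesystems (mounts taking well under a second each) all 60
fit inside 120s, which is why the bug only bites after an unclean
shutdown.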
The fix/workaround is to adjust the timeout value: either edit the
service file directly, or (for style points) write a drop-in override
under /etc/systemd/system, remembering that you need a blank ExecStart=
line to clear the original command before your revised one.
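For example, an override drop-in might look like this (the 3600s value
is an illustrative assumption, not a recommendation - size it to your
worst-case recovery time, and run systemctl daemon-reload afterwards):

```ini
# /etc/systemd/system/ceph-disk@.service.d/override.conf
[Service]
# The blank ExecStart= clears the unit's original command first.
ExecStart=
ExecStart=/bin/sh -c 'timeout 3600 flock /var/lock/ceph-disk-$(basename %f) /usr/sbin/ceph-disk --verbose --log-stdout trigger --sync %f'
```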
Experimenting, one of our storage nodes with 60 6TB disks took 17m35s to
start all its osds when booted with all filesystems dirty. So the current
120s is far too small (it's only just adequate when all the osd
filesystems are clean).
I think, though, that having the timeout at all is a bug - if something
genuinely needs to time out under some circumstances, shouldn't that
happen at a lower layer?
A couple of final points/asides, if I may:
ceph-disk trigger uses subprocess.communicate (via the command()
function), which means it swallows the log output from ceph-disk
activate and only emits it after that process finishes. As well as
producing confusing timestamps, this means that when systemd kills the
cgroup, all the output from the ceph-disk activate command vanishes into
the void. That made debugging needlessly hard. Would it be better to let
called processes like that output immediately?
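The difference is easy to demonstrate. Below is a hypothetical sketch
(not ceph-disk's actual command() helper, just my assumption of its
shape): communicate() buffers all child output and returns it only on
exit, so if the child is killed mid-run the buffered output is simply
lost; reading line-by-line forwards each line as it is produced, so the
log survives up to the moment of death.

```python
import subprocess
import sys

def run_buffered(cmd):
    # Roughly how a communicate()-based helper behaves (assumption):
    # nothing reaches the log until the child has fully exited.
    p = subprocess.Popen(cmd, stdout=subprocess.PIPE,
                         stderr=subprocess.STDOUT)
    out, _ = p.communicate()
    sys.stdout.write(out.decode())
    return p.returncode

def run_streaming(cmd):
    # Forward each line as the child produces it, so output is not
    # lost if the child (or our whole cgroup) is killed mid-run.
    p = subprocess.Popen(cmd, stdout=subprocess.PIPE,
                         stderr=subprocess.STDOUT)
    for line in iter(p.stdout.readline, b''):
        sys.stdout.write(line.decode())
        sys.stdout.flush()
    return p.wait()
```

The streaming variant also gives timestamps that reflect when each step
actually happened, rather than when the whole child finished.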
Does each fs really need mounting twice? Could the osd number be encoded
in the partition label or similar instead?
Is a single global activation lock necessary? It slows startup down
quite a bit; I see no reason why (at least in the one-osd-per-disk case)
you couldn't activate all the osds at once...
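A toy sketch of the alternative (my assumption, not ceph-disk code, and
the lock path is illustrative): lock per device rather than globally, so
activations of independent osds - and their XFS recoveries - can proceed
in parallel, while two racers for the same device still serialize.

```python
import fcntl
import os

def activate_locked(dev, activate, lockdir='/tmp'):
    """Run activate(dev) under a per-device lock. Only activations of
    the *same* device block each other; different devices proceed in
    parallel. lockdir='/tmp' is illustrative."""
    lockfile = os.path.join(lockdir,
                            'ceph-disk-%s.lock' % os.path.basename(dev))
    with open(lockfile, 'w') as f:
        fcntl.flock(f, fcntl.LOCK_EX)  # held until the file is closed
        return activate(dev)
```

Note that the outer ceph-disk@.service flock is already per-device
(/var/lock/ceph-disk-$(basename %f)); it's only the inner
activation lock that is global.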
Regards,
Matthew
[0] I note, for instance, that /etc/init/ceph-disk.conf doesn't have the
timeout, so presumably upstart systems aren't affected
[1] /var/lib/ceph/tmp/ceph-disk.activate.lock at least on Ubuntu
--
The Wellcome Trust Sanger Institute is operated by Genome Research
Limited, a charity registered in England with number 1021457 and a
company registered in England with number 2742969, whose registered
office is 215 Euston Road, London, NW1 2BE.
_______________________________________________
ceph-users mailing list
ceph-users@xxxxxxxxxxxxxx
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com