Re: ceph/systemd startup bug (was Re: Some OSDs are down after Server reboot)

On Thu, 28 Sep 2017, Matthew Vernon said:
> Hi,
> 
> TL;DR - the timeout setting in ceph-disk@.service is (far) too small - it
> needs increasing and/or removing entirely. Should I copy this to ceph-devel?

Just a note: it looks like the Debian stretch Luminous packages ship with a 10,000-second timeout:

from /lib/systemd/system/ceph-disk@.service

Environment=CEPH_DISK_TIMEOUT=10000
ExecStart=/bin/sh -c 'timeout $CEPH_DISK_TIMEOUT flock /var/lock/ceph-disk-$(basename %f) /usr/sbin/ceph-disk --verbose --log-stdout trigger --sync %f'
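
On packages like that, the timeout can presumably be changed without touching
ExecStart at all, just by overriding the variable in a drop-in - a minimal
sketch (the drop-in filename and the value are illustrative):

/etc/systemd/system/ceph-disk@.service.d/timeout.conf:

[Service]
Environment=CEPH_DISK_TIMEOUT=20000

followed by a systemctl daemon-reload.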
 

Sean

> On 15/09/17 16:48, Matthew Vernon wrote:
> >On 14/09/17 16:26, Götz Reinicke wrote:
> >>After that, 10 OSDs did not come up like the others. The disks did not get
> >>mounted and the OSD processes did nothing … even after a couple of
> >>minutes, no more disks/OSDs showed up.
> >
> >I'm still digging, but AFAICT it's a race condition in startup - in our
> >case, we're only seeing it if some of the filesystems aren't clean. This
> >may be related to the thread "Very slow start of osds after reboot" from
> >August, but I don't think any conclusion was reached there.
> 
> This annoyed me enough that I went off to find the problem :-)
> 
> On systemd-enabled machines[0] ceph disks are activated by systemd's
> ceph-disk@.service, which calls:
> 
> /bin/sh -c 'timeout 120 flock /var/lock/ceph-disk-$(basename %f)
> /usr/sbin/ceph-disk --verbose --log-stdout trigger --sync %f'
> 
> ceph-disk trigger --sync calls ceph-disk activate which (among other things)
> mounts the osd fs (first in a temporary location, then in /var/lib/ceph/osd/
> once it's extracted the osd number from the fs). If the fs is unclean, XFS
> auto-recovers before mounting (which takes time - range 2-25s for our 6TB
> disks). Importantly, there is a single global lock file[1], so only one
> ceph-disk activate can be doing this at once.
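>
> A toy illustration of what that single lock means (lock path as in [1]; the
> sleep stands in for a slow XFS log recovery - this is just a sketch, not how
> ceph-disk itself takes the lock):
>
>   flock /var/lib/ceph/tmp/ceph-disk.activate.lock sleep 30 &
>   flock /var/lib/ceph/tmp/ceph-disk.activate.lock echo "second activation"
>
> The second command prints nothing until the first releases the lock, which is
> exactly how the per-disk activations end up queueing behind one another.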
> 
> So, each fs is auto-recovering one at a time (rather than in parallel), and
> once the elapsed time gets past 120s, timeout kills the flock, systemd kills
> the cgroup, and no more OSDs start up - we typically find a few fs mounted
> in /var/lib/ceph/tmp/mnt.XXXX. systemd keeps trying to start the remaining
> osds (via ceph-osd@.service), but their fs isn't in the correct place, so
> this never works.
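>
> (One way to spot this state - the exact commands are just for illustration:
>
>   mount | grep /var/lib/ceph/tmp/mnt
>   systemctl list-units --all 'ceph-disk@*'
>
> the first shows the stranded temporary mounts, the second which ceph-disk@
> instances failed or are still running.)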
> 
> The fix/workaround is to increase the timeout: either edit the service file
> directly, or (for style points) drop an override under /etc/systemd/system,
> remembering that you need a blank ExecStart= line before your revised one.
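>
> A minimal sketch of such an override (the drop-in filename and the 10000s
> value are illustrative - pick something that suits your hardware):
>
> /etc/systemd/system/ceph-disk@.service.d/timeout.conf:
>
> [Service]
> ExecStart=
> ExecStart=/bin/sh -c 'timeout 10000 flock /var/lock/ceph-disk-$(basename %f) /usr/sbin/ceph-disk --verbose --log-stdout trigger --sync %f'
>
> followed by a systemctl daemon-reload.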
> 
> Experimenting, one of our storage nodes with 60 6TB disks took 17m35s to
> start all its osds when started up with all fss dirty. So the current 120s
> is far too small (it's just about OK when all the osd fss are clean).
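>
> (For scale: 17m35s over 60 osds is ~17.5s per fs, squarely inside that 2-25s
> recovery range; serialised, the worst case would be 60 x 25s = 25 minutes,
> against a 120s budget.)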
> 
> I think, though, that having the timeout here at all is a bug - if something
> really does need to time out under some circumstances, perhaps that belongs
> at a lower layer?
> 
> A couple of final points/asides, if I may:
> 
> ceph-disk trigger uses subprocess.communicate (via the command() function),
> which means it swallows the log output from ceph-disk activate and only
> emits it after that process finishes. As well as producing confusing
> timestamps, this means that when systemd kills the cgroup, all the output
> from the ceph-disk activate command vanishes into the void - which made
> debugging needlessly hard. Better to let called processes like that output
> immediately?
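>
> Roughly the difference, in simplified and hypothetical form (the real
> command() helper is more involved; this is just a sketch of the two
> behaviours):
>
> import subprocess
>
> def run_buffered(args):
>     # communicate()-style: all output is held in memory and only returned
>     # once the child exits, so it is lost if the cgroup is killed first
>     p = subprocess.Popen(args, stdout=subprocess.PIPE, stderr=subprocess.STDOUT)
>     out, _ = p.communicate()
>     return p.returncode, out
>
> def run_streaming(args):
>     # forward each line as soon as the child writes it, so nothing is
>     # swallowed and log timestamps reflect when things actually happened
>     p = subprocess.Popen(args, stdout=subprocess.PIPE, stderr=subprocess.STDOUT)
>     for line in iter(p.stdout.readline, b''):
>         print(line.decode(errors='replace'), end='', flush=True)
>     return p.wait()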
> 
> Does each fs really need mounting twice? Could the osd number be encoded in
> the partition label or similar instead?
> 
> Is a single global activation lock necessary? It slows startup down quite a
> bit; I see no reason why (at least in the one-osd-per-disk case) you
> couldn't be activating all the osds at once...
> 
> Regards,
> 
> Matthew
> 
> [0] I note, for instance, that /etc/init/ceph-disk.conf doesn't have the
> timeout, so presumably upstart systems aren't affected
> [1] /var/lib/ceph/tmp/ceph-disk.activate.lock at least on Ubuntu
> 
> 
> -- 
> The Wellcome Trust Sanger Institute is operated by Genome Research Limited,
> a charity registered in England with number 1021457 and a company registered
> in England with number 2742969, whose registered office is 215 Euston Road,
> London, NW1 2BE.
_______________________________________________
ceph-users mailing list
ceph-users@xxxxxxxxxxxxxx
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com



