On Thu, 28 Sep 2017, Matthew Vernon said:

> Hi,
>
> TL;DR - the timeout setting in ceph-disk@.service is (far) too small - it
> needs increasing and/or removing entirely. Should I copy this to ceph-devel?

Just a note: it looks like the Debian stretch luminous packages already ship a
10,000-second timeout. From /lib/systemd/system/ceph-disk@.service:

Environment=CEPH_DISK_TIMEOUT=10000
ExecStart=/bin/sh -c 'timeout $CEPH_DISK_TIMEOUT flock /var/lock/ceph-disk-$(basename %f) /usr/sbin/ceph-disk --verbose --log-stdout trigger --sync %f'

Sean
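P.S. For packages that still ship the hard-coded 120s timeout, a drop-in
override along these lines should raise it. This is a sketch only: the
drop-in path is illustrative, and the ExecStart line should be copied from
your own unit file with just the timeout changed (here it reuses the stock
command Matthew quotes below).

# /etc/systemd/system/ceph-disk@.service.d/timeout.conf  (illustrative path)
[Service]
# The empty ExecStart= clears the packaged command before replacing it.
ExecStart=
ExecStart=/bin/sh -c 'timeout 10000 flock /var/lock/ceph-disk-$(basename %f) /usr/sbin/ceph-disk --verbose --log-stdout trigger --sync %f'

Then run systemctl daemon-reload so systemd picks up the override.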
> On 15/09/17 16:48, Matthew Vernon wrote:
> >On 14/09/17 16:26, Götz Reinicke wrote:
> >>After that, 10 OSDs did not come up like the others. The disks did not get
> >>mounted and the OSD processes did nothing … even after a couple of
> >>minutes no more disks/OSDs showed up.
> >
> >I'm still digging, but AFAICT it's a race condition in startup - in our
> >case, we're only seeing it if some of the filesystems aren't clean. This
> >may be related to the thread "Very slow start of osds after reboot" from
> >August, but I don't think any conclusion was reached there.
>
> This annoyed me enough that I went off to find the problem :-)
>
> On systemd-enabled machines[0] ceph disks are activated by systemd's
> ceph-disk@.service, which calls:
>
> /bin/sh -c 'timeout 120 flock /var/lock/ceph-disk-$(basename %f)
> /usr/sbin/ceph-disk --verbose --log-stdout trigger --sync %f'
>
> ceph-disk trigger --sync calls ceph-disk activate, which (among other
> things) mounts the OSD filesystem (first in a temporary location, then in
> /var/lib/ceph/osd/ once it has extracted the OSD number from the
> filesystem). If the filesystem is unclean, XFS auto-recovers before
> mounting, which takes time - in the range 2-25s for our 6TB disks.
> Importantly, there is a single global lock file[1], so only one ceph-disk
> activate can be doing this at once.
>
> So, each filesystem is auto-recovering one at a time (rather than in
> parallel), and once the elapsed time gets past 120s, timeout kills the
> flock, systemd kills the cgroup, and no more OSDs start up - we typically
> find a few filesystems mounted in /var/lib/ceph/tmp/mnt.XXXX. systemd keeps
> trying to start the remaining OSDs (via ceph-osd@.service), but their
> filesystems aren't in the correct place, so this never works.
>
> The fix/workaround is to adjust the timeout value (edit the service file
> directly, or for style points write an override in /etc/systemd/system,
> remembering you need a blank ExecStart line before your revised one).
>
> Experimenting, one of our storage nodes with 60 6TB disks took 17m35s to
> start all its OSDs when started up with all filesystems dirty. So the
> current 120s is far too small (it's just about OK when all the OSD
> filesystems are clean).
>
> I think, though, that having the timeout at all is a bug - if something
> needs to time out under some circumstances, should it be at a lower layer,
> perhaps?
>
> A couple of final points/asides, if I may:
>
> ceph-disk trigger uses subprocess.communicate (via the command() function),
> which means it swallows the log output from ceph-disk activate and only
> emits it after that process finishes. As well as producing confusing
> timestamps, this means that when systemd kills the cgroup, all the output
> from the ceph-disk activate command vanishes into the void. That made
> debugging needlessly hard. Better to let called processes like that output
> immediately?
>
> Does each filesystem need mounting twice? Could the OSD number be encoded
> in the partition label or similar instead?
>
> Is a single global activation lock necessary? It slows startup down quite a
> bit; I see no reason why (at least in the one-osd-per-disk case) you
> couldn't be activating all the OSDs at once...
>
> Regards,
>
> Matthew
>
> [0] I note, for instance, that /etc/init/ceph-disk.conf doesn't have the
> timeout, so presumably upstart systems aren't affected
> [1] /var/lib/ceph/tmp/ceph-disk.activate.lock at least on Ubuntu
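One more aside, on the subprocess.communicate() point above: streaming the
child's output as it is produced would give sensible timestamps and mean the
logs survive even when systemd kills the cgroup. A rough sketch of the idea
in Python (illustrative only - this is not ceph-disk's actual command()
helper, and run_streaming is a made-up name):

import subprocess
import sys

def run_streaming(cmd):
    # Echo the child's output line by line as it appears, instead of
    # buffering it all with communicate() and printing it at the end.
    proc = subprocess.Popen(cmd, stdout=subprocess.PIPE,
                            stderr=subprocess.STDOUT)
    for line in proc.stdout:
        sys.stdout.write(line.decode())
        sys.stdout.flush()
    return proc.wait()

# e.g. run_streaming(['/usr/sbin/ceph-disk', '--verbose', 'activate', '/dev/sdb1'])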