This looks similar to https://bugzilla.redhat.com/show_bug.cgi?id=1458007
or one of the bugs/trackers attached to that.

On Thu, Sep 28, 2017 at 11:14 PM, Sean Purdy <s.purdy@xxxxxxxxxxxxxxxx> wrote:
> On Thu, 28 Sep 2017, Matthew Vernon said:
>> Hi,
>>
>> TL;DR - the timeout setting in ceph-disk@.service is (far) too small - it
>> needs increasing and/or removing entirely. Should I copy this to ceph-devel?
>
> Just a note. Looks like Debian stretch luminous packages have a 10,000 second timeout:
>
> from /lib/systemd/system/ceph-disk@.service
>
> Environment=CEPH_DISK_TIMEOUT=10000
> ExecStart=/bin/sh -c 'timeout $CEPH_DISK_TIMEOUT flock /var/lock/ceph-disk-$(basename %f) /usr/sbin/ceph-disk --verbose --log-stdout trigger --sync %f'
>
>
> Sean
>
>> On 15/09/17 16:48, Matthew Vernon wrote:
>> >On 14/09/17 16:26, Götz Reinicke wrote:
>> >>After that, 10 OSDs did not come up like the others. The disks did not get
>> >>mounted and the OSD processes did nothing … even after a couple of
>> >>minutes no more disks/OSDs showed up.
>> >
>> >I'm still digging, but AFAICT it's a race condition in startup - in our
>> >case, we're only seeing it if some of the filesystems aren't clean. This
>> >may be related to the thread "Very slow start of osds after reboot" from
>> >August, but I don't think any conclusion was reached there.
>>
>> This annoyed me enough that I went off to find the problem :-)
>>
>> On systemd-enabled machines[0], ceph disks are activated by systemd's
>> ceph-disk@.service, which calls:
>>
>> /bin/sh -c 'timeout 120 flock /var/lock/ceph-disk-$(basename %f)
>> /usr/sbin/ceph-disk --verbose --log-stdout trigger --sync %f'
>>
>> ceph-disk trigger --sync calls ceph-disk activate, which (among other
>> things) mounts the osd fs (first in a temporary location, then in
>> /var/lib/ceph/osd/ once it's extracted the osd number from the fs). If the
>> fs is unclean, XFS auto-recovers before mounting, which takes time -
>> 2-25s for our 6TB disks. Importantly, there is a single global lock
>> file[1], so only one ceph-disk activate can be doing this at once.
>>
>> So, each fs is auto-recovering one at a time (rather than in parallel), and
>> once the elapsed time gets past 120s, timeout kills the flock, systemd kills
>> the cgroup, and no more OSDs start up - we typically find a few filesystems
>> mounted in /var/lib/ceph/tmp/mnt.XXXX. systemd keeps trying to start the
>> remaining osds (via ceph-osd@.service), but their fs isn't in the correct
>> place, so this never works.
>>
>> The fix/workaround is to adjust the timeout value (edit the service file
>> directly, or for style points write an override in /etc/systemd/system,
>> remembering that you need a blank ExecStart line before your revised one).
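>>
>> For concreteness, a drop-in along these lines should do it (just a sketch:
>> the drop-in file name is arbitrary, and 10000s is simply a generously
>> large value - size it to your hardware):
>>
>> # /etc/systemd/system/ceph-disk@.service.d/timeout.conf
>> [Service]
>> # the blank ExecStart= clears the command shipped in the packaged unit
>> ExecStart=
>> ExecStart=/bin/sh -c 'timeout 10000 flock /var/lock/ceph-disk-$(basename %f) /usr/sbin/ceph-disk --verbose --log-stdout trigger --sync %f'
>>
>> followed by a systemctl daemon-reload.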
>>
>> Experimenting, one of our storage nodes with 60 6TB disks took 17m35s to
>> start all its osds when started up with all filesystems dirty. So the
>> current 120s is far too small (it's just about OK when all the osd
>> filesystems are clean).
>>
>> I think, though, that having the timeout at all is a bug - if something
>> needs to time out under some circumstances, should it be at a lower layer,
>> perhaps?
>>
>> A couple of final points/asides, if I may:
>>
>> ceph-disk trigger uses subprocess.communicate (via the command() function),
>> which means it swallows the log output from ceph-disk activate (and only
>> outputs it after that process finishes) - as well as producing confusing
>> timestamps, this means that when systemd kills the cgroup, all the output
>> from the ceph-disk activate command vanishes into the void. That made
>> debugging needlessly hard. Better to let called processes like that output
>> immediately?
>>
>> Does each fs need mounting twice? Could the osd number be encoded in the
>> partition label or similar instead?
>>
>> Is a single global activation lock necessary? It slows startup down quite a
>> bit; I see no reason why (at least in the one-osd-per-disk case) you
>> couldn't be activating all the osds at once...
>>
>> Regards,
>>
>> Matthew
>>
>> [0] I note, for instance, that /etc/init/ceph-disk.conf doesn't have the
>> timeout, so presumably upstart systems aren't affected
>> [1] /var/lib/ceph/tmp/ceph-disk.activate.lock, at least on Ubuntu
>>
>> --
>> The Wellcome Trust Sanger Institute is operated by Genome Research Limited,
>> a charity registered in England with number 1021457 and a company registered
>> in England with number 2742969, whose registered office is 215 Euston Road,
>> London, NW1 2BE.

--
Cheers,
Brad

_______________________________________________
ceph-users mailing list
ceph-users@xxxxxxxxxxxxxx
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com